Data validation reference¶
Overview¶
In dbt, every data validation test is a SQL query meant to select rows that fail the test. A successful test will return no results. The query can be parametrized so it can be reused across multiple tables. dbt includes a few built-in data test definitions, and the dbt_expectations package provides many more. We also define our own custom data tests.
Todo
We should chat about the minimum level of data validation that we expect for a table.
All dbt tests should pass meaningfully on both the Full ETL and Fast ETL outputs, so that we have a chance of catching data issues in CI before the nightly builds. The one exception to this is the row count checks – they only apply to the Full ETL outputs.
All tables should have row count checks, unless they have a non-deterministic number of rows. Tables with non-deterministic row counts should have their length checked with expect_table_row_count_to_be_between
There should be no entirely null columns in the Full ETL outputs. For tables that contain deprecated columns with no data in the recent years processed by the Fast ETL, per-year nullness expectations should be added. See the
pudl.scripts.pudl_null_colsscript.What else?
Common workflows¶
Typical workflow: tweaking an asset, remateralize it in Dagster, re-run the data validations that pertain to just that table.
Often also useful to rematerialize the changed table and all of its downstream dependencies, and then run the data validations on all of those downstream dependencies to see if there were any unforeseen consequences.
PUDL Specific Design Choices¶
Our usage of dbt is slightly unusual, since we rely on Dagster to coordinate our data pipeline, and are only using dbt for data validation. Some quirks of our setup to be aware of:
From dbt’s point of view, the PUDL tables are all sources – external tables about which it knows very little other than the table and column names. It assumes the tables will be available, rather than trying to create them. In a typical dbt project, most tables would be defined as models which are somewhat analogous to Dagster assets.
As a SQL-based tool, dbt generally expects to be querying a database. However, in our case the tables are stored as Apache Parquet files, which we query with SQL via DuckDB. This means some of dbt’s functionality is not available. For example, we can’t use the dbt adapter object in our test definitions because it relies on being able to access the underlying database schema.
One place we use true dbt models instead of sources is when we define intermediate tables to simplify test definitions. See Creating intermediate tables for a test. These intermediate tables are created as materialized views in a DuckDB database at
$PUDL_OUTPUT/pudl_dbt_tests.duckdb. In this case, the underlying database schema will be accessible to dbt. Additionally, any time you need to refer to those tables while debugging, you’ll need to be connected to that database.
Branch builds¶
Depending on your computer, running the full ETL locally can be extremely time consuming and may run into memory limits. It’s also easy to accidentally end up with local outputs that are the result of code from multiple different branches, and so may not be consistent with each other. If you’re only altering a few tables, rematerializing them in Dagster and then running the specific dbt tests that apply to them and any tables downstream of them should work fine.
Kicking off a branch build¶
When we’re doing big quarterly or annual updates, and dozens or hundreds of tables are changing simultaneously, it is helpful to be able to run the full ETL from scratch, run all of the data validation tests against the outputs, and use the results to update the test parameters (especially expected row counts) appropriately. This can be done by manually kicking off a PUDL deployment on your branch.
To initiate a branch build, in the PUDL repo on GitHub go to Actions and select build-deploy-pudl.
On the right hand side select Run Workflow and then select your branch in the dropdown
and click the Run Workflow button. Shortly thereafter you should see a notification in
the pudl-deployments channel in our Slack saying that the build has kicked off. It
should take about 3 hours to complete. You can track its progress and watch the logs in
the Google Cloud Console.
Getting fresh row counts from a branch build¶
To catch unexpected changes to the data, we keep track of the expected number of rows in
each data table we distribute. These expectations are stored in
dbt/seeds/etl_full_row_counts.csv and they can be updated using the dbt_helper
script based on the observed row counts in your local PUDL Parquet outputs. If you can’t
run the full ETL locally, the nightly builds / branch build also generate updated row
count expectations. After a branch build completes, you can download the updated
etl_full_row_counts.csv file from the build outputs that are uploaded to
gs://builds.catalyst.coop/<build-id>/etl_full_row_counts.csv See the
Nightly Data Builds documentation for more details on accessing the nightly build
outputs. Replace the etl_full_row_counts.csv in your local PUDL git repo with the
one you’ve downloaded and use git diff to see what has changed. Make sure to review
the row count changes closely to see if there’s anything unexpected.
Running dbt directly¶
dbt has its own much more extensive documentation. PUDL uses only a small subset of its features.
To run all of the data validation tests, from within the dbt/ directory run:
dbt build
For more fine-grained control, you can use the --select option to run only the tests
defined for a particular table, or all instances of a particular test no matter what
table it’s associated with. Or you can combine the two to run just a particular test
on a particular table. Some examples:
# Run all tests defined for the out_eia__monthly_generators table
dbt build --select "source:pudl.out_eia__monthly_generators"
# Run all instances of the expect_columns_not_all_null test
dbt build --select "test_name:expect_columns_not_all_null"
# Run expect_columns_not_all_null test on the out_eia__monthly_generators table only
dbt build --select "test_name:expect_columns_not_all_null,source:pudl.out_eia__monthly_generators"
# Use a wildcard "*" to run all tests on tables whose names start with out_eia923__
dbt build --select "source:pudl.out_eia923__*"
Similarly, you can exclude individual tables or tests using --exclude. One case
where this is useful is running the data validation tests against the outputs of the
fast ETL. We do not store expected row-counts for the fast ETL outputs, and so generally
expect the row-count checks to fail. To run all of the data validation tests except for
the row counts and avoid seeing all those spurious failures you could run:
dbt build --exclude "test_name:check_row_counts_per_partition"
For more options, see the dbt selection syntax documentation.
Note
The dbt tests can be run in parallel to speed them up with the --threads argument
but this sometimes results in spurious errors like “too many files open” which are
not related to the data being tested.
Note
There are a handful of data validation tests that have been implemented using
Dagster’s asset checks.
Typically these tests weren’t well suited to SQL, weren’t performance bottlenecks,
and had already been implemented in Python. E.g. pudl.validate.no_null_rows().
Data validation in our integration tests¶
The dbt data tests are invoked by pytest as part of our integration tests. This
means they run as part of our continuous integration (CI) checks before a PR can be
merged into main. However, the CI only processes 1-2 years of data, so when the
tests run in CI, they’re only checking a small subset of the data we publish. We also
don’t run the row count checks in CI, since the fast ETL outputs are more changeable
and less informative than those in the nightly builds.
This means that when you’re developing a new table or updating an existing table, it’s
important to manually run the dbt tests on the new data in its entirety before the
changes are merged into main.
If the data validations fail in the pytest integration tests, they should produce
helpful output indicating what failed and why, in the same way as dbt_helper
validate
Debugging data validation failures¶
So, you’ve run the data validations, but one or more of them has failed. Now what?
We’ll go over some general strategies first, then look at the two most common failures: row counts and quantiles.
General strategies and tools¶
Use dbt_helper validate¶
If you habitually use dbt, try using dbt_helper validate instead – it
will print out additional context information to help with debugging failures.
Example: validation ``expect_columns_not_all_null`` on table ``out_eia__yearly_generators``
If you run this validation in dbt, it tells you:
the test failed
there was 1 failure row
the compiled SQL query for the test is at
target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
& that’s it:
$ dbt build --select source:pudl.out_eia__yearly_generators,test_name:expect_columns_not_all_null
20:38:56 Running with dbt=1.10.6
20:38:57 Registered adapter: duckdb=1.9.4
20:38:57 Unable to do partial parsing because config vars, config profile, or config target have changed
20:39:00 Found 2 models, 700 data tests, 1 seed, 240 sources, 850 macros
20:39:00
20:39:00 Concurrency: 1 threads (target='etl-full')
20:39:00
20:39:00 1 of 1 START test source_expect_columns_not_all_null_pudl_out_eia__yearly_generators_False__EXTRACT_year_FROM_report_date_2008__[...]EXTRACT_year_FROM_report_date_2009 [RUN]
20:39:01 1 of 1 FAIL 1 source_expect_columns_not_all_null_pudl_out_eia__yearly_generators_False__EXTRACT_year_FROM_report_date_2008__[...]EXTRACT_year_FROM_report_date_2009 [FAIL 1 in 0.15s]
20:39:01
20:39:01 Finished running 1 test in 0 hours 0 minutes and 0.21 seconds (0.21s).
20:39:01
20:39:01 Completed with 1 error, 0 partial successes, and 0 warnings:
20:39:01
20:39:01 Failure in test source_expect_columns_not_all_null_pudl_out_eia__yearly_generators_False__EXTRACT_year_FROM_report_date_2008__[...]EXTRACT_year_FROM_report_date_2009 (models/eia/out_eia__yearly_generators/schema.yml)
20:39:01 Got 1 result, configured to fail if != 0
20:39:01
20:39:01 compiled code at target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
20:39:01
20:39:01 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 NO-OP=0 TOTAL=1
It doesn’t tell you what the failure row was; you’d have to run the compiled query yourself to figure that out (see below for details on what that means and how to do it).
If you run this validation in dbt_helper, it shows you the dbt output, but
it also runs the compiled SQL query and gives you the results:
$ dbt_helper validate --select source:pudl.out_eia__yearly_generators,test_name:expect_columns_not_all_null
[...]
20:37:49 Finished running 1 test in 0 hours 0 minutes and 0.17 seconds (0.17s).
20:37:49
20:37:49 Completed with 1 error, 0 partial successes, and 0 warnings:
20:37:49
20:37:49 Failure in test source_expect_columns_not_all_null_pudl_out_eia__yearly_generators_False__EXTRACT_year_FROM_report_date_2008__[...]EXTRACT_year_FROM_report_date_2009 (models/eia/out_eia__yearly_generators/schema.yml)
20:37:49 Got 1 result, configured to fail if != 0
20:37:49
20:37:49 compiled code at target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
20:37:49
20:37:49 Done. PASS=0 WARN=0 ERROR=1 SKIP=0 NO-OP=0 TOTAL=1
Traceback (most recent call last):
[...]
AssertionError: failure contexts:
source_expect_columns_not_all_null_pudl_out_eia__yearly_generators_False__EXTRACT_year_FROM_report_date_2008__[...]EXTRACT_year_FROM_report_date_2009:
| table_name | failing_column | failure_reason | row_condition | total_rows_matching_condition | non_null_count |
|:---------------------------|:-----------------|:---------------------------------------|:--------------------------------------|--------------------------------:|-----------------:|
| out_eia__yearly_generators | unit_id_pudl | Conditional check failed: EXTRACT(year | EXTRACT(year FROM report_date) < 2008 | 136918 | 0 |
| | | FROM report_date) < 2008 | | | |
This saves you a step. Most times, this is enough to figure out what has gone wrong, and you never need to look at the compiled SQL query at all.
Inspect the SQL query for the test and run it yourself¶
Some tests have very long failure row output that dbt_helper
doesn’t handle very well. To debug those failures, you will need to
run the SQL yourself, and explore duckdb’s output formatting options to display
the results legibly.
Some tests have terrible failure row output that doesn’t tell you anything useful. To debug those failures, you will need to bust into the SQL to pull out enough information to figure out what went wrong.
Both of these cases require you to touch the compiled SQL query for the test directly.
dbt gives you a path for the “compiled code” for the failing test, in a line that looks like this:
20:37:49 compiled code at target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
“Compiled” is important here because the source code for each test is merely
a template. The template cannot directly be used to query the database. Each
instance of a test is configured in the schema.yml for the table being tested.
dbt compiles the SQL query for that test by filling in the template values using
information from the test config. It saves the resulting SQL query to a new file
in the target/compiled directory. This is the compiled query.
Running this query in duckdb will generate the failure row output. There are two ways to run the query.
Run once using the shell¶
You can run duckdb against the test database, and input the compiled code path
using <.
If you are in the pudl working directory, you may need to add dbt/ to
the front of the compiled code path. Like this:
$ duckdb $PUDL_OUTPUT/pudl_dbt_tests.duckdb <dbt/target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
┌────────────────────────────┬────────────────┬─────────────────────────────────────────────────────────────────┬───────────────────────────────────────┬───────────────────────────────┬────────────────┐
│ table_name │ failing_column │ failure_reason │ row_condition │ total_rows_matching_condition │ non_null_count │
│ varchar │ varchar │ varchar │ varchar │ int64 │ int64 │
├────────────────────────────┼────────────────┼─────────────────────────────────────────────────────────────────┼───────────────────────────────────────┼───────────────────────────────┼────────────────┤
│ out_eia__yearly_generators │ unit_id_pudl │ Conditional check failed: EXTRACT(year FROM report_date) < 2008 │ EXTRACT(year FROM report_date) < 2008 │ 136918 │ 0 │
└────────────────────────────┴────────────────┴─────────────────────────────────────────────────────────────────┴───────────────────────────────────────┴───────────────────────────────┴────────────────┘
The advantage of this approach is that it is very quick, and it immediately returns you to a shell.
Variation: change output modes
To transpose very wide output, consider setting .mode. Using the -cmd
argument to duckdb will execute a command before processing input provided using
<. Like this:
$ duckdb -cmd '.mode line' $PUDL_OUTPUT/pudl_dbt_tests.duckdb <dbt/target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
table_name = out_eia__yearly_generators
failing_column = unit_id_pudl
failure_reason = Conditional check failed: EXTRACT(year FROM report_date) < 2008
row_condition = EXTRACT(year FROM report_date) < 2008
total_rows_matching_condition = 136918
non_null_count = 0
There are lots of output format modes available; hopefully one of them will be legible for your failure row output!
Run inside a duckdb session¶
You can open a duckdb session against the test database, and input the compiled
code path using the duckdb .read command. Like this:
$ duckdb $PUDL_OUTPUT/pudl_dbt_tests.duckdb
v1.2.0 5f5512b827
Enter ".help" for usage hints.
D .read dbt/target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_columns_not_all__790ceaac9ad08187ce2e9323e6b58961.sql
┌────────────────────────────┬────────────────┬─────────────────────────────────────────────────────────────────┬───────────────────────────────────────┬───────────────────────────────┬────────────────┐
│ table_name │ failing_column │ failure_reason │ row_condition │ total_rows_matching_condition │ non_null_count │
│ varchar │ varchar │ varchar │ varchar │ int64 │ int64 │
├────────────────────────────┼────────────────┼─────────────────────────────────────────────────────────────────┼───────────────────────────────────────┼───────────────────────────────┼────────────────┤
│ out_eia__yearly_generators │ unit_id_pudl │ Conditional check failed: EXTRACT(year FROM report_date) < 2008 │ EXTRACT(year FROM report_date) < 2008 │ 136918 │ 0 │
└────────────────────────────┴────────────────┴─────────────────────────────────────────────────────────────────┴───────────────────────────────────────┴───────────────────────────────┴────────────────┘
You can type other SQL queries and duckdb commands at the duckdb prompt as well.
The advantage of this approach is that you are at a database prompt, and can immediately run other queries to narrow down what has gone wrong.
The disadvantage of this approach is that you have to remember to quit (CTRL-D
or .quit) before you can run more dbt commands. duckdb does not like having
multiple programs accessing the database simultaneously.
Dealing with terrible failure row output¶
If the failure row output for a test says something useless like “false” with no other identifying information, you’ll need to actually read the SQL query and adapt some portion of it to give you the context you need.
Such as:
$ duckdb $PUDL_OUTPUT/pudl_dbt_tests.duckdb <target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/dbt_expectations_source_expect_33dc33ad0a260e896f11f41b4422dda8.sql
┌─────────────┐
│ expression │
│ boolean │
├─────────────┤
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ · │
│ · │
│ · │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
│ false │
├─────────────┤
│ 570499 rows │
│ (20 shown) │
└─────────────┘
The only thing this tells us is that there are 570,499 rows that failed the test. We need to know which rows they are in order to debug further.
Opening dbt_expectations_source_expect_33dc33ad0a260e896f11f41b4422dda8.sql
in a pager or text editor yields the following:
with grouped_expression as (
select
unit_id_pudl is not null as expression
from '/Users/catalyst/pudl_output/parquet/out_eia__yearly_generators.parquet'
),
validation_errors as (
select
*
from
grouped_expression
where
not(expression = true)
)
select *
from validation_errors
Here we’d have several options:
Add a few primary key columns to the
grouped_expressiontable, so that they pop out in the finalselectAdapt the inner
selectfromgrouped_expressionto work on its ownGrab only the parquet path and put it in a custom query like
select plant_id_eia, generator_id, report_date from {parquet path} where unit_id_pudl is null
The query you build using any of the above could be copied and pasted into a duckdb session, and the results interrogated further from there.
Debugging and fixing row count failures¶
Row count checks fail when the local data has a different number of rows in it than dbt expected.
There are two cases to consider:
you expect the local data to have a new number of rows
you expect the local data to have the same number of rows as before
If you expect the local data to have a new number of rows -
you’re adding a new year of data,
you’re changing how the data is filtered or dropped,
etc.
Then it’s a good thing if the row count check fails at first.
That means you have some new rows!
First, we can use dbt_helper to count up the new row counts,
and see how they differ from the old ones.
$ dbt_helper update-tables <TABLE_NAME> --row-counts --clobber # we use --clobber so the changes are actually written to disk
$ git diff # to see the difference
You may see that a row count for a partition has been added:
diff --git a/dbt/seeds/etl_full_row_counts.csv b/dbt/seeds/etl_full_row_counts.csv
index d9a5f0ec7..2b40f3ad7 100644
--- a/dbt/seeds/etl_full_row_counts.csv
+++ b/dbt/seeds/etl_full_row_counts.csv
@@ -3318,7 +3318,7 @@ out_ferc1__yearly_steam_plants_fuel_sched402,2020,1250
out_ferc1__yearly_steam_plants_fuel_sched402,2021,1152
out_ferc1__yearly_steam_plants_fuel_sched402,2022,1196
out_ferc1__yearly_steam_plants_fuel_sched402,2023,1210
+out_ferc1__yearly_steam_plants_fuel_sched402,2024,1221
out_ferc1__yearly_steam_plants_sched402,1994,1411
out_ferc1__yearly_steam_plants_sched402,1995,1448
out_ferc1__yearly_steam_plants_sched402,1996,1395
This is what you’d expect if you were adding a new year of data. In this case, we want to double-check if that is a reasonable number of rows for a new partition. This will be different for each table, but consider these heuristic questions:
How does this compare to previous years? Do I expect there to be more rows, fewer rows, or about the same number?
Have I made changes to how the data is filtered or merged with other data?
Did other partitions change, too, or did I just see a new partition?
If any of those questions raise alarm bells, you should probably look at the actual output data in a notebook. If, after investigation, you’re sure the row counts are correct, commit the changes to the expected row counts.
You might also get a row count test failure when you don’t expect it. That will probably look like a change in row count for one or more partitions:
diff --git a/dbt/seeds/etl_full_row_counts.csv b/dbt/seeds/etl_full_row_counts.csv
index d9a5f0ec7..2b40f3ad7 100644
--- a/dbt/seeds/etl_full_row_counts.csv
+++ b/dbt/seeds/etl_full_row_counts.csv
@@ -3318,7 +3318,7 @@ out_ferc1__yearly_steam_plants_fuel_sched402,2020,1250
out_ferc1__yearly_steam_plants_fuel_sched402,2021,1152
out_ferc1__yearly_steam_plants_fuel_sched402,2022,1196
-out_ferc1__yearly_steam_plants_fuel_sched402,2023,1210
+out_ferc1__yearly_steam_plants_fuel_sched402,2023,1215
-out_ferc1__yearly_steam_plants_fuel_sched402,2024,1224
+out_ferc1__yearly_steam_plants_fuel_sched402,2024,1221
out_ferc1__yearly_steam_plants_sched402,1994,1411
out_ferc1__yearly_steam_plants_sched402,1995,1448
out_ferc1__yearly_steam_plants_sched402,1996,1395
If you don’t expect your code to have caused these row-count changes, it’s time to investigate. There’s likely a bug somewhere. Investigation will be different for each table, but here are some ideas to get you started:
Check that everything is up to date - do you have the latest changes from
mainin your branch? Did you re-materialize your asset using fresh upstream data?Compare your local data to the data in the last nightly build: Using the primary key, merge the two tables and see what rows are in both, what rows are only in the nightly build, and what rows are only in your local build.
Once you understand why the row counts are different, either fix the bug or commit the new expected row counts.
Debugging quantile checks¶
Run the quantile check by selecting the table you want to check.
If you want to check all the tables, you can instead select all the quantile checks.
Note that we’re using --select to use dbt selection syntax,
not --asset-select for Dagster selection syntax.
$ dbt_helper validate --select "test_name:expect_quantile_constraints"
In this example, we’re running quantile checks for out_eia__yearly_generators.
$ dbt_helper validate --select "source:pudl_dbt.pudl.out_eia__yearly_generators"
[...]
18:39:46 Finished running 24 data tests in 0 hours 0 minutes and 1.01 seconds (1.01s).
18:39:46
18:39:46 Completed with 1 error, 0 partial successes, and 0 warnings:
18:39:46
18:39:46 Failure in test source_expect_quantile_constraints_pudl_out_eia__yearly_generators_capacity_factor___quantile_0_65_min_value_0_5_max_value_0_6____quantile_0_15_min_value_0_005____quantile_0_95_max_value_0_95___fuel_type_code_pudl_gas_and_report_date_CAST_2015_01_01_AS_DATE_and_capacity_factor_0_0__capacity_mw (models/eia/out_eia__yearly_generators/schema.yml)
18:39:46 Got 1 result, configured to fail if != 0
18:39:46
18:39:46 compiled code at target/compiled/pudl_dbt/models/eia/out_eia__yearly_generators/schema.yml/source_expect_quantile_constra_392a2df5d1590fb6bc46821e0b879c86.sql
18:39:46
18:39:46 Done. PASS=23 WARN=0 ERROR=1 SKIP=0 NO-OP=0 TOTAL=24
Traceback (most recent call last):
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/bin/dbt_helper", line 7, in <module>
sys.exit(dbt_helper())
~~~~~~~~~~^^
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/lib/python3.13/site-packages/click/core.py", line 1161, in __call__
return self.main(*args, **kwargs)
~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/lib/python3.13/site-packages/click/core.py", line 1082, in main
rv = self.invoke(ctx)
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/lib/python3.13/site-packages/click/core.py", line 1697, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/lib/python3.13/site-packages/click/core.py", line 1443, in invoke
return ctx.invoke(self.callback, **ctx.params)
~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/catalyst/bin/miniforge3/envs/pudl-dev/lib/python3.13/site-packages/click/core.py", line 788, in invoke
return __callback(*args, **kwargs)
File "/Users/catalyst/Documents/work/catalyst/pudl/src/pudl/scripts/dbt_helper.py", line 673, in validate
raise AssertionError(
f"failure contexts:\n{test_result.format_failure_contexts()}"
)
AssertionError: failure contexts:
source_expect_quantile_constraints_pudl_out_eia__yearly_generators_capacity_factor___quantile_0_65_min_value_0_5_max_value_0_6____quantile_0_15_min_value_0_005____quantile_0_95_max_value_0_95___fuel_type_code_pudl_gas_and_report_date_CAST_2015_01_01_AS_DATE_and_capacity_factor_0_0__capacity_mw:
table: source.pudl_dbt.pudl.out_eia__yearly_generators
test: expect_quantile_constraints
column: capacity_factor
row_condition: fuel_type_code_pudl='gas' and report_date>=CAST('2015-01-01' AS DATE) and capacity_factor<>0.0
weight column: capacity_mw
description: Historical note, EIA natural gas reporting really only becomes usable in 2015.
quantile | value | min | max
0.65 | 0.4638245 | 0.500 | 0.60
0.15 | 0.0246494 | 0.005 | None
0.95 | 0.7754576 | None | 0.95
In this example, quantile 0.65 was expected to be between 0.5 and 0.6, but was instead 0.46, outside of the expected range.
Locate the quantile check in the table’s schema.yml file.
This will be at dbt/models/<data_source>/<table_name>/schema.yml.
Find the column name and the row condition in the failure output.
In this example, the check we want is for column capacity_factor,
and it’s the entry with the row condition
fuel_type_code_pudl='gas'
and report_date>=CAST('2015-01-01' AS DATE)
and capacity_factor<>0.0.
[pudl/dbt] $ $EDITOR models/eia/out_eia__monthly_generators/schema.yml
Depending on the situation, from here you can:
investigate further in a Python notebook - how has this data changed from the version in the nightly builds which passed this check?
ask folks if we expect these quantiles to have shifted
fix a bug, re-run the pipeline, and repeat the check
adjust the quantile constraints (& consider leaving a dated note for followup in case it gets worse)
Applying pre-defined validations to existing data¶
Applying an existing generic test to an existing table should be as easy as editing
the schema.yml file associated with that table, and adding a new test specification
to the data_tests section of either the table as a whole or an individual column.
The schema.yml for table_name can be found at
dbt/models/{data_source}/{table_name}/schema.yml.
In general, table-level tests depend on multiple columns or test some property of the table as a whole, while column-level tests typically depend only on values with the column they are applied to.
Pre-defined tests¶
Our dbt project includes dbt-utils and
dbt-expectations as dependencies.
These packages include a bunch of useful tests that can be applied to any table.
There are several examples of applying tests from dbt-expectations in
dbt/models/vcerare/out_vcerare__hourly_available_capacity_factor/schema.yml
and in general they will look like the below. Each item in a data_tests section
defines a single test, and may provide named parameters for the test. The tests whose
names have the dbt_expectations prefix come from that package.
version: 2
sources:
- name: pudl
tables:
- name: out_vcerare__hourly_available_capacity_factor
data_tests:
- expect_columns_not_all_null
- check_row_counts_per_partition:
arguments:
table_name: out_vcerare__hourly_available_capacity_factor
partition_expr: report_year
- expect_valid_hour_of_year
- expect_unique_column_combination:
arguments:
columns:
- county_id_fips
- datetime_utc
columns:
- name: state
data_tests:
- not_null
- name: place_name
data_tests:
- not_null
- dbt_expectations.expect_column_values_to_not_be_in_set:
arguments:
value_set:
- bedford_city
- clifton_forge_city
- lake_hurron
- lake_st_clair
- dbt_expectations.expect_column_values_to_be_in_set:
arguments:
value_set:
- oglala lakota
row_condition: "county_id_fips = '46012'"
- name: datetime_utc
data_tests:
- not_null
- dbt_expectations.expect_column_values_to_not_be_in_set:
arguments:
value_set:
- "{{ dbt_date.date(2020, 12, 31) }}"
- name: report_year
data_tests:
- not_null
- name: hour_of_year
data_tests:
- not_null
- dbt_expectations.expect_column_max_to_be_between:
arguments:
min_value: 8760
max_value: 8760
Tests defined within PUDL¶
Some of the tests in the example above like expect_columns_not_all_null or
check_row_counts_per_partition are defined by us, and can be found in the SQL
files with the same name under dbt/tests/data_tests/generic_tests/
Documentation for the tests that we define is in
dbt/tests/data_tests/generic_tests/schema.yml
Todo
Integrate documentation of our existing generic tests into the docs build.
Adding new tables¶
The tables that exist within PUDL are defined by the data structures within
pudl.metadata.resources. Any Dagster asset that’s being written out to Parquet
or the PUDL SQLite database needs to be defined there. The schema.yml files within
our dbt project are derived from that same PUDL metadata. Our unit tests check to make
sure that the dbt schemas haven’t drifted away from the canonical PUDL metadata. To make
sure that the two sets of database table descriptions stay in sync, we try to create and
update the dbt schemas programmatically when possible.
Using dbt_helper update-tables¶
To add a new PUDL table to the dbt project, you must add it as a dbt
source. The dbt_helper script
automates the initial setup with the update-tables subcommand.
Before adding a table as a dbt source, you need to:
define that table as a resource in
pudl.metadata.resourcesmake sure that table is written out to Parquet
Then you can use the dbt_helper update-tables command to initialize the file.
dbt_helper update-tables --schema new_table_name
This will add a file called dbt/models/{data_source}/new_table_name/schema.yml. You
can also give it a list of tables and they will all be created at once. This yaml file
tells dbt about the table and its schema, but initially it will not have any data
validations defined. Tests need to be added by hand.
Initial data tests¶
There are a few tests that we apply to every table,
which should be defined as soon as you’ve added a new table.
These include check_row_counts_by_partition and expect_columns_not_all_null.
We talk about check_row_counts_by_partition in Updating row counts.
Checking for entirely null columns¶
A test we apply to basically all tables is expect_columns_not_all_null. In
its most basic form it verifies that there are no columns in the table which are
completely null, since that is typically indicative of a bad ENUM constraint, a
column naming error, or a bad merge, and should be investigated. To add this basic
default, you add the test to the table level data_tests with no parameters:
version: 2
sources:
- name: pudl
tables:
- name: new_table_name
data_tests:
- expect_columns_not_all_null
Defining new data validation tests¶
Sometimes you will want to test a property that can’t be expressed
using the existing dbt tests like check_row_counts_per_partition (in
dbt/tests/data_tests/generic_tests) or the tests in dbt_expectations or dbt_utils.
In those cases you’ll need to define a new type of data validation test using dbt!
Writing tests in dbt means they’ll be located next to all the other data validation we’re defining in the dbt schemas, which is nice. They also tend to be quite performant.
In a few rare cases you may need to write the check with access to all of the tools within Python. In those cases, you can use Dagster’s asset checks, but in general we prefer using dbt tests.
How do I write a new dbt test?¶
A dbt test is a templated SQL query that runs on your output data to look for problems in the data. The query should be designed to return no rows if there are no problems with the data. If the query returns any rows at all, then the test will fail.
The test will need to live as a templated piece of SQL within
pudl/dbt/tests/data_tests/generic_tests. dbt has official docs for
doing this, but the core steps are:
Check to see whether the test you need is already provided by dbt-utils or dbt-expectations.
Make a file called
pudl/dbt/tests/data_tests/generic_tests/your_test.sql.Add
{% test your_test(some_test_params...) %}to the top of the file and{% endtest %}to the end. By default, if a test is defined at the table level, it will receive themodelparameter; if it’s defined at the column level, it will receive bothmodelandcolumnparameters; and you can add more custom parameters in the test signature which will be read out of the schema YAML.Write a SQL
SELECTstatement that returns any data that would fail your test, as well as useful debugging information. See our existing tests indbt/tests/data_tests/generic_teststo see some common patterns.dbt/tests/data_tests/generic_tests/expect_consistent_years.sqlmay be of particular use as a simple example that returns useful debugging context along with the failing rows.
If you’re not already familiar with SQL, here are some useful resources:
SQL for Data Scientists (book)
Note
Refer to PUDL Specific Design Choices above for an explanation of some details of our dbt setup that may affect what functionality is available when writing new tests.
Testing your new test¶
OK, now you have a new test, which seems to be working. How can we check to make sure it’s doing what we want?
dbt has robust macro testing tools, and tests are basically macros, but unfortunately you still have to jump through a couple hoops:
Pull the test logic out into a macro
Use the test as a very thin wrapper around the logic macro
Test the logic macro
First, we pull the test logic out into a macro (let’s call it your_macro()):
Move the test file from above to
pudl/dbt/macros/your_macro.sqlReplace
{% test your_macro(...) %}with{% macro your_macro(...) %}Replace
{% endtest %}with{% endmacro %}
The logic macro is now available to use in tests. Next, use the test as a wrapper around the logic macro you just wrote. Make the test file read like this:
{% test your_test(model, custom_param) %}
{{ your_macro(model, custom_param) }}
{% endtest %}
This makes it a very simple wrapper that allows the test logic to be accessed
from a data_tests block within the schema.
Finally, write a test in pudl/dbt/tests/unit_tests/test_your_macro.sql. This
SQL file doesn’t need any special {% ... %} stuff in it.
The structure is easiest to explain with an example. Let’s walk through a test that checks if the row-counts macro is working as expected:
WITH test_row_counts AS (
SELECT * FROM (VALUES
('test_table', 2022, 1),
('test_table', 2023, 1),
) AS t(table_name, partition, row_count)
),
Here, test_row_counts is setting up the expected row counts per partition.
We use that SELECT * FROM (VALUES construction to make a temporary SQL table
with that literal data - 2 rows saying that “test_table should have 1 row in
2022 and 1 in 2023”. Continuing on:
test_table AS (
SELECT * FROM (VALUES
(2022, 'x'),
(2023, 'x'),
) AS t(report_year, dummy_col)
),
Here, we define test_table, the actual table we’re counting rows for. You
can see we’ve added one row for 2022 and one for 2023 - so we expect the test
to pass! Next:
expected_mismatch_counts as (
SELECT * FROM (VALUES
('test_table', 0),
) AS t(table_name, num_mismatches)
),
We’re saying here that expected_mismatch_counts is 0 - there are no
partitions where we expect there to be a mismatch. Next, we call the macro:
result_comparison AS (
SELECT (SELECT COUNT(*)
FROM ({{
row_counts_per_partition('test_table', 'test_table', 'report_year', force_row_counts_table='test_row_counts')
}})) as observed_mismatch_count,
num_mismatches AS expected_mismatch_count,
FROM expected_mismatch_counts
)
This one is a bit more complicated.
Let’s start from the macro call {{ row_counts_per_partition(...) }}. This
gets us one row per partition that has a mismatched number of rows between
the expected row counts (test_row_counts) and the observed row counts in
test_table.
Then we wrap that in SELECT COUNT(*) which tells us how many rows that macro
call returned (in this case, 0).
Finally, we wrap that in SELECT (SELECT COUNT(*) FROM ...) as
observed_mismatch_count .... That makes a table where the columns are the
observed mismatch count (0, as counted by the macro) and the expected mismatch
count (directly pulled from the expected_mismatch_counts table we set up
earlier). Finally we are ready to actually run the top-level SELECT - much
like other tests, we are looking for problem rows - if the SELECT returns 0
rows that means a passing test:
SELECT *
FROM result_comparison
WHERE observed_mismatch_count != expected_mismatch_count
So if we observe a different number of mismatched partitions than what we expect, this test will fail. We can repeat this structure with different input data to cover many different use cases of the macro.
If the test is particularly weird and hard to get good debug
info for, you can add custom debug handlers for your test type in
pudl.dbt_wrapper.build_with_context(), which gives you access to the full
power of Python.
Creating intermediate tables for a test¶
Sometimes you’ll need to do a test in two steps. For example, if you want to
use a column test (such as expect_quantile_constraints) on the ratio of two
columns, you will need to calculate that ratio as a separate column.
This can be done by creating a new dbt model that materializes an
intermediate table you want to execute tests on. Add a SQL file to
dbt/models/{data_source}/{source_table_name}/{intermediate_table_name}.sql
containing a SELECT statement that builds your new table. For
example, if you need to divide the source_table_name.a column by
source_table_name.b:
select
a / b as my_ratio
from {{ source('pudl', 'source_table_name') }}
Then add the model to the schema.yml file under the models top-level
key, and define tests exactly as you would for a source table. See
models/ferc1/out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 for an
example of
this pattern.
Note: when adding a model, it will be stored as a SQL view in the file
$PUDL_OUTPUT/pudl_dbt_tests.duckdb.