Data validation quickstart#

Setup#

The dbt/ directory contains the PUDL dbt project which manages our data tests. dbt is part of the PUDL pixi environment.

The data validation tests run on the Parquet outputs in your $PUDL_OUTPUT/parquet/ directory. Make sure the outputs you’re testing were actually produced by the code on your current branch; otherwise you may be surprised when the data tests pass locally but fail in CI or the nightly builds.

We have a script, pudl.scripts.dbt_helper, to help with some common workflows.

Running the data validation tests#

dbt_helper validate runs the data validation tests. It’s a wrapper around the official dbt command, dbt build; see Running dbt directly.

dbt_helper validate provides rich output when a test fails, and allows us to use the Dagster asset selection syntax.

Example usage:

# for *all* assets
dbt_helper validate
# for just a single asset
dbt_helper validate --asset-select "key:out_eia__yearly_generators"
# for this asset as well as all upstream assets
dbt_helper validate --asset-select "+key:out_eia__yearly_generators"
# same as above, but skip row counts
dbt_helper validate --asset-select "+key:out_eia__yearly_generators" --exclude "*check_row_counts*"

See dbt_helper validate --help for usage details.

Tip

You may want to run the validation tests against multiple sets of Parquet files.

To do this:

  1. Download the Parquet files to <any_directory_you_want>/parquet/.

  2. Set the PUDL_OUTPUT environment variable to <any_directory_you_want>. (Note: use an absolute path!)

  3. Run any of the dbt_helper commands you need.
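Put together, the steps above might look like the following sketch. The /tmp/pudl_nightly directory is a hypothetical location; use any directory you like.

```shell
# Hypothetical target directory; adjust to taste.
OUTDIR=/tmp/pudl_nightly
mkdir -p "$OUTDIR/parquet"
# ...download the Parquet files into "$OUTDIR/parquet/" here...
# Point PUDL_OUTPUT at the directory, using an absolute path:
export PUDL_OUTPUT="$OUTDIR"
echo "$PUDL_OUTPUT"
# Now run whichever dbt_helper command you need, e.g.:
# dbt_helper validate
```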

Some examples of useful Parquet outputs and where to find them:

  • the most recent nightly builds

  • the fast ETL outputs from your integration tests: these live in a temporary directory created by pytest, so you don’t need to download them. The path is printed at the beginning of the pytest run and will look like: 2025-07-25 16:05:49 [    INFO] test.conftest:386 Using temporary PUDL_OUTPUT: /path/to/your/temp/dir

  • branch build outputs: if you have access to the internal build bucket, builds.catalyst.coop, you can also use the Parquet files you find there.

Updating table schemas#

dbt stores information about a table’s schema and what tests are defined in a special YAML file that you need to keep up to date.

That file lives in pudl/dbt/models/<data_source>/<table_name>/schema.yml.

When you change a table’s schema in pudl.metadata.resources, you need to make a matching change to the corresponding dbt YAML file.
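For instance, if you add a column to a table in pudl.metadata.resources, the corresponding schema.yml needs a matching entry in its columns list. A sketch of what such a fragment might look like (the table and column names here are illustrative, not taken from the actual project files):

```yaml
version: 2
sources:
  - name: pudl
    tables:
      - name: out_eia__yearly_generators
        columns:
          - name: report_date
          - name: newly_added_column  # must match pudl.metadata.resources
```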

In simple cases, dbt_helper can automatically update the schema of an existing table with:

dbt_helper update-tables --schema table_to_update

This will work so long as none of the columns being updated have data tests or other manually defined metadata associated with them. If the script finds tests or metadata, it will abort and leave the schema unchanged; you will then have to update the schema manually by editing the columns list in the appropriate schema.yml file. If you want to destructively replace an existing schema, including any manually added tests or metadata, use --clobber:

dbt_helper update-tables --schema --clobber table_to_replace_entirely

Updating row counts#

Warning

You should rarely, if ever, need to edit the row-counts file directly. It must be kept sorted to minimize diffs in git, and manually calculating and editing row counts is both tedious and error-prone.

To create or update the row count expectations for a given table you need to:

  • Make sure a fresh version of the table is available in $PUDL_OUTPUT/parquet. The expectations will be derived from what’s observed in that file.

  • Add check_row_counts_per_partition to the data_tests section of the table’s schema.yml, if it isn’t there already.

When ready to generate row count expectations, the data_tests for a new table might look like this:

version: 2
sources:
  - name: pudl
    tables:
      - name: new_table_name
        data_tests:
          - check_row_counts_per_partition:
              arguments:
                table_name: new_table_name
                partition_expr: "EXTRACT(YEAR FROM report_date)"
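Conceptually, this test compares stored row-count expectations against counts grouped by the partition expression. A rough pandas sketch of that grouping, using made-up data rather than the actual dbt implementation:

```python
import pandas as pd

# Toy stand-in for a Parquet table with a report_date column.
table = pd.DataFrame(
    {"report_date": pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"])}
)
# Equivalent of partitioning by EXTRACT(YEAR FROM report_date):
row_counts = table.groupby(table["report_date"].dt.year).size().to_dict()
print(row_counts)  # {2020: 2, 2021: 1}
```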

Then you can run:

dbt_helper update-tables --row-counts new_table_name # for one table at a time
dbt_helper update-tables --row-counts new_table1 new_table2 # for multiple tables
dbt_helper update-tables --row-counts all # for all new tables

If this is a brand new table, you should see changes appear in dbt/seeds/etl_full_row_counts.csv. If you’re updating the row counts for a table that already exists, you’ll need to use the --clobber option to make the script overwrite existing row counts:

dbt_helper update-tables --row-counts --clobber new_table_name

Tip

If you want to update a larger subset of tables (e.g. all EIA 860 tables), you can use this Dagster hack to avoid manually copying and pasting tons of table names. (For the time being, you cannot use the same --asset-select feature as the validate command.)

  • Launch dagster and go to http://localhost:3000/assets.

  • Use the search bar to find the group of assets you’d like to update. Use the select all checkbox at the top to grab all assets.

  • Shift+click the “Materialize Selected” button.

  • In the search bar at the top left of the Launchpad, triple click to select all table names.

  • Copy and paste the list of tables into a text editor of your choice and replace the commas with spaces so that the table names are separated by a single space (as shown in the code block above).

  • Put that list at the end of the dbt_helper update-tables --row-counts command to update the row counts for all desired tables at once.
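For the comma-removal step, a one-liner like this can do the substitution for you. The table names below are placeholders, not real PUDL tables:

```shell
# Replace commas with spaces so the names can be passed straight to dbt_helper:
echo "core_table_one,core_table_two,core_table_three" | tr ',' ' '
# prints: core_table_one core_table_two core_table_three
```

The result can then be appended to the dbt_helper update-tables --row-counts command.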