.. _data_validation:

==========================
Data validation quickstart
==========================

Setup
-----

The ``dbt/`` directory contains the PUDL dbt project which manages our
`data tests `__. dbt is part of the PUDL pixi environment.

The data validation tests run on the Parquet outputs that are in your
``$PUDL_OUTPUT/parquet/`` directory. It's important to ensure that the outputs
you're testing are actually the result of the code on your current branch;
otherwise you may be surprised when a data test passes locally but fails in CI
or the nightly builds.

We have a script, :mod:`pudl.scripts.dbt_helper`, to help with some common
workflows.

Running the data validation tests
---------------------------------

``dbt_helper validate`` runs the data validation tests. It's a wrapper around
the official dbt tool, ``dbt build``; see :ref:`dbt_build`. ``dbt_helper
validate`` provides rich output when a test fails, and allows us to use the
`Dagster asset selection syntax `__.

Example usage:

.. code-block:: bash

   # for *all* assets
   dbt_helper validate

   # for just a single asset
   dbt_helper validate --asset-select "key:out_eia__yearly_generators"

   # for this asset as well as all upstream assets
   dbt_helper validate --asset-select "+key:out_eia__yearly_generators"

   # same as above, but skip row counts
   dbt_helper validate --asset-select "+key:out_eia__yearly_generators" --exclude "*check_row_counts*"

See ``dbt_helper validate --help`` for usage details.

.. tip::

   You may want to run the validation tests against multiple sets of Parquet
   files. To do this:

   1. Download the Parquet files to ``/parquet/``.
   2. Set the ``PUDL_OUTPUT`` environment variable to ````. (*Note:* use an
      absolute path!)
   3. Run any of the ``dbt_helper`` commands you need.

   Some examples of useful Parquet outputs and where to find them:

   * `the most recent nightly builds `__
   * the fast ETL outputs from your integration tests: these are in a
     temporary directory created by ``pytest``. Since these are already on
     your computer you don't need to download them. The path is printed out at
     the beginning of the ``pytest`` run and will look like:
     ``2025-07-25 16:05:49 [ INFO] test.conftest:386 Using temporary PUDL_OUTPUT: /path/to/your/temp/dir``
   * any :ref:`branch_builds` outputs: if you have access to the internal
     build bucket, `builds.catalyst.coop `__, you can also use the Parquet
     files you find there.

.. _update_dbt_schema:

Updating table schemas
----------------------

dbt stores information about a table's schema and what tests are defined in a
special YAML file that you need to keep up to date. That file lives in
``pudl/dbt/models///schema.yml``. When you change a table's schema in
``pudl.metadata.resources``, you need to make a matching change to the
corresponding dbt YAML file.

In simple cases, ``dbt_helper`` can automatically update the schema of an
existing table with:

.. code-block:: bash

   dbt_helper update-tables --schema table_to_update

This will work so long as none of the columns being updated have data tests or
other manually defined metadata associated with them. If the script finds
tests or metadata it will abort, leaving the schema unchanged, and you will
have to update the schema manually, by editing the ``columns`` list in the
appropriate ``schema.yml`` file.

If you want to destructively replace an existing schema **including any
manually added tests or metadata** you can use ``--clobber``:

.. code-block:: bash

   dbt_helper update-tables --schema --clobber table_to_replace_entirely

.. _row_counts:

Updating row counts
-------------------

.. warning::

   You should rarely if ever need to edit the row-counts file directly. It
   needs to be kept sorted to minimize diffs in git, and manually calculating
   and editing row counts is both tedious and error prone.

To create or update the row count expectations for a given table you need to:

* Make sure a fresh version of the table is available in
  ``$PUDL_OUTPUT/parquet``. The expectations will be derived from what's
  observed in that file.
* Add ``check_row_counts_per_partition`` to the ``data_tests`` section of the
  table's ``schema.yml``, if it isn't there already.

When ready to generate row count expectations, the ``data_tests`` for a new
table might look like this:

.. code-block:: yaml

   version: 2
   sources:
     - name: pudl
       tables:
         - name: new_table_name
           data_tests:
             - check_row_counts_per_partition:
                 arguments:
                   table_name: new_table_name
                   partition_expr: "EXTRACT(YEAR FROM report_date)"

Then you can run:

.. code-block:: bash

   dbt_helper update-tables --row-counts new_table_name  # for one table at a time
   dbt_helper update-tables --row-counts new_table1 new_table2  # for multiple tables
   dbt_helper update-tables --row-counts all  # for all new tables

If this is a brand new table, you should see changes appear in
``dbt/seeds/etl_full_row_counts.csv``. If you're updating the row counts for a
table that already exists, you'll need to use the ``--clobber`` option to make
the script overwrite existing row counts:

.. code-block:: bash

   dbt_helper update-tables --row-counts --clobber new_table_name

.. tip::

   If you want to update a larger subset of tables (e.g. all EIA 860 tables),
   you can use this Dagster hack to avoid manually copying and pasting tons of
   table names. (For the time being, you cannot use the same
   ``--asset-select`` feature as the ``validate`` command.)

   - Launch Dagster and go to ``http://localhost:3000/assets``.
   - Use the search bar to find the group of assets you'd like to update. Use
     the select all checkbox at the top to grab all assets.
   - Shift+click the "Materialize Selected" button.
   - In the search bar at the top left of the Launchpad, triple-click to
     select all table names.
   - Copy and paste the list of tables into a text editor of your choice and
     remove commas so that table names are separated by a single space (as
     shown in the codeblock above).
   - Put that list at the end of the ``dbt_helper update-tables --row-counts``
     command to update the row counts for all desired tables at once.
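Instead of doing the comma-to-space replacement in a text editor, you can do it on the command line. This is a minimal sketch, assuming the copied list is pasted into a shell variable; the table names below are placeholders, not a real asset selection:

```shell
# Hypothetical asset list copied from the Dagster Launchpad (placeholder names).
assets="core_eia860__scd_generators,core_eia860__scd_plants,core_eia860__scd_utilities"

# tr swaps each comma for a space, producing the space-separated list
# that `dbt_helper update-tables --row-counts` expects.
echo "$assets" | tr ',' ' '
# → core_eia860__scd_generators core_eia860__scd_plants core_eia860__scd_utilities

# You could then pass the result straight to the helper, e.g.:
# dbt_helper update-tables --row-counts $(echo "$assets" | tr ',' ' ')
```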