pudl.scripts.dbt_helper

A basic CLI to autogenerate dbt data test configurations.

Attributes

logger

ALL_TABLES

Classes

DbtColumn

Define yaml structure of a dbt column.

DbtTable

Define yaml structure of a dbt table.

DbtSource

Define basic dbt yml structure to add a pudl table as a dbt source.

DbtSchema

Define basic structure of a dbt models yaml file.

UpdateResult

TableUpdateArgs

Define a single class to collect the args for all table update commands.

Functions

_prettier_yaml_dumps(→ str)

Dump YAML to string that Prettier likes.

schema_has_removals_or_modifications(→ bool)

Check if schema changes include removals or modifications.

_log_schema_diff(old_schema, new_schema)

Print colored summary of schema changes.

_schema_diff_summary(old_schema, new_schema)

Return a summary of schema changes based on YAML output.

get_data_source(→ str)

Return the data source element of the table's name.

_get_local_table_path(table_name)

_get_model_path(→ pathlib.Path)

_get_row_count_csv_path(→ pathlib.Path)

_get_existing_row_counts(→ pandas.DataFrame)

_calculate_row_counts(→ pandas.DataFrame)

_combine_row_counts(→ pandas.DataFrame)

_write_row_counts(row_counts)

update_row_counts(→ UpdateResult)

Generate updated row counts per partition and write to csv file within dbt project.

update_table_schema(→ UpdateResult)

Generate and write out a schema.yaml file defining a new or updated table.

_log_update_result(result)

_extract_row_count_partitions(→ list[str | None])

Extract partition columns from check_row_counts_per_partition tests in a DbtTable.

update_tables(tables, clobber, schema, row_counts)

Add or update dbt schema configs and row count expectations for PUDL tables.

validate(→ None)

Validate a selection of dbt nodes.

dbt_helper()

Script for auto-generating dbt configuration and migrating existing tests.

Module Contents

pudl.scripts.dbt_helper.logger
pudl.scripts.dbt_helper.ALL_TABLES
pudl.scripts.dbt_helper._prettier_yaml_dumps(yaml_contents: dict[str, Any]) → str

Dump YAML to string that Prettier likes.

class pudl.scripts.dbt_helper.DbtColumn(/, **data: Any)

Bases: pydantic.BaseModel

Define yaml structure of a dbt column.

name: str
description: str | None = None
data_tests: list | None = None
meta: dict | None = None
tags: list[str] | None = None
add_column_tests(column_tests: list) → DbtColumn

Add data tests to columns in dbt config.

class pudl.scripts.dbt_helper.DbtTable(/, **data: Any)

Bases: pydantic.BaseModel

Define yaml structure of a dbt table.

name: str
description: str | None = None
data_tests: list | None = None
columns: list[DbtColumn]
meta: dict | None = None
tags: list[str] | None = None
config: dict | None = None
add_source_tests(source_tests: list) → DbtTable

Add data tests to source in dbt config.

add_column_tests(column_tests: dict[str, list]) → DbtTable

Add data tests to columns in dbt config.

classmethod from_table_name(table_name: str) → DbtTable

Construct configuration defining table from PUDL metadata.

class pudl.scripts.dbt_helper.DbtSource(/, **data: Any)

Bases: pydantic.BaseModel

Define basic dbt yml structure to add a pudl table as a dbt source.

name: str = 'pudl'
tables: list[DbtTable]
description: str | None = None
meta: dict | None = None
add_source_tests(source_tests: list) → DbtSource

Add data tests to source in dbt config.

add_column_tests(column_tests: dict[str, list]) → DbtSource

Add data tests to columns in dbt config.

class pudl.scripts.dbt_helper.DbtSchema(/, **data: Any)

Bases: pydantic.BaseModel

Define basic structure of a dbt models yaml file.

version: int = 2
sources: list[DbtSource]
models: list[DbtTable] | None = None
add_source_tests(source_tests: list, model_name: str | None = None) → DbtSchema

Add data tests to source in dbt config.

add_column_tests(column_tests: dict[str, list], model_name: str | None = None) → DbtSchema

Add data tests to columns in dbt config.

classmethod from_table_name(table_name: str) → DbtSchema

Construct configuration defining table from PUDL metadata.

classmethod from_yaml(schema_path: pathlib.Path) → DbtSchema

Load a DbtSchema object from a YAML file.

to_yaml(schema_path: pathlib.Path)

Write DbtSchema object to YAML file.

merge_metadata_from(old_schema: DbtSchema) → DbtSchema

Merge metadata from an old schema into this one, preferring new values where present.
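The "preferring new values where present" rule can be illustrated at the level of a single metadata dict. This is only a sketch under the assumption that "present" means "not None"; the helper name `merge_preferring_new` and the sample column dicts are hypothetical and do not come from the module, which operates on whole DbtSchema objects.

```python
def merge_preferring_new(old: dict, new: dict) -> dict:
    """Merge two metadata dicts, keeping old values only where new has none."""
    merged = dict(old)
    for key, value in new.items():
        # Assumption: None means "not set" in the new config, so the old value survives.
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical column metadata, shaped like the DbtColumn fields listed above.
old_column = {"name": "plant_id_eia", "description": "Old text.", "data_tests": ["not_null"]}
new_column = {"name": "plant_id_eia", "description": "New text.", "data_tests": None}
merged_column = merge_preferring_new(old_column, new_column)
```

Here the new description wins, while the hand-written data tests from the old config are carried over because the new config does not define any.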

pudl.scripts.dbt_helper.schema_has_removals_or_modifications(old_schema: dict, new_schema: dict) → bool

Check if schema changes include removals or modifications.

Ignores:

* Column removals with no metadata (only {"name"} or empty dict)
* Column renames (values_changed on ['name'])
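The first ignore rule can be sketched as a small predicate on a column dict. The helper name `is_ignorable_removal` is hypothetical; the actual function compares two whole schemas, and this fragment only shows what "no metadata" plausibly means for a removed column.

```python
def is_ignorable_removal(column: dict) -> bool:
    """Return True when a removed column carried no metadata worth preserving."""
    # An empty dict, or a dict whose only key is "name", has nothing to lose.
    return set(column) <= {"name"}


print(is_ignorable_removal({"name": "plant_id_eia"}))  # True
print(is_ignorable_removal({}))  # True
print(is_ignorable_removal({"name": "plant_id_eia", "data_tests": ["not_null"]}))  # False
```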

pudl.scripts.dbt_helper._log_schema_diff(old_schema: DbtSchema, new_schema: DbtSchema)

Print colored summary of schema changes.

pudl.scripts.dbt_helper._schema_diff_summary(old_schema: DbtSchema, new_schema: DbtSchema)

Return a summary of schema changes based on YAML output.

pudl.scripts.dbt_helper.get_data_source(table_name: str) → str

Return the data source element of the table’s name.
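As an illustration, the following sketch assumes table names follow a `{layer}_{source}__{rest}` pattern, as in `out_eia__yearly_generators` from the usage examples on this page. The function name `get_data_source_sketch` is hypothetical, and the real implementation may parse names differently.

```python
def get_data_source_sketch(table_name: str) -> str:
    """Pull the data source out of a name shaped like {layer}_{source}__{rest}."""
    # Everything before the double underscore holds the layer and the source.
    prefix = table_name.lstrip("_").split("__", maxsplit=1)[0]
    # Drop the leading layer token (e.g. "out") to leave the source.
    _layer, _, source = prefix.partition("_")
    return source


print(get_data_source_sketch("out_eia__yearly_generators"))  # eia
```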

class pudl.scripts.dbt_helper.UpdateResult

Bases: tuple

success
message
pudl.scripts.dbt_helper._get_local_table_path(table_name)
pudl.scripts.dbt_helper._get_model_path(table_name: str, data_source: str) → pathlib.Path
pudl.scripts.dbt_helper._get_row_count_csv_path() → pathlib.Path
pudl.scripts.dbt_helper._get_existing_row_counts() → pandas.DataFrame
pudl.scripts.dbt_helper._calculate_row_counts(table_name: str, partition_expr: str | None = None) → pandas.DataFrame
pudl.scripts.dbt_helper._combine_row_counts(existing: pandas.DataFrame, new: pandas.DataFrame) → pandas.DataFrame
pudl.scripts.dbt_helper._write_row_counts(row_counts: pandas.DataFrame)
pudl.scripts.dbt_helper.update_row_counts(table_name: str, data_source: str, clobber: bool = False) → UpdateResult

Generate updated row counts per partition and write to csv file within dbt project.
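One plausible way to fold freshly calculated counts into the existing ones is to replace every row belonging to a recalculated table and keep the rest. The sketch below uses plain lists of dicts as a stand-in for the pandas DataFrames the module works with; the function name and the column names `table_name`, `partition`, and `row_count` are assumptions, not taken from this page.

```python
def combine_row_counts_sketch(existing: list, new: list) -> list:
    """Replace the rows for every table present in `new` and keep the rest."""
    updated_tables = {row["table_name"] for row in new}
    # Stale counts for recalculated tables are dropped entirely.
    kept = [row for row in existing if row["table_name"] not in updated_tables]
    # Sort for a stable file layout; partition may be None for unpartitioned tables.
    return sorted(kept + new, key=lambda row: (row["table_name"], str(row["partition"])))


# Hypothetical counts for two tables, "a" and "b".
existing = [
    {"table_name": "a", "partition": "2020", "row_count": 10},
    {"table_name": "b", "partition": "2020", "row_count": 5},
]
new = [
    {"table_name": "a", "partition": "2020", "row_count": 12},
    {"table_name": "a", "partition": "2021", "row_count": 7},
]
combined = combine_row_counts_sketch(existing, new)
```

After the call, table "a" carries only its two new rows, and table "b" is unchanged.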

pudl.scripts.dbt_helper.update_table_schema(table_name: str, data_source: str, clobber: bool = False) → UpdateResult

Generate and write out a schema.yaml file defining a new or updated table.

pudl.scripts.dbt_helper._log_update_result(result: UpdateResult)
pudl.scripts.dbt_helper._extract_row_count_partitions(table: DbtTable) → list[str | None]

Extract partition columns from check_row_counts_per_partition tests in a DbtTable.
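A rough sketch of this extraction, working on a plain `data_tests` list rather than a DbtTable. It assumes each test is either a bare string or a single-key dict, and that the partition column is stored under a `partition_expr` key (a name borrowed from the `_calculate_row_counts` signature); the real YAML layout may nest the arguments differently.

```python
from __future__ import annotations


def extract_partitions_sketch(data_tests: list | None) -> list[str | None]:
    """Collect the partition column of every check_row_counts_per_partition test."""
    partitions: list[str | None] = []
    for test in data_tests or []:
        # Tests given as bare strings (e.g. "not_null") carry no arguments.
        if not isinstance(test, dict):
            continue
        if "check_row_counts_per_partition" not in test:
            continue
        # Assumed layout: arguments sit directly under the test name.
        config = test["check_row_counts_per_partition"] or {}
        # None marks a row count test on an unpartitioned table.
        partitions.append(config.get("partition_expr"))
    return partitions


tests = [
    "not_null",
    {"check_row_counts_per_partition": {"partition_expr": "report_year"}},
    {"check_row_counts_per_partition": None},
]
print(extract_partitions_sketch(tests))  # ['report_year', None]
```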

class pudl.scripts.dbt_helper.TableUpdateArgs

Define a single class to collect the args for all table update commands.

tables: list[str]
schema: bool = False
row_counts: bool = False
clobber: bool = False
pudl.scripts.dbt_helper.update_tables(tables: list[str], clobber: bool, schema: bool, row_counts: bool)

Add or update dbt schema configs and row count expectations for PUDL tables.

The tables argument can be a single table name, a list of table names, or 'all'. If 'all', the script will update configurations for all PUDL tables.

If --clobber is set, existing configurations for tables will be overwritten, provided this does not result in deletions.
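The handling of the tables argument might look roughly like the sketch below. The helper name `resolve_tables` and the sample `KNOWN_TABLES` list are hypothetical; in the module itself the full list presumably comes from `ALL_TABLES`.

```python
from __future__ import annotations

from collections.abc import Iterable


def resolve_tables(tables: str | Iterable[str], all_tables: Iterable[str]) -> list[str]:
    """Expand a single name, a list of names, or 'all' into a concrete list."""
    # Accept a bare string as shorthand for a one-element list.
    requested = [tables] if isinstance(tables, str) else list(tables)
    if "all" in requested:
        return list(all_tables)
    return requested


# Hypothetical stand-in for the module-level ALL_TABLES.
KNOWN_TABLES = ["out_eia__yearly_generators", "out_eia__yearly_plants"]
print(resolve_tables("all", KNOWN_TABLES))
print(resolve_tables("out_eia__yearly_generators", KNOWN_TABLES))
```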

pudl.scripts.dbt_helper.validate(select: str | None = None, asset_select: str | None = None, exclude: str | None = None, dry_run: bool = False) → None

Validate a selection of dbt nodes.

Wraps the dbt build command line so we can annotate the result with the actual data that was returned from the test query.

Understands how to translate Dagster asset selection syntax into dbt node selections via the --asset-select flag.

Default behavior if you do not pass --asset-select or --select is to validate everything.

Usage examples:

Run all the checks for one asset:

$ dbt_helper validate --asset-select "key:out_eia__yearly_generators"

Run the checks for one specific dbt node:

$ dbt_helper validate --select "source:pudl_dbt.pudl.out_eia__yearly_generators"

Run checks for an asset and all its upstream dependencies:

$ dbt_helper validate --asset-select "+key:out_eia__yearly_generators"

Exclude the row count tests:

$ dbt_helper validate --asset-select "+key:out_eia__yearly_generators" --exclude "check_row_counts"
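When driving these commands from a Python script, it can help to assemble the argument list programmatically so that selectors containing `+` or `:` need no shell quoting. The helper below is a hypothetical sketch: `build_validate_argv` is not part of the module, and it only uses the three flags that appear in the examples above.

```python
from __future__ import annotations


def build_validate_argv(
    select: str | None = None,
    asset_select: str | None = None,
    exclude: str | None = None,
) -> list[str]:
    """Assemble a `dbt_helper validate` command line as an argument list."""
    argv = ["dbt_helper", "validate"]
    options = (("--select", select), ("--asset-select", asset_select), ("--exclude", exclude))
    for flag, value in options:
        # Leave out any flag that was not requested; with no flags, everything is validated.
        if value is not None:
            argv += [flag, value]
    return argv


argv = build_validate_argv(asset_select="+key:out_eia__yearly_generators")
print(argv)
```

The resulting list could then be handed to `subprocess.run(argv, check=True)` in an environment where the `dbt_helper` command is installed.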

pudl.scripts.dbt_helper.dbt_helper()

Script for auto-generating dbt configuration and migrating existing tests.

This CLI currently provides the following sub-commands:

* update-tables: update or create a dbt table (model) schema.yml file under the dbt/models directory. These configuration files tell dbt about the structure of the table and what data tests are specified for it. The command can also generate or update the expected row counts for existing tables, assuming they have been materialized to Parquet files in your $PUDL_OUTPUT directory.
* validate: run validation tests for a selection of dbt nodes.

Run dbt_helper {command} --help for detailed usage on each command.