pudl.io_managers¶
Dagster IO Managers.
Attributes¶
Classes¶
Format switching IOManager that supports sqlite and parquet. |
|
IO Manager that writes and retrieves dataframes from a SQLite database. |
|
IOManager that writes pudl tables to pyarrow parquet files. |
|
Do some extra work to output valid GeoParquet files when appropriate. |
|
IO Manager that writes and retrieves dataframes from a SQLite database. |
|
IO Manager for reading tables from FERC databases. |
|
IO Manager for only reading tables from the FERC 1 database. |
|
IO Manager for only reading tables from the XBRL database. |
Functions¶
|
Retrieves the table name from the context object. |
|
Create a SQLiteManager dagster resource for the pudl database. |
|
Create a Parquet only IO manager. |
|
Create a GeoParquet only IO manager. |
|
Create a SQLiteManager dagster resource for the ferc1 dbf database. |
|
Create a SQLiteManager dagster resource for the ferc1 xbrl database. |
|
Create a SQLiteManager dagster resource for the ferc714 xbrl database. |
Module Contents¶
- pudl.io_managers.get_table_name_from_context(context: dagster.OutputContext) str[source]¶
Retrieves the table name from the context object.
- class pudl.io_managers.PudlMixedFormatIOManager(write_to_parquet: bool = False, read_from_parquet: bool = False)[source]¶
Bases:
dagster.IOManagerFormat switching IOManager that supports sqlite and parquet.
This IOManager provides for the use of parquet files along with the standard SQLite database produced by PUDL.
- handle_output(context: dagster.OutputContext, obj: pandas.DataFrame | str) pandas.DataFrame[source]¶
Passes the output to the appropriate IO manager instance.
- load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Reads input from the appropriate IO manager instance.
- class pudl.io_managers.SQLiteIOManager(base_dir: str, db_name: str, md: sqlalchemy.MetaData | None = None, timeout: float = 1000.0)[source]¶
Bases:
dagster.IOManagerIO Manager that writes and retrieves dataframes from a SQLite database.
- _setup_database(timeout: float = 1000.0) sqlalchemy.Engine[source]¶
Create database and metadata if they don’t exist.
- Parameters:
timeout – How many seconds the connection should wait before raising an exception, if the database is locked by another connection. If another connection opens a transaction to modify the database, it will be locked until that transaction is committed.
- Returns:
SQL Alchemy engine that connects to a database in the base_dir.
- Return type:
engine
- _get_sqlalchemy_table(table_name: str) sqlalchemy.Table[source]¶
Get SQL Alchemy Table object from metadata given a table_name.
- Parameters:
table_name – The name of the table to look up.
- Returns:
Corresponding SQL Alchemy Table in SQLiteIOManager metadata.
- Return type:
table
- Raises:
ValueError – if table_name does not exist in the SQLiteIOManager metadata.
- _handle_pandas_output(context: dagster.OutputContext, df: pandas.DataFrame)[source]¶
Write dataframe to the database.
SQLite does not support concurrent writes to the database. Instead, SQLite queues write transactions and executes them one at a time. This allows the assets to be processed in parallel. See the SQLAlchemy docs to learn more about SQLite concurrency.
- Parameters:
context – dagster keyword that provides access to output information like asset name.
df – dataframe to write to the database.
- _handle_str_output(context: dagster.OutputContext, query: str)[source]¶
Execute a sql query on the database.
This is used for creating output views in the database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
query – sql query to execute in the database.
- handle_output(context: dagster.OutputContext, obj: pandas.DataFrame | str)[source]¶
Handle an op or asset output.
If the output is a dataframe, write it to the database. If it is a string execute it as a SQL query.
- Parameters:
context – dagster keyword that provides access output information like asset name.
obj – a sql query or dataframe to add to the database.
- Raises:
Exception – if an asset or op returns an unsupported datatype.
- load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Load a dataframe from a sqlite database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
- class pudl.io_managers.PudlParquetIOManager[source]¶
Bases:
dagster.IOManagerIOManager that writes pudl tables to pyarrow parquet files.
- handle_output(context: dagster.OutputContext, obj: pandas.DataFrame | polars.LazyFrame) None[source]¶
Writes pudl dataframe to parquet file.
- load_input(context: dagster.InputContext) pandas.DataFrame | geopandas.GeoDataFrame | polars.LazyFrame[source]¶
Loads pudl table from parquet file.
- class pudl.io_managers.PudlGeoParquetIOManager[source]¶
Bases:
PudlParquetIOManagerDo some extra work to output valid GeoParquet files when appropriate.
- _create_geoparquet_metadata(gdf: geopandas.GeoDataFrame, res: pudl.metadata.classes.Resource) str[source]¶
Create GeoParquet metadata JSON string.
- handle_output(context: dagster.OutputContext, obj: geopandas.GeoDataFrame) None[source]¶
Write a PUDL dataframe to GeoParquet.
- class pudl.io_managers.PudlSQLiteIOManager(base_dir: str, db_name: str, package: pudl.metadata.classes.Package | None = None, timeout: float = 1000.0)[source]¶
Bases:
SQLiteIOManagerIO Manager that writes and retrieves dataframes from a SQLite database.
This class extends the SQLiteIOManager class to manage database metadata and dtypes using the
pudl.metadata.classes.Packageclass.- _handle_str_output(context: dagster.OutputContext, query: str)[source]¶
Execute a sql query on the database.
This is used for creating output views in the database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
query – sql query to execute in the database.
- _handle_pandas_output(context: dagster.OutputContext, df: pandas.DataFrame)[source]¶
Enforce PUDL DB schema and write dataframe to SQLite.
- load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Load a dataframe from a sqlite database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
- pudl.io_managers.pudl_mixed_format_io_manager(init_context: dagster.InitResourceContext) dagster.IOManager[source]¶
Create a SQLiteManager dagster resource for the pudl database.
- pudl.io_managers.parquet_io_manager(init_context: dagster.InitResourceContext) dagster.IOManager[source]¶
Create a Parquet only IO manager.
- pudl.io_managers.geoparquet_io_manager(init_context: dagster.InitResourceContext) dagster.IOManager[source]¶
Create a GeoParquet only IO manager.
- class pudl.io_managers.FercSQLiteIOManager(base_dir: str = None, db_name: str = None, md: sqlalchemy.MetaData = None, timeout: float = 1000.0)[source]¶
Bases:
SQLiteIOManagerIO Manager for reading tables from FERC databases.
This class should be subclassed and the load_input and handle_output methods should be implemented.
This IOManager expects the database to already exist.
- _setup_database(timeout: float = 1000.0) sqlalchemy.Engine[source]¶
Create database engine and read the metadata.
- Parameters:
timeout – How many seconds the connection should wait before raising an exception, if the database is locked by another connection. If another connection opens a transaction to modify the database, it will be locked until that transaction is committed.
- Returns:
SQL Alchemy engine that connects to a database in the base_dir.
- Return type:
engine
- abstractmethod handle_output(context: dagster.OutputContext, obj)[source]¶
Handle an op or asset output.
- abstractmethod load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Load a dataframe from a sqlite database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
- class pudl.io_managers.FercDBFSQLiteIOManager(base_dir: str = None, db_name: str = None, md: sqlalchemy.MetaData = None, timeout: float = 1000.0)[source]¶
Bases:
FercSQLiteIOManagerIO Manager for only reading tables from the FERC 1 database.
This IO Manager is for reading data only. It does not handle outputs because the raw FERC tables are not known prior to running the ETL and are not recorded in our metadata.
- abstractmethod handle_output(context: dagster.OutputContext, obj: pandas.DataFrame | str)[source]¶
Handle an op or asset output.
- load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Load a dataframe from a sqlite database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
- pudl.io_managers.ferc1_dbf_sqlite_io_manager(init_context) FercDBFSQLiteIOManager[source]¶
Create a SQLiteManager dagster resource for the ferc1 dbf database.
- class pudl.io_managers.FercXBRLSQLiteIOManager(base_dir: str = None, db_name: str = None, md: sqlalchemy.MetaData = None, timeout: float = 1000.0)[source]¶
Bases:
FercSQLiteIOManagerIO Manager for only reading tables from the XBRL database.
This IO Manager is for reading data only. It does not handle outputs because the raw FERC tables are not known prior to running the ETL and are not recorded in our metadata.
- static refine_report_year(df: pandas.DataFrame, xbrl_years: list[int]) pandas.DataFrame[source]¶
Set a fact’s report year by its actual dates.
Sometimes a fact belongs to a context which has no ReportYear associated with it; other times there are multiple ReportYears associated with a single filing. In these cases the report year of a specific fact may be associated with the other years in the filing.
In many cases we can infer the actual report year from the fact’s associated time period - either duration or instant.
- abstractmethod handle_output(context: dagster.OutputContext, obj: pandas.DataFrame | str)[source]¶
Handle an op or asset output.
- load_input(context: dagster.InputContext) pandas.DataFrame[source]¶
Load a dataframe from a sqlite database.
- Parameters:
context – dagster keyword that provides access output information like asset name.
- pudl.io_managers.ferc1_xbrl_sqlite_io_manager(init_context) FercXBRLSQLiteIOManager[source]¶
Create a SQLiteManager dagster resource for the ferc1 xbrl database.
- pudl.io_managers.ferc714_xbrl_sqlite_io_manager(init_context) FercXBRLSQLiteIOManager[source]¶
Create a SQLiteManager dagster resource for the ferc714 xbrl database.