Entity Resolution =============================================================================== Overview ------------------------------------------------------------------------------- Many of the datasets PUDL processes report the same information about individual entities in many different places. This is usually done for readability. For example, it's nice to list the names of plants or utilities instead of just their IDs. This makes the original data easier for users to read but it also introduces the potential for internal inconsistencies (`entity `__ or `referential integrity `__ issues) which create problems when you're trying to re-use the data in other applications. PUDL attempts to identify canonical values ("`golden records `__") for entities from the potentially inconsistent original data through an `entity resolution `__ process (often called "entity harvesting" in the PUDL codebase). For example, the same plant or generator may appear in many forms, worksheets, and years. Across those sources, attributes like plant name, associated balancing authority, geographic coordinates, or operating dates may be: * omitted in some tables, * reported with slightly different spellings or codes, * updated in one place but not another, or * associated with one year of reporting but not another. If PUDL exposed every upstream table independently without reconciling those inconsistencies, many common analyses would require users to manually decide which value to trust for each entity and year. Instead, PUDL builds normalized entity tables that aim to provide a canonical record for each entity that includes the attributes that are expected to be stable over time and where appropriate, a yearly record of attributes that are expected to change over time. The entity resolution process is applied most extensively to the :doc:`EIA-860 ` and :doc:`EIA-923 ` spreadsheet data, but it's a more general issue that comes up throughout the data we process. This page explains the entity resolution process conceptually, why it exists, and what it means when the PUDL entity tables do not exactly mirror an individual raw EIA spreadsheet. What Entity Resolution Produces ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For each entity type, PUDL typically creates two related tables: * A static entity table such as :ref:`core_eia__entity_plants`, with one row per entity and attributes that are expected to be stable over time. * A yearly slowly changing dimension (SCD) table such as :ref:`core_eia860__scd_plants`, with one row per entity per report year and attributes that are expected to vary slightly over time. PUDL currently resolves four kinds of EIA entities, with these associated entity ID columns: * Utilities ``(utility_id_eia)`` * Plants ``(plant_id_eia)`` * Boilers ``(plant_id_eia, boiler_id)`` * Generators ``(plant_id_eia, generator_id)`` Why A Canonical Record Is Necessary ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Users are sometimes surprised to see slight differences between the raw EIA spreadsheets and analogous PUDL data, but in most of the cases we've tracked down over the years, these differences are expected and intentional. For example, an EIA plant may be named slightly differently across the plant table, generator table, ownership table, and emissions-control tables. One table might use an older balancing authority code. Another might omit latitude and longitude. A generator operating date might be reported consistently in most years, but differ in one source. When this happens, PUDL does not treat every reported value as equally authoritative. Instead, it looks across all relevant upstream tables and determines which value was reported most consistently for this entity. This is usually desirable. A normalized analytical database is more useful when it provides stable, reconciled identifiers and attributes rather than reproducing every reporting inconsistency exactly as filed. How Entity Resolution Works in PUDL ------------------------------------------------------------------------------- Where Entity Resolution Happens in the PUDL Data Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Entity resolution sits in the middle of the EIA transformation pipeline. First, dataset-specific transforms create a set of unnormalized ``_core`` tables. These tables are already cleaned and typed, but they still reflect the original structure of the source data. Then we scan those tables for entity IDs and their known attributes, to produce normalized entity tables. The simplified flow for plants looks like this: .. mermaid:: flowchart LR raw[Raw EIA-860 and EIA-923 schedules] core1[_core_eia860__plants] core2[_core_eia860__generators] core3[_core_eia860__ownership] core4[_core_eia923__generation] core5[_core_eia923__fuel_receipts_costs] core6[_core_eia860__emissions_control_equipment] harvest[harvested_plants_eia] entity[core_eia__entity_plants] scd[core_eia860__scd_plants] outputs[out_eia__yearly_plants
and other downstream outputs] raw --> core1 raw --> core2 raw --> core3 raw --> core4 raw --> core5 raw --> core6 core1 --> harvest core2 --> harvest core3 --> harvest core4 --> harvest core5 --> harvest core6 --> harvest harvest --> entity harvest --> scd entity --> outputs scd --> outputs Conceptually, the same pattern is used for all four entity types. The full list of input tables for all EIA entities can be found in :py:const:`pudl.transform.eia.HARVESTABLE_ASSETS` How PUDL Identifies Existing Entities ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The first step is to determine which entities and entity-years exist anywhere in the upstream transformed EIA data. This is done in :func:`pudl.transform.eia._compile_all_entity_records`. To produce the annually varying SCD tables ``report_date`` is normalized to January 1 of the report year so that all annual records align even if an upstream table has a different temporal resolution. This results in an interim dataframe containing every observed instance of the entity across all the upstream tables. The next step is to derive the ID spaces for the entity and annual SCD tables. * The static entity table gets one row per unique entity ID. * The yearly SCD table gets one row per unique combination of ``entity_id`` and ``report_date`` (where ``entity_id`` is sometimes actually composed of multiple columns). This means an entity-year can enter the yearly resolved table from *any* upstream asset that reports the entity ID column and ``report_date``. How PUDL Chooses A Canonical Value ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When there are multiple values reported for the same entity, in most cases, PUDL chooses the most consistent value reported so long as it is found in at least 70% of available entries. If no observed value occurs more than 70% of the time, PUDL fills in a null value. The selection logic looks like this conceptually: .. mermaid:: flowchart TD a[All reported values for one attribute] b{Static or annual?} c[Group by entity ID] d[Group by entity ID and report_date] e[Count entity occurrences] f[Count occurrences of each candidate value] g{Any value exceeds strictness threshold?} h[Keep the most consistent value] i[Leave harvested value null] j[Merge into core_eia__entity_* or core_eia860__scd_*] a --> b b -->|Static| c b -->|Annual| d c --> e d --> e e --> f f --> g g -->|Yes| h g -->|No| i h --> j i --> j Special Cases ~~~~~~~~~~~~~ Some attributes are too messy to reconcile with the default consistency logic alone. We use different rules for columns with additional requirements: * Latitude and longitude are floating point values and are frequently not exactly equal, so 70% consistency is not attainable very often. In cases without a consistent value we round latitude and longitude to the nearest tenth of a degree and repeat the process. * Generator operating date has an unusual pattern of missingness that permits the most recently reported operating date to be reliable when 70% consistency cannot otherwise be reached. * We set the consistency threshold to 0% for a few columns so that we always get a value. These include plant name, utility name, and prime mover code. .. caution:: Because we always choose the most consistent prime mover code, even when less than 50% of occurrences have that value, we occasionally see a tie between two equally consistent values. In these cases, the resolved value may not be deterministic. Because prime mover code is part of the primary key in some tables it is possible for some records to appear or disappear based on which value was chosen. However, this scenario is extremely rare (on the order of 1 in 100,000 records). Interpreting Discrepancies ------------------------------------------------------------------------------- If you compare a PUDL entity table, SCD table, or any table downstream of them to a raw EIA spreadsheet, you should expect to see several kinds of discrepancies: * **A name or code may differ** because PUDL selected the value that was most consistently reported across many tables and years. * **A value may be null in PUDL** because upstream reporting was too inconsistent to unambiguously choose a canonical value. * **An entity-year may exist in PUDL but not in the spreadsheet you are viewing** because the plant-year was reported in a different upstream table. This is especially important when interpreting association tables. A reported relationship seen in one raw data source may be incomplete, outdated, or inconsistent with other sources. PUDL's goal is not to preserve every raw inconsistency in the entity tables, but to produce a coherent cross-table representation that works well for analysis. Improving PUDL Entity Resolution ------------------------------------------------------------------------------- The entity resolution process is heuristic and can definitely be improved. * If there are particular columns that you think would benefit from a domain specific consistency metric, please `open a data issue on GitHub `__ or email us at `hello@catalyst.coop `__! * If you see any static entity attributes that should actually be allowed to vary from year to year, it's easy for us to move them from the entity table to the SCD table. .. caution:: We have **NOT** yet applied this process to the :doc:`EIA-861 ` tables, so they still reflect the original, internally inconsistent reporting, and any utilities which only appear in the EIA-861 data do not yet show up in the utility entity tables. Related Tables And Source Documentation ------------------------------------------------------------------------------- To understand the source data that feed entity resolution, see: * :doc:`/data_sources/eia860` * :doc:`/data_sources/eia923` For the metadata and code that implement this logic, see: * :py:const:`pudl.metadata.resources.ENTITIES` * :mod:`pudl.transform.eia` * :func:`pudl.transform.eia.harvested_entity_asset_factory` * :func:`pudl.transform.eia.harvest_entity_tables` * :func:`pudl.transform.eia._compile_all_entity_records` * :func:`pudl.transform.eia.occurrence_consistency`