Entity Resolution#

Overview#

Many of the datasets PUDL processes report the same information about individual entities in many different places. This is usually done for readability. For example, it’s nice to list the names of plants or utilities instead of just their IDs. This makes the original data easier for users to read but it also introduces the potential for internal inconsistencies (entity or referential integrity issues) which create problems when you’re trying to re-use the data in other applications.

PUDL attempts to identify canonical values (”golden records”) for entities from the potentially inconsistent original data through an entity resolution process (often called “entity harvesting” in the PUDL codebase).

For example, the same plant or generator may appear in many forms, worksheets, and years. Across those sources, attributes like plant name, associated balancing authority, geographic coordinates, or operating dates may be:

omitted in some tables,
reported with slightly different spellings or codes,
updated in one place but not another, or
associated with one year of reporting but not another.

If PUDL exposed every upstream table independently without reconciling those inconsistencies, many common analyses would require users to manually decide which value to trust for each entity and year. Instead, PUDL builds normalized entity tables that aim to provide a canonical record for each entity that includes the attributes that are expected to be stable over time and where appropriate, a yearly record of attributes that are expected to change over time.

The entity resolution process is applied most extensively to the EIA-860 and EIA-923 spreadsheet data, but it’s a more general issue that comes up throughout the data we process. This page explains the entity resolution process conceptually, why it exists, and what it means when the PUDL entity tables do not exactly mirror an individual raw EIA spreadsheet.

What Entity Resolution Produces#

For each entity type, PUDL typically creates two related tables:

A static entity table such as core_eia__entity_plants, with one row per entity and attributes that are expected to be stable over time.
A yearly slowly changing dimension (SCD) table such as core_eia860__scd_plants, with one row per entity per report year and attributes that are expected to vary slightly over time.

PUDL currently resolves four kinds of EIA entities, with these associated entity ID columns:

Utilities (utility_id_eia)
Plants (plant_id_eia)
Boilers (plant_id_eia, boiler_id)
Generators (plant_id_eia, generator_id)

Why A Canonical Record Is Necessary#

Users are sometimes surprised to see slight differences between the raw EIA spreadsheets and analogous PUDL data, but in most of the cases we’ve tracked down over the years, these differences are expected and intentional.

For example, an EIA plant may be named slightly differently across the plant table, generator table, ownership table, and emissions-control tables. One table might use an older balancing authority code. Another might omit latitude and longitude. A generator operating date might be reported consistently in most years, but differ in one source.

When this happens, PUDL does not treat every reported value as equally authoritative. Instead, it looks across all relevant upstream tables and determines which value was reported most consistently for this entity.

This is usually desirable. A normalized analytical database is more useful when it provides stable, reconciled identifiers and attributes rather than reproducing every reporting inconsistency exactly as filed.

How Entity Resolution Works in PUDL#

Where Entity Resolution Happens in the PUDL Data Pipeline#

Entity resolution sits in the middle of the EIA transformation pipeline.

First, dataset-specific transforms create a set of unnormalized _core tables. These tables are already cleaned and typed, but they still reflect the original structure of the source data. Then we scan those tables for entity IDs and their known attributes, to produce normalized entity tables.

The simplified flow for plants looks like this:

        flowchart LR
    raw[Raw EIA-860 and EIA-923 schedules]
    core1[_core_eia860__plants]
    core2[_core_eia860__generators]
    core3[_core_eia860__ownership]
    core4[_core_eia923__generation]
    core5[_core_eia923__fuel_receipts_costs]
    core6[_core_eia860__emissions_control_equipment]
    harvest[harvested_plants_eia]
    entity[core_eia__entity_plants]
    scd[core_eia860__scd_plants]
    outputs[out_eia__yearly_plants<br/>and other downstream outputs]

    raw --> core1
    raw --> core2
    raw --> core3
    raw --> core4
    raw --> core5
    raw --> core6
    core1 --> harvest
    core2 --> harvest
    core3 --> harvest
    core4 --> harvest
    core5 --> harvest
    core6 --> harvest
    harvest --> entity
    harvest --> scd
    entity --> outputs
    scd --> outputs

Conceptually, the same pattern is used for all four entity types. The full list of input tables for all EIA entities can be found in pudl.transform.eia.HARVESTABLE_ASSETS

How PUDL Identifies Existing Entities#

The first step is to determine which entities and entity-years exist anywhere in the upstream transformed EIA data. This is done in pudl.transform.eia._compile_all_entity_records().

To produce the annually varying SCD tables report_date is normalized to January 1 of the report year so that all annual records align even if an upstream table has a different temporal resolution.

This results in an interim dataframe containing every observed instance of the entity across all the upstream tables.

The next step is to derive the ID spaces for the entity and annual SCD tables.

The static entity table gets one row per unique entity ID.
The yearly SCD table gets one row per unique combination of entity_id and report_date (where entity_id is sometimes actually composed of multiple columns).

This means an entity-year can enter the yearly resolved table from any upstream asset that reports the entity ID column and report_date.

How PUDL Chooses A Canonical Value#

When there are multiple values reported for the same entity, in most cases, PUDL chooses the most consistent value reported so long as it is found in at least 70% of available entries. If no observed value occurs more than 70% of the time, PUDL fills in a null value.

The selection logic looks like this conceptually:

        flowchart TD
    a[All reported values for one attribute]
    b{Static or annual?}
    c[Group by entity ID]
    d[Group by entity ID and report_date]
    e[Count entity occurrences]
    f[Count occurrences of each candidate value]
    g{Any value exceeds strictness threshold?}
    h[Keep the most consistent value]
    i[Leave harvested value null]
    j[Merge into core_eia__entity_* or core_eia860__scd_*]

    a --> b
    b -->|Static| c
    b -->|Annual| d
    c --> e
    d --> e
    e --> f
    f --> g
    g -->|Yes| h
    g -->|No| i
    h --> j
    i --> j

Special Cases#

Some attributes are too messy to reconcile with the default consistency logic alone. We use different rules for columns with additional requirements:

Latitude and longitude are floating point values and are frequently not exactly equal, so 70% consistency is not attainable very often. In cases without a consistent value we round latitude and longitude to the nearest tenth of a degree and repeat the process.
Generator operating date has an unusual pattern of missingness that permits the most recently reported operating date to be reliable when 70% consistency cannot otherwise be reached.
We set the consistency threshold to 0% for a few columns so that we always get a value. These include plant name, utility name, and prime mover code.

Caution

Because we always choose the most consistent prime mover code, even when less than 50% of occurrences have that value, we occasionally see a tie between two equally consistent values. In these cases, the resolved value may not be deterministic. Because prime mover code is part of the primary key in some tables it is possible for some records to appear or disappear based on which value was chosen. However, this scenario is extremely rare (on the order of 1 in 100,000 records).

Interpreting Discrepancies#

If you compare a PUDL entity table, SCD table, or any table downstream of them to a raw EIA spreadsheet, you should expect to see several kinds of discrepancies:

A name or code may differ because PUDL selected the value that was most consistently reported across many tables and years.
A value may be null in PUDL because upstream reporting was too inconsistent to unambiguously choose a canonical value.
An entity-year may exist in PUDL but not in the spreadsheet you are viewing because the plant-year was reported in a different upstream table.

This is especially important when interpreting association tables. A reported relationship seen in one raw data source may be incomplete, outdated, or inconsistent with other sources. PUDL’s goal is not to preserve every raw inconsistency in the entity tables, but to produce a coherent cross-table representation that works well for analysis.

Improving PUDL Entity Resolution#

The entity resolution process is heuristic and can definitely be improved.

If there are particular columns that you think would benefit from a domain specific consistency metric, please open a data issue on GitHub or email us at hello@catalyst.coop!
If you see any static entity attributes that should actually be allowed to vary from year to year, it’s easy for us to move them from the entity table to the SCD table.

Caution

We have NOT yet applied this process to the EIA-861 tables, so they still reflect the original, internally inconsistent reporting, and any utilities which only appear in the EIA-861 data do not yet show up in the utility entity tables.