=======================================================================================
Data Access
=======================================================================================

We publish the PUDL pipeline outputs in several ways to serve different users and use
cases. We're always trying to increase the accessibility of the PUDL data, so if you
have a suggestion, please `open a GitHub issue `__. If you have a question, you can
`create a GitHub discussion `__.

We recommend working with tables with the ``out_`` prefix, as these tables contain the
most complete and easiest to work with data. For more information about the different
types of tables, read through :ref:`PUDL's naming conventions `.

---------------------------------------------------------------------------------------
Quick Reference
---------------------------------------------------------------------------------------

.. list-table::
   :widths: auto
   :header-rows: 1

   * - :ref:`Platform `
     - :ref:`Format `
     - :ref:`Version `
     - User Types
     - Use Cases
   * - :ref:`access-viewer`
     - Parquet, CSV
     - ``nightly``
     - Data Explorer, Spreadsheet Analyst, Jupyter Notebook User
     - Explore PUDL data interactively in a web browser, including hourly timeseries
       data. Select data to download as CSVs for local analysis in spreadsheets.
       Download full tables as Parquet files to work with programmatically.
   * - :ref:`access-kaggle`
     - SQLite, Parquet
     - ``nightly``
     - Data Scientist, Data Analyst, Jupyter Notebook User
     - Work with PUDL data products in Jupyter Notebooks via the web with minimal
       setup. Explore curated and contributed analyses and visualizations using PUDL
       data. Create and share your own interactive notebooks using PUDL data.
   * - :ref:`access-cloud`
     - SQLite, Parquet
     - ``nightly``, ``stable``
     - Data Scientist, Analytics Engineer, Data Engineer, Cloud Developer
     - Performant remote queries of clearly versioned PUDL Parquet outputs from cloud
       computing platforms or GitHub Actions. Fast bulk download of SQLite or Parquet
       outputs for local use. Parquet-based data warehouse for large-scale data
       analysis in the cloud. Integrates well with Pandas, DuckDB, and other dataframe
       libraries.
   * - :ref:`access-zenodo`
     - SQLite, Parquet
     - ``stable``
     - Researcher, Publisher, Archivist
     - Access a specific, immutable version of the PUDL data by DOI for citation in
       academic publications or other applications where long-term reproducibility is
       needed. Web-based bulk download of data for local analysis.

.. _access-modes:

---------------------------------------------------------------------------------------
How Should You Access PUDL Data?
---------------------------------------------------------------------------------------

To serve a wider variety of users, we provide several ways to access PUDL data. When
choosing an access method you'll want to consider:

- What tool or platform do you want to use to access the data?
- What data format are you most comfortable with?
- Which historical version of the data do you want?

.. _access-platform:

Data Platform
^^^^^^^^^^^^^

PUDL data is distributed on a number of different platforms to accommodate a variety
of different use cases. These include :ref:`access-viewer`, :ref:`access-kaggle`,
:ref:`access-cloud`, and :ref:`access-zenodo`.

.. _access-format:

Data Format
^^^^^^^^^^^

PUDL data is distributed in two main file formats:

- `SQLite `__: a self-contained relational database that holds many tables in a single
  file, supported by many programming languages and tools.
- `Apache Parquet `__: a compressed, columnar storage format in which each file stores
  a single table. Parquet supports rich data types and metadata, and is highly
  performant.

All data is distributed in both formats, except:

- **Parquet Only**: The hourly data tables are distributed only as Parquet files.
  These tables have ``hourly`` in their names.
- **SQLite Only**: The :ref:`minimally processed FERC data ` which we have converted
  from XBRL and DBF into SQLite is only available in SQLite.

All Parquet data is available through :ref:`access-viewer` for previewing. It can be
downloaded as CSV through that platform if you need to work with it in spreadsheets.
For programmatic use we **strongly recommend** accessing the Parquet files in S3
directly. See :ref:`access-cloud`.

All SQLite data can be downloaded from S3 (see :ref:`access-cloud`) or from our
regular versioned releases (see :ref:`access-zenodo`). We are `working on integrating
all converted FERC databases `__ into :ref:`access-viewer`.

.. _access-version:

Data Version
^^^^^^^^^^^^

We assign a version number to our quarterly data releases so they can be easily
identified. These versions are based on the date of publication. For example,
``v2024.11.0`` would be the first release of the data published in November 2024.
These are referred to as ``stable`` releases, and are archived for long-term access
and citation.

We also provide access to a ``nightly`` development build of the data, which is
updated most weekday mornings. These builds are useful for beta testing new outputs,
but are ephemeral and may not be as well validated as the ``stable`` releases.

.. _access-viewer:

---------------------------------------------------------------------------------------
PUDL Data Viewer
---------------------------------------------------------------------------------------

We recently released the `PUDL Data Viewer `__ in beta. It provides flexible search of
table metadata, live data previews with filtering and sorting, and CSV export of up to
5 million rows. It also links to the Parquet downloads for each table, which you can
view directly with tools like `Tad `__.

.. _access-kaggle:

---------------------------------------------------------------------------------------
Kaggle
---------------------------------------------------------------------------------------

Are you comfortable with Jupyter Notebooks? Want to explore a fresh version of all
available PUDL data without needing to do any environment setup? We provide several
`example notebooks on Kaggle `__ (they are also pushed to our `PUDL Examples Repo on
GitHub `__). These notebooks pull data directly from the Parquet outputs in S3 (see
below), so they can also be used locally if you're familiar with setting up a Python
environment and running Jupyter.

Our nightly build outputs also automatically update the `PUDL Kaggle dataset `__ once
a week. This dataset contains all the PUDL outputs, so it's quite large and can take a
few minutes to copy into your Kaggle notebook's private workspace, but once it's
copied, access will be fast.

.. _access-cloud:

---------------------------------------------------------------------------------------
Cloud Storage
---------------------------------------------------------------------------------------

All PUDL data products are freely available in the `AWS Open Data Registry `__,
including both ``stable`` and ``nightly`` outputs and multiple years of past stable
releases, in both SQLite and Parquet formats. The AWS S3 bucket is:

.. code-block:: bash

   s3://pudl.catalyst.coop

The same outputs are available in a similarly named "requester pays" Google Cloud
Storage bucket. However, you will need to authenticate with your GCP account. The GCS
bucket is:

.. code-block:: bash

   gs://pudl.catalyst.coop

SQLite databases must be downloaded for local use, but Parquet files can be queried
remotely using a number of different tools. Some examples below:

Pandas
^^^^^^

Using `Pandas read_parquet() `__

.. note::

   You will need to install pandas with the `extra cloud dependencies `__.

.. code-block:: python

   import pandas as pd

   # Outputs from the most recent nightly build:
   nightly_df = pd.read_parquet(
       "s3://pudl.catalyst.coop/nightly/core_eia__codes_energy_sources.parquet"
   )

   # Outputs from the most recent stable data release:
   stable_df = pd.read_parquet(
       "s3://pudl.catalyst.coop/stable/core_eia__codes_energy_sources.parquet"
   )

   # A specific stable version of the data:
   versioned_df = pd.read_parquet(
       "s3://pudl.catalyst.coop/v2024.11.0/core_eia__codes_energy_sources.parquet"
   )

DuckDB
^^^^^^

Using `DuckDB `__ and the `httpfs extension `__

.. code-block:: sql

   -- Install the httpfs extension once and it will be available in subsequent sessions
   INSTALL httpfs;
   SELECT * FROM read_parquet('s3://pudl.catalyst.coop/nightly/core_eia__codes_energy_sources.parquet');

Other Dataframe Libraries
^^^^^^^^^^^^^^^^^^^^^^^^^

Similar functionality exists for the `dplyr library in R `__, the `polars library in
Rust `__, and many other programmatic data analysis tools.

The AWS CLI
^^^^^^^^^^^

You can also use `the AWS CLI `__ to see what data is available and download it
locally. For example, to list the contents of the AWS S3 bucket and see what historic
versions are available:

.. code-block:: bash

   aws s3 ls --no-sign-request s3://pudl.catalyst.coop/

To list the contents of a particular version:

.. code-block:: bash

   aws s3 ls --no-sign-request s3://pudl.catalyst.coop/v2024.8.0/

And to download the full PUDL SQLite database from the nightly build outputs:

.. code-block:: bash

   aws s3 cp --no-sign-request s3://pudl.catalyst.coop/nightly/pudl.sqlite.zip .

Direct Links for Bulk Download
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The links below allow bulk download of the most recent ``nightly`` builds of all the
SQLite databases produced by PUDL, as well as their associated metadata in JSON.

Fully Processed SQLite Databases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* `Main PUDL Database `__ (~3GB)
* `US Census DP1 Database (2010) `__

.. _access-raw-ferc:

Raw FERC DBF & XBRL data converted to SQLite
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* FERC Form 1:

  * `FERC-1 SQLite derived from DBF (1994-2020) `__
  * `FERC-1 SQLite derived from XBRL (2021-2024) `__
  * `FERC-1 Datapackage (JSON) describing SQLite derived from XBRL `__
  * `FERC-1 XBRL Taxonomy Metadata as JSON (2021-2024) `__

* FERC Form 2:

  * `FERC-2 SQLite derived from DBF (1996-2020) `__
  * `FERC-2 SQLite derived from XBRL (2021-2024) `__
  * `FERC-2 Datapackage (JSON) describing SQLite derived from XBRL `__
  * `FERC-2 XBRL Taxonomy Metadata as JSON (2021-2024) `__

* FERC Form 6:

  * `FERC-6 SQLite derived from DBF (2000-2020) `__
  * `FERC-6 SQLite derived from XBRL (2021-2024) `__
  * `FERC-6 Datapackage (JSON) describing SQLite derived from XBRL `__
  * `FERC-6 XBRL Taxonomy Metadata as JSON (2021-2024) `__

* FERC Form 60:

  * `FERC-60 SQLite derived from DBF (2006-2020) `__
  * `FERC-60 SQLite derived from XBRL (2021-2024) `__
  * `FERC-60 Datapackage (JSON) describing SQLite derived from XBRL `__
  * `FERC-60 XBRL Taxonomy Metadata as JSON (2021-2024) `__

* FERC Form 714:

  * `FERC-714 SQLite derived from XBRL (2021-2023) `__
  * `FERC-714 Datapackage (JSON) describing SQLite derived from XBRL `__
  * `FERC-714 XBRL Taxonomy Metadata as JSON (2021-2023) `__

.. _access-zenodo:

---------------------------------------------------------------------------------------
Zenodo Archives
---------------------------------------------------------------------------------------

If you want a specific, immutable version of our data for any reason, you can find
them all `here on Zenodo `__. Zenodo assigns long-lived DOIs to each archive, suitable
for citation in academic journals and other publications. The most recent versioned
PUDL data release can always be found using this Concept DOI:
https://doi.org/10.5281/zenodo.3653158

From Zenodo you can download individual SQLite databases and a zipfile containing all
the Parquet files bundled together.
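Because the Concept DOI always resolves to the newest versioned record, you can also
discover the latest release programmatically. The sketch below uses Zenodo's public
REST API with only the Python standard library; the record ID ``3653158`` is the
numeric suffix of the Concept DOI above, and the response field names (``files``,
``key``, ``links``) are our assumptions about the API's JSON shape, not something
documented on this page:

```python
import json
import urllib.request

# Numeric suffix of the Concept DOI 10.5281/zenodo.3653158. Requesting this
# "concept" record from the Zenodo API resolves to the latest published version.
PUDL_CONCEPT_ID = "3653158"


def record_api_url(record_id: str) -> str:
    """Build the Zenodo REST API URL for a given record ID."""
    return f"https://zenodo.org/api/records/{record_id}"


def list_release_files(record_id: str) -> list[tuple[str, str]]:
    """Return (filename, download URL) pairs for the files in a Zenodo record."""
    with urllib.request.urlopen(record_api_url(record_id)) as resp:
        record = json.load(resp)
    # Assumed response shape: each file entry has a "key" (name) and a
    # "links"/"self" download URL.
    return [(f["key"], f["links"]["self"]) for f in record["files"]]


# Example usage (requires network access):
#   for name, url in list_release_files(PUDL_CONCEPT_ID):
#       print(name, url)
```

This is one way to pin a download script to "latest stable release" without hard-coding
a version number; for citations you should still record the version-specific DOI of the
record you actually used.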
The documentation for the latest stable build is `here `__. You can access the
documentation for a specific version by hovering over the version selector at the
bottom left of the page.

.. _access-raw:

---------------------------------------------------------------------------------------
Raw Data
---------------------------------------------------------------------------------------

Sometimes you want to see the raw data published by the government, but it's hard to
find or difficult to download, or you want to see what an older version of the
published data looked like before it was revised or deleted.

We use Zenodo to archive and version our raw data inputs. You can find all of our
archives in `the Catalyst Cooperative Community `__. These have been minimally
processed: in some cases we've compressed them or grouped them into ZIP archives to
fit the Zenodo repository requirements, and in all cases we've added some metadata to
help identify the resources you're looking for. Apart from that, these datasets are
unmodified.

.. _access-development:

---------------------------------------------------------------------------------------
Development Environment
---------------------------------------------------------------------------------------

If you want to run the PUDL data processing pipeline yourself from scratch, run the
software tests, or make changes to the source code, you'll need to set up our
development environment. This is a bit involved, so it has its
:doc:`own separate documentation `. Most users shouldn't need to do this, and will
probably find it easier to work with the pre-processed data via one of the other
access modes. But if you want to :doc:`contribute to the project `, please give it a
shot!