pudl.scripts.zenodo_data_release

Upload a prepared PUDL data release directory to Zenodo.

The PUDL data release process produces a directory of artifacts (zipped Parquet files, SQLite databases, JSON metadata, logs, etc.) that are uploaded to CERN’s Zenodo data repository for long-term archival access. Each new versioned release of PUDL is associated with the same original PUDL concept DOI.

This module provides a CLI that handles the process of uploading a new PUDL data release to Zenodo, given a prepared directory of artifacts typically produced by the PUDL builds.

It uses state objects to ensure that Zenodo API calls happen in a valid order. Files to upload are read with fsspec; remote files are staged locally one at a time so that uploads can be retried without using excessive local disk space.

Retries are implemented for all upload requests to recover from transient network issues and Zenodo server flakiness. Zero-byte uploads are prevented.

NOTE: PUDL nightly build outputs are NOT suitable for producing a Zenodo data release unless the Parquet outputs are filtered out with an appropriate ignore_regex. Double check what files should actually be distributed before running the script.

Run zenodo_data_release --help for CLI usage instructions.

Attributes

Classes

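_LegacyLinks

Pydantic model with html and bucket URL fields.

_LegacyMetadata

Pydantic model for legacy-API deposition metadata (title, creators, license, etc.).

_LegacyDeposition

Pydantic model for a legacy-API deposition: ID, concept record ID, and metadata.

_NewFile

Pydantic model for a file in a new-API record.

_NewRecord

Pydantic model for a new-API record and its list of files.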

ZenodoClient

Thin wrapper over Zenodo REST API.

State

Parent class for dataset states.

InitialDataset

Represent initial dataset state.

EmptyDraft

We can only sync the directory once we've gotten an empty draft.

ContentComplete

Now that we've uploaded all the data, we need to update metadata.

CompleteDraft

Now that we've uploaded all the data, we can publish.

Functions

pudl_zenodo_data_release(env, source_dir, publish, ignore)

Publish a new PUDL data release to Zenodo.

Module Contents

pudl.scripts.zenodo_data_release.SANDBOX = 'sandbox'
pudl.scripts.zenodo_data_release.PRODUCTION = 'production'
pudl.scripts.zenodo_data_release.RETRYABLE_STATUS_CODES
pudl.scripts.zenodo_data_release.logger

class pudl.scripts.zenodo_data_release._LegacyLinks(/, **data: Any)

Bases: pydantic.BaseModel

A base class for creating Pydantic models.

__class_vars__

The names of the class variables defined on the model.

__private_attributes__

Metadata about the private attributes of the model.

__signature__

The synthesized __init__ signature (inspect.Signature) of the model.

__pydantic_complete__

Whether model building is completed, or if there are still undefined fields.

__pydantic_core_schema__

The core schema of the model.

__pydantic_custom_init__

Whether the model has a custom __init__ function.

__pydantic_decorators__

Metadata containing the decorators defined on the model. This replaces Model.__validators__ and Model.__root_validators__ from Pydantic V1.

__pydantic_generic_metadata__

Metadata for generic models; contains data used for a similar purpose to __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.

__pydantic_parent_namespace__

Parent namespace of the model, used for automatic rebuilding of models.

__pydantic_post_init__

The name of the post-init method for the model, if defined.

__pydantic_root_model__

Whether the model is a RootModel (pydantic.root_model.RootModel).

__pydantic_serializer__

The pydantic-core SchemaSerializer used to dump instances of the model.

__pydantic_validator__

The pydantic-core SchemaValidator used to validate instances of the model.

__pydantic_fields__

A dictionary of field names and their corresponding FieldInfo (pydantic.fields.FieldInfo) objects.

__pydantic_computed_fields__

A dictionary of computed field names and their corresponding ComputedFieldInfo (pydantic.fields.ComputedFieldInfo) objects.

__pydantic_extra__

A dictionary containing extra values, if extra (pydantic.config.ConfigDict.extra) is set to 'allow'.

__pydantic_fields_set__

The names of fields explicitly set during instantiation.

__pydantic_private__

Values of private attributes set on the model instance.

html: pydantic.AnyHttpUrl
bucket: pydantic.AnyHttpUrl
class pudl.scripts.zenodo_data_release._LegacyMetadata(/, **data: Any)

Bases: pydantic.BaseModel

upload_type: str = 'dataset'
title: str
access_right: str
creators: list[dict]
license: str = 'cc-by-4.0'
publication_date: str = ''
description: str = ''
class pudl.scripts.zenodo_data_release._LegacyDeposition(/, **data: Any)

Bases: pydantic.BaseModel

id_: int = None
conceptrecid: int
metadata: _LegacyMetadata
class pudl.scripts.zenodo_data_release._NewFile(/, **data: Any)

Bases: pydantic.BaseModel

id_: str = None
class pudl.scripts.zenodo_data_release._NewRecord(/, **data: Any)

Bases: pydantic.BaseModel

id_: int = None
files: list[_NewFile]
class pudl.scripts.zenodo_data_release.ZenodoClient(env: str)

Thin wrapper over Zenodo REST API.

Most calls use the legacy API (https://developers.zenodo.org/; archive: https://web.archive.org/web/20231212025359/https://developers.zenodo.org/). Because the legacy API behaves inconsistently on the sandbox environment, some of the unreleased new API endpoints are needed too: https://inveniordm.docs.cern.ch/reference/rest_api_drafts_records/

auth_headers
retry_request(*, method, url, max_tries: int = 6, request_timeout: float | None = None, data_factory: collections.abc.Callable[[], IO[bytes]] | None = None, **kwargs) → requests.Response

Retry calls to requests.request with exponential backoff.

Parameters:
  • method – HTTP method to use for the request (e.g. GET).

  • url – Fully-qualified URL to which the request is sent.

  • max_tries – Maximum number of attempts before surfacing an error.

  • request_timeout – Optional per-request timeout in seconds. When None the timeout grows exponentially (2**attempt).

  • data_factory – Optional callable that yields a fresh binary stream for each attempt. Useful for uploads that require reopening a file-like object.

  • **kwargs – Additional keyword arguments passed through directly to requests.request.

Returns:

The requests.Response produced by the successful attempt.

Raises:
  • requests.RequestException – If all attempts fail with a requests error.

  • OSError – If reading from disk fails when preparing a payload.

  • RuntimeError – If no response object is produced (should be rare).
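The retry behaviour described above can be sketched without any network access. This is a minimal illustration, not the module's implementation: the helper name `retry_with_backoff`, the injected `send` and `sleep` callables, the particular set of retryable status codes, and the choice to sleep `2**attempt` seconds between attempts are all assumptions made for the example. It shows the two ideas the docstring names: a timeout that grows as `2**attempt` when none is given, and a `data_factory` that is called again for every attempt so each retry reads the payload from the start.

```python
from __future__ import annotations

import time
from collections.abc import Callable
from typing import IO


def retry_with_backoff(
    send: Callable[[IO[bytes] | None, float], int],
    *,
    max_tries: int = 6,
    request_timeout: float | None = None,
    data_factory: Callable[[], IO[bytes]] | None = None,
    # Assumed set of retryable codes; the real RETRYABLE_STATUS_CODES may differ.
    retryable: frozenset[int] = frozenset({429, 502, 503, 504}),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call ``send`` until it returns a status code that is not retryable."""
    for attempt in range(1, max_tries + 1):
        # Without a fixed timeout, allow each attempt twice as long as the last.
        timeout = request_timeout if request_timeout is not None else 2**attempt
        # Reopen the payload so every attempt starts reading at byte zero.
        payload = data_factory() if data_factory is not None else None
        try:
            status = send(payload, timeout)
        except ConnectionError:
            if attempt == max_tries:
                raise  # out of attempts: surface the last error
        else:
            if status not in retryable or attempt == max_tries:
                return status
        finally:
            if payload is not None:
                payload.close()
        sleep(2**attempt)  # back off before the next attempt
    raise RuntimeError("No response was produced")
```

Passing a factory rather than an open stream matters because a stream that was partly consumed by a failed attempt would otherwise resume mid-file on the retry.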

get_deposition(deposition_id: int) → _LegacyDeposition

LEGACY API: Get JSON describing a deposition.

Depositions can be published or unpublished.

get_record(record_id: int) → _NewRecord

NEW API: Get JSON describing a record.

All records are published records.

new_record_version(record_id: int) → _NewRecord

NEW API: get or create the draft associated with a record ID.

Finds the latest record in the concept that record_id points to, and makes a new version unless one exists already.

update_deposition_metadata(deposition_id: int, metadata: _LegacyMetadata) → _LegacyDeposition

LEGACY API: Update deposition metadata.

Replaces the existing metadata completely - so make sure to pass in complete metadata. You cannot update metadata fields one at a time.

delete_deposition_file(deposition_id: int, file_id) → requests.Response

LEGACY API: Delete file from deposition.

Note: file_id is not always the file name.

create_bucket_file(bucket_url: pydantic.AnyHttpUrl, file_path: pathlib.Path, max_tries: int = 6) → requests.Response

LEGACY API: Upload a file to a deposition’s file bucket.

We prefer this API over the /deposit/depositions/{id}/files endpoint because it allows for files >100MB.

Parameters:
  • bucket_url – Upload destination returned by Zenodo for the draft.

  • file_path – Local path to the artifact being uploaded.

  • max_tries – Maximum number of upload attempts before failing.

Returns:

The requests.Response from the successful upload attempt.

Raises:
  • ValueError – If the file at file_path is empty (zero bytes).

  • requests.RequestException – If all upload attempts fail.
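A minimal sketch of the zero-byte guard and of how an upload target could be derived from the bucket URL. Both helper names (`require_non_empty`, `bucket_file_url`) are hypothetical, and the URL layout (bucket URL followed by the file name) is an assumption based on the bucket upload pattern in Zenodo's developer documentation, not a statement of what this method does internally.

```python
from pathlib import Path


def require_non_empty(file_path: Path) -> int:
    """Return the file size in bytes, refusing files with no content."""
    size = file_path.stat().st_size
    if size == 0:
        raise ValueError(f"Refusing to upload zero-byte file: {file_path}")
    return size


def bucket_file_url(bucket_url: str, file_path: Path) -> str:
    """Build the assumed upload target: the bucket URL plus the file name."""
    return f"{bucket_url.rstrip('/')}/{file_path.name}"
```

Checking the size before any request is sent means an empty artifact fails fast instead of consuming retry attempts.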

publish_deposition(deposition_id: int) → _LegacyDeposition

LEGACY API: publish deposition.

class pudl.scripts.zenodo_data_release.State

Parent class for dataset states.

Provides an abstraction layer that hides Zenodo’s data model from the caller.

Subclasses + their limited method definitions provide a way to avoid calling the operations in the wrong order.
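The idea of using subclasses with limited methods to enforce call order can be sketched as follows. The `Sketch*` classes are hypothetical stand-ins that mirror the state names documented here; they make no API calls, and the assumption that `update_metadata` returns the publishable state is inferred from the class descriptions rather than confirmed by the source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchState:
    """Data carried from one state to the next."""

    record_id: int
    history: tuple[str, ...] = ()


class SketchInitialDataset(SketchState):
    # The only operation available at the start is getting an empty draft.
    def get_empty_draft(self) -> SketchEmptyDraft:
        return SketchEmptyDraft(self.record_id, self.history + ("emptied draft",))


class SketchEmptyDraft(SketchState):
    def sync_directory(self, source_dir: str) -> SketchContentComplete:
        step = f"synced {source_dir}"
        return SketchContentComplete(self.record_id, self.history + (step,))


class SketchContentComplete(SketchState):
    def update_metadata(self) -> SketchCompleteDraft:
        return SketchCompleteDraft(self.record_id, self.history + ("updated metadata",))


class SketchCompleteDraft(SketchState):
    # publish() exists only here, so it cannot be called on an earlier state.
    def publish(self) -> None:
        print(f"published record {self.record_id} after: {', '.join(self.history)}")
```

Because each method returns the next state, the only way to obtain an object with a `publish` method is to go through every earlier step in order.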

record_id: int
zenodo_client: ZenodoClient
class pudl.scripts.zenodo_data_release.InitialDataset

Bases: State

Represent initial dataset state.

At this point, we don’t know if there is an existing draft or not - the only thing we can do is try to get a fresh draft.

get_empty_draft() → EmptyDraft

Get an empty draft for this dataset.

Use new API to get any draft, then use legacy API to delete any files in the draft.

class pudl.scripts.zenodo_data_release.EmptyDraft

Bases: State

We can only sync the directory once we’ve gotten an empty draft.

static _sync_local_path(openable_file: fsspec.core.OpenFile, staging_dir: pathlib.Path) → pathlib.Path

Ensure the given fsspec file exists on the local filesystem.

When openable_file already resides on the local filesystem we avoid copying and return its existing path. Remote files are downloaded into staging_dir (a shared temporary directory) so the rest of the upload pipeline can treat every artifact as a simple Path without caring where it came from.

Parameters:
  • openable_file – fsspec handle pointing to the source artifact.

  • staging_dir – Directory used to cache remote files locally.

Returns:

A Path pointing to a readable local copy of openable_file.

sync_directory(source_dir: str, ignore: tuple[str]) → ContentComplete

Upload every file in source_dir to the draft bucket.

The method enumerates files (not subdirectories) via fsspec so the source can live on local disk, GCS, S3, etc. Remote objects are first staged into a temporary directory to ensure uploads always come from local Path objects that can be rewound for retries. Regex patterns provided via ignore are applied to the full path of each candidate file, allowing us to drop logs, intermediate data, or other nightlies-only artifacts before hitting Zenodo.

Parameters:
  • source_dir – Directory (local or remote) whose contents will be sent to Zenodo.

  • ignore – Tuple of regex patterns; any path matching one is skipped.

Returns:

A ContentComplete state ready for metadata updates.
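The ignore filtering can be illustrated with the standard `re` module. This is a sketch under assumptions: the helper name `filter_ignored` is hypothetical, the example paths are made up, and it uses `re.search` so that a pattern may match anywhere in the full path, whereas the actual method may apply its patterns differently.

```python
import re


def filter_ignored(paths: list[str], ignore: tuple[str, ...]) -> list[str]:
    """Drop every path that matches at least one ignore pattern."""
    patterns = [re.compile(pattern) for pattern in ignore]
    return [
        path
        for path in paths
        # search() tests the pattern against the whole path string.
        if not any(pattern.search(path) for pattern in patterns)
    ]


# Hypothetical listing of a build output directory.
candidates = [
    "gs://example-bucket/build/pudl.sqlite.zip",
    "gs://example-bucket/build/out_table.parquet",
    "gs://example-bucket/build/build.log",
]
kept = filter_ignored(candidates, (r"\.parquet$", r"\.log$"))
```

Here `kept` contains only the SQLite archive, because the other two paths each match one of the patterns.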

class pudl.scripts.zenodo_data_release.ContentComplete

Bases: State

Now that we’ve uploaded all the data, we need to update metadata.

update_metadata()

Copy over old metadata and update publication date.

We need to make sure there is complete metadata, including a publication date.

To do this, we:

  1. use the legacy API to get the concept record ID associated with the draft

  2. use the new API to get the latest record associated with the concept

  3. use the legacy API to get the metadata from the latest record

  4. use the legacy API to update the draft’s metadata

Since we are using the legacy API to publish, we need the legacy metadata format. But the legacy concept DOI -> published record mapping is broken, so we have to take a detour through the new API.

class pudl.scripts.zenodo_data_release.CompleteDraft

Bases: State

Now that we’ve uploaded all the data, we can publish.

publish() → None

Publish the draft.

get_html_url()

A URL for viewing this draft.

pudl.scripts.zenodo_data_release.pudl_zenodo_data_release(env: str, source_dir: str, publish: bool, ignore: tuple[str])

Publish a new PUDL data release to Zenodo.