pudl.scripts.zenodo_data_release
Upload a prepared PUDL data release directory to Zenodo.
The PUDL data release process produces a directory of artifacts (zipped Parquet files, SQLite databases, JSON metadata, logs, etc.) that are uploaded to CERN’s Zenodo data repository for long-term archival access. Each new versioned release of PUDL is associated with the same original PUDL concept DOI.
This module provides a CLI that handles the process of uploading a new PUDL data release to Zenodo, given a prepared directory of artifacts typically produced by the PUDL builds.
It uses state objects to ensure that Zenodo API calls happen in a valid order. The files
to upload are read using fsspec and remote files are staged locally one at a time
so uploads can be retried, but without using excessive local disk space.
Retries are implemented for all upload requests to recover from transient network issues and Zenodo server flakiness. Zero-byte uploads are prevented.
NOTE: PUDL nightly build outputs are NOT suitable for producing a Zenodo data release unless the Parquet outputs are filtered out with an appropriate ignore_regex. Double check what files should actually be distributed before running the script.
Run zenodo_data_release --help for CLI usage instructions.
Attributes
Classes
_LegacyLinks | Pydantic model (subclass of pydantic.BaseModel).
_LegacyMetadata | Pydantic model (subclass of pydantic.BaseModel).
_LegacyDeposition | Pydantic model (subclass of pydantic.BaseModel).
_NewFile | Pydantic model (subclass of pydantic.BaseModel).
_NewRecord | Pydantic model (subclass of pydantic.BaseModel).
ZenodoClient | Thin wrapper over Zenodo REST API.
State | Parent class for dataset states.
InitialDataset | Represent initial dataset state.
EmptyDraft | We can only sync the directory once we've gotten an empty draft.
ContentComplete | Now that we've uploaded all the data, we need to update metadata.
| Now that we've uploaded all the data, we can publish.
Functions
| Publish a new PUDL data release to Zenodo.
Module Contents
- class pudl.scripts.zenodo_data_release._LegacyLinks(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
A base class for creating Pydantic models.
- __pydantic_complete__[source]¶
Whether model building is completed, or if there are still undefined fields.
- __pydantic_decorators__[source]¶
Metadata containing the decorators defined on the model. This replaces Model.__validators__ and Model.__root_validators__ from Pydantic V1.
- __pydantic_generic_metadata__[source]¶
Metadata for generic models; contains data used for a similar purpose to __args__, __origin__, __parameters__ in typing-module generics. May eventually be replaced by these.
- __pydantic_parent_namespace__[source]¶
Parent namespace of the model, used for automatic rebuilding of models.
- __pydantic_serializer__[source]¶
The pydantic-core SchemaSerializer used to dump instances of the model.
- __pydantic_validator__[source]¶
The pydantic-core SchemaValidator used to validate instances of the model.
- __pydantic_fields__[source]¶
A dictionary of field names and their corresponding pydantic.fields.FieldInfo objects.
- __pydantic_computed_fields__[source]¶
A dictionary of computed field names and their corresponding pydantic.fields.ComputedFieldInfo objects.
- __pydantic_extra__[source]¶
A dictionary containing extra values, if pydantic.config.ConfigDict.extra is set to 'allow'.
- bucket: pydantic.AnyHttpUrl[source]¶
- class pudl.scripts.zenodo_data_release._LegacyMetadata(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
- class pudl.scripts.zenodo_data_release._LegacyDeposition(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
- links: _LegacyLinks[source]¶
- metadata: _LegacyMetadata[source]¶
- class pudl.scripts.zenodo_data_release._NewFile(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
- class pudl.scripts.zenodo_data_release._NewRecord(/, **data: Any)[source]¶
Bases: pydantic.BaseModel
- class pudl.scripts.zenodo_data_release.ZenodoClient(env: str)[source]¶
Thin wrapper over Zenodo REST API.
Mostly uses legacy API calls (https://developers.zenodo.org/) (archive: https://web.archive.org/web/20231212025359/https://developers.zenodo.org/). Because the legacy API behaves inconsistently on the sandbox environment, we also need some of the unreleased new API endpoints: https://inveniordm.docs.cern.ch/reference/rest_api_drafts_records/
- retry_request(*, method, url, max_tries: int = 6, request_timeout: float | None = None, data_factory: collections.abc.Callable[[], IO[bytes]] | None = None, **kwargs) requests.Response[source]¶
Retry calls to requests.request with exponential backoff.
- Parameters:
method – HTTP method to use for the request (e.g. GET).
url – Fully-qualified URL to which the request is sent.
max_tries – Maximum number of attempts before surfacing an error.
request_timeout – Optional per-request timeout in seconds. When None the timeout grows exponentially (2**attempt).
data_factory – Optional callable that yields a fresh binary stream for each attempt. Useful for uploads that require reopening a file-like object.
**kwargs – Additional keyword arguments passed through directly to requests.request.
- Returns:
The requests.Response produced by the successful attempt.
- Raises:
requests.RequestException – If all attempts fail with a requests error.
OSError – If reading from disk fails when preparing a payload.
RuntimeError – If no response object is produced (should be rare).
- get_deposition(deposition_id: int) _LegacyDeposition[source]¶
LEGACY API: Get JSON describing a deposition.
Depositions can be published or unpublished.
- get_record(record_id: int) _NewRecord[source]¶
NEW API: Get JSON describing a record.
All records are published records.
- new_record_version(record_id: int) _NewRecord[source]¶
NEW API: get or create the draft associated with a record ID.
Finds the latest record in the concept that record_id points to, and makes a new version unless one exists already.
- update_deposition_metadata(deposition_id: int, metadata: _LegacyMetadata) _LegacyDeposition[source]¶
LEGACY API: Update deposition metadata.
Replaces the existing metadata completely - so make sure to pass in complete metadata. You cannot update metadata fields one at a time.
- delete_deposition_file(deposition_id: int, file_id) requests.Response[source]¶
LEGACY API: Delete file from deposition.
Note: file_id is not always the file name.
- create_bucket_file(bucket_url: pydantic.AnyHttpUrl, file_path: pathlib.Path, max_tries: int = 6) requests.Response[source]¶
LEGACY API: Upload a file to a deposition’s file bucket.
We prefer this API over the /deposit/depositions/{id}/files endpoint because it allows for files >100MB.
- Parameters:
bucket_url – Upload destination returned by Zenodo for the draft.
file_path – Local path to the artifact being uploaded.
max_tries – Maximum number of upload attempts before failing.
- Returns:
The requests.Response from the successful upload attempt.
- Raises:
ValueError – If file_path is empty.
requests.RequestException – If all upload attempts fail.
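As a rough illustration of the zero-byte guard and of why a reopenable payload matters for retries, the sketch below checks the file size before anything is sent and returns a factory that opens a fresh stream per attempt. make_upload_factory is a hypothetical helper, not part of the module, and the HTTP upload itself is deliberately left out.

```python
import tempfile
from collections.abc import Callable
from pathlib import Path
from typing import IO


def make_upload_factory(file_path: Path) -> Callable[[], IO[bytes]]:
    """Return a callable that reopens ``file_path`` for each upload attempt."""
    # Refuse empty files up front so a zero-byte artifact is never uploaded.
    if file_path.stat().st_size == 0:
        raise ValueError(f"Refusing to upload empty file: {file_path}")
    return lambda: file_path.open("rb")


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "data.bin"
    good.write_bytes(b"payload")
    empty = Path(tmp) / "empty.log"
    empty.touch()

    factory = make_upload_factory(good)
    with factory() as first, factory() as second:
        # Each call yields an independent stream positioned at byte 0.
        first_read, second_read = first.read(), second.read()

    try:
        make_upload_factory(empty)
        empty_rejected = False
    except ValueError:
        empty_rejected = True
```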
- publish_deposition(deposition_id: int) _LegacyDeposition[source]¶
LEGACY API: publish deposition.
- class pudl.scripts.zenodo_data_release.State[source]¶
Parent class for dataset states.
Provides an abstraction layer that hides Zenodo’s data model from the caller.
Subclasses + their limited method definitions provide a way to avoid calling the operations in the wrong order.
- zenodo_client: ZenodoClient[source]¶
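The idea of using state subclasses to prevent out-of-order operations can be sketched as follows. This is an illustration of the pattern only: FakeClient, Initial, Empty, and Uploaded are hypothetical stand-ins that record calls instead of talking to Zenodo, and their signatures are simplified.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClient:
    """Stand-in for the real client; it only records which calls were made."""

    calls: list[str] = field(default_factory=list)


@dataclass
class Initial:
    client: FakeClient

    def get_empty_draft(self) -> "Empty":
        self.client.calls.append("get_empty_draft")
        return Empty(self.client)


@dataclass
class Empty:
    client: FakeClient

    def sync_directory(self) -> "Uploaded":
        self.client.calls.append("sync_directory")
        return Uploaded(self.client)


@dataclass
class Uploaded:
    client: FakeClient

    def update_metadata(self) -> None:
        self.client.calls.append("update_metadata")


client = FakeClient()
# Each method returns the next state, so the only way forward is in order.
Initial(client).get_empty_draft().sync_directory().update_metadata()

# The initial state simply has no upload method to call by mistake.
can_skip_ahead = hasattr(Initial(client), "sync_directory")
```

Because each state exposes only the transition that is valid from it, a caller cannot upload files before obtaining an empty draft without first going through the earlier step.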
- class pudl.scripts.zenodo_data_release.InitialDataset[source]¶
Bases: State
Represent initial dataset state.
At this point, we don’t know if there is an existing draft or not - the only thing we can do is try to get a fresh draft.
- get_empty_draft() EmptyDraft[source]¶
Get an empty draft for this dataset.
Use new API to get any draft, then use legacy API to delete any files in the draft.
- class pudl.scripts.zenodo_data_release.EmptyDraft[source]¶
Bases: State
We can only sync the directory once we’ve gotten an empty draft.
- static _sync_local_path(openable_file: fsspec.core.OpenFile, staging_dir: pathlib.Path) pathlib.Path[source]¶
Ensure the given fsspec file exists on the local filesystem.
When openable_file already resides on the local filesystem we avoid copying and return its existing path. Remote files are downloaded into staging_dir (a shared temporary directory) so the rest of the upload pipeline can treat every artifact as a simple Path without caring where it came from.
- Parameters:
openable_file – fsspec handle pointing to the source artifact.
staging_dir – Directory used to cache remote files locally.
- Returns:
A Path pointing to a readable local copy of openable_file.
- sync_directory(source_dir: str, ignore: tuple[str]) ContentComplete[source]¶
Upload every file in source_dir to the draft bucket.
The method enumerates files (not subdirectories) via fsspec so the source can live on local disk, GCS, S3, etc. Remote objects are first staged into a temporary directory to ensure uploads always come from local Path objects that can be rewound for retries. Regex patterns provided via ignore are applied to the full path of each candidate file, allowing us to drop logs, intermediate data, or other nightlies-only artifacts before hitting Zenodo.
- Parameters:
source_dir – Directory (local or remote) whose contents will be sent to Zenodo.
ignore – Tuple of regex patterns; any path matching one is skipped.
- Returns:
A ContentComplete state ready for metadata updates.
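A minimal sketch of the ignore filtering follows. filter_ignored is a hypothetical helper, the bucket paths are made up, and the use of re.search (so a pattern may match anywhere in the full path) is an assumption; the actual method may match differently.

```python
import re


def filter_ignored(paths: list[str], ignore: tuple[str, ...]) -> list[str]:
    """Keep only the paths that match none of the ``ignore`` regexes."""
    compiled = [re.compile(pattern) for pattern in ignore]
    return [
        path
        for path in paths
        # ``search`` (not ``match``) lets a pattern hit anywhere in the path.
        if not any(regex.search(path) for regex in compiled)
    ]


# Hypothetical listing of a prepared release directory.
candidates = [
    "gs://example-bucket/release/pudl.sqlite.zip",
    "gs://example-bucket/release/pudl-etl.log",
    "gs://example-bucket/release/example_table.parquet",
]
kept = filter_ignored(candidates, ignore=(r"\.log$", r"\.parquet$"))
kept_without_patterns = filter_ignored(candidates, ignore=())
```

Running a filter like this against the directory listing before a real upload is a cheap way to double check which files would actually be distributed.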
- class pudl.scripts.zenodo_data_release.ContentComplete[source]¶
Bases: State
Now that we’ve uploaded all the data, we need to update metadata.
- update_metadata()[source]¶
Copy over old metadata and update publication date.
We need to make sure there is complete metadata, including a publication date.
To do this, we:
1. use the legacy API to get the concept record ID associated with the draft
2. use the new API to get the latest record associated with the concept
3. use the legacy API to get the metadata from the latest record
4. use the legacy API to update the draft’s metadata
Since we are using the legacy API to publish, we need the legacy metadata format. But the legacy concept DOI -> published record mapping is broken, so we have to take a detour through the new API.
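The "copy over old metadata and update publication date" step can be sketched on plain dictionaries. refresh_metadata is a hypothetical helper and the sample metadata is invented; the sketch assumes the legacy metadata carries a publication_date field holding an ISO 8601 date string, and it skips the API calls that fetch and store the metadata.

```python
import copy
import datetime


def refresh_metadata(previous: dict, today: datetime.date) -> dict:
    """Copy the prior release's metadata and stamp a new publication date."""
    updated = copy.deepcopy(previous)  # never mutate the fetched record
    updated["publication_date"] = today.isoformat()
    return updated


# Invented metadata standing in for what the latest published record returns.
old = {
    "title": "Example release",
    "publication_date": "2024-01-01",
    "creators": [{"name": "Example Author"}],
}
new = refresh_metadata(old, datetime.date(2025, 2, 3))
```

Because the legacy update call replaces metadata wholesale, starting from a full copy of the previous release's metadata keeps every other field intact.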