Description
Problem
There is no unified way to track all dataset versions across GCS and Hugging Face. Version information is scattered — GCS uses blob metadata, HF uses git tags — and there's no single source of truth that maps a semver string to the specific GCS generation numbers and HF commit SHAs for that release. This makes rollback discovery difficult and prevents consumers from programmatically querying the current data version.
Solution
Introduce a version registry (version_manifest.json) that:
- Lives on both GCS and HF as a single file containing all version entries
- Maps each semver version to its GCS generation numbers and HF commit SHA
- Provides a current pointer to the latest deployed version
- Supports rollback by treating it as a new release with special_operation="roll-back" metadata
- Exposes a public consumer API (get_data_version(), get_data_manifest()) that fetches the registry from HF without credentials
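As a rough illustration of how a consumer might resolve the current pointer once the registry JSON has been fetched, here is a minimal sketch. The JSON layout, the nested "commit" and "generations" keys, the repo and bucket names, and the find_version() helper are all assumptions made for this example; they are not the actual get_data_version() or get_data_manifest() implementation.

```python
import json
from typing import Optional

# Hypothetical registry contents. Top-level fields follow the issue's type
# hierarchy; nested key names and all values are placeholders.
EXAMPLE_REGISTRY = json.dumps({
    "current": "1.1.0",
    "versions": [
        {
            "version": "1.0.0",
            "created_at": "2025-01-01T00:00:00Z",
            "hf": {"repo": "example-org/example-data", "commit": "aaa111"},
            "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1001}},
        },
        {
            "version": "1.1.0",
            "created_at": "2025-02-01T00:00:00Z",
            "hf": {"repo": "example-org/example-data", "commit": "bbb222"},
            "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1002}},
        },
    ],
})


def find_version(registry: dict, version: Optional[str] = None) -> dict:
    """Return the entry for `version`, or for the `current` pointer if omitted."""
    wanted = version if version is not None else registry["current"]
    for entry in registry["versions"]:
        if entry["version"] == wanted:
            return entry
    raise KeyError(f"version {wanted!r} not found in registry")


registry = json.loads(EXAMPLE_REGISTRY)
current = find_version(registry)
print(current["version"], current["hf"]["commit"], current["gcs"]["generations"])
```

Because every version lives in one file, a single fetch is enough to answer both "what is current?" and "which generations and commit belong to version X?".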
Key design decisions
- Single registry file: all versions in one version_manifest.json, not per-version blobs
- Backend separation: upload_manifest() orchestrates, delegating to _upload_registry_to_gcs() and _upload_registry_to_hf()
- Rollback-as-release: rolling back copies old data to new GCS generations and HF commits, then publishes a new version entry
- Consumer API at package level: from policyengine_us_data import get_data_version, get_data_manifest
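The rollback-as-release decision can be sketched as a pure function over the registry. This is only an illustration under stated assumptions: publish_rollback() and its parameters are hypothetical names, the registry is modelled as a plain dict, and the data copy to GCS and HF is assumed to have already happened, with the resulting generation numbers and commit SHA passed in.

```python
from copy import deepcopy
from datetime import datetime, timezone


def publish_rollback(registry, target_version, new_version,
                     new_hf_commit, new_generations):
    """Return a new registry with a roll-back entry appended and `current` moved.

    Hypothetical helper: `new_hf_commit` and `new_generations` identify the
    fresh copies of the old data, not the original artifacts.
    """
    known = {entry["version"]: entry for entry in registry["versions"]}
    if target_version not in known:
        raise ValueError(f"cannot roll back to unknown version {target_version!r}")
    if new_version in known:
        raise ValueError(f"version {new_version!r} already exists")

    target = known[target_version]
    entry = {
        "version": new_version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "hf": {"repo": target["hf"]["repo"], "commit": new_hf_commit},
        "gcs": {"bucket": target["gcs"]["bucket"],
                "generations": dict(new_generations)},
        "special_operation": "roll-back",
        "roll_back_version": target_version,
    }
    updated = deepcopy(registry)  # leave the caller's registry untouched
    updated["versions"].append(entry)
    updated["current"] = new_version
    return updated


# Placeholder data for illustration only.
before = {
    "current": "1.1.0",
    "versions": [
        {"version": "1.0.0", "created_at": "2025-01-01T00:00:00Z",
         "hf": {"repo": "example-org/example-data", "commit": "aaa111"},
         "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1001}}},
        {"version": "1.1.0", "created_at": "2025-02-01T00:00:00Z",
         "hf": {"repo": "example-org/example-data", "commit": "bbb222"},
         "gcs": {"bucket": "example-bucket", "generations": {"data.h5": 1002}}},
    ],
}
after = publish_rollback(before, "1.0.0", "1.1.1", "ccc333", {"data.h5": 1003})
print(after["current"], after["versions"][-1]["roll_back_version"])
```

Treating a rollback as an ordinary new entry keeps the version history append-only: the faulty release stays in the registry, and the new entry records which earlier version it restores.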
Type hierarchy
VersionRegistry
├── current: str
└── versions: list[VersionManifest]
    ├── version: str
    ├── created_at: str
    ├── hf: HFVersionInfo (repo + commit SHA)
    ├── gcs: GCSVersionInfo (bucket + generation map)
    ├── special_operation: str?
    └── roll_back_version: str?
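One way the hierarchy above could map onto Python dataclasses is sketched below. The class and top-level field names come from the tree; the inner field names (repo, commit, bucket, generations), the use of dataclasses, and the to_json()/from_json() helpers are assumptions for illustration and may not match the real gcs_version.py module.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HFVersionInfo:
    repo: str
    commit: str  # assumed field name for the HF commit SHA


@dataclass
class GCSVersionInfo:
    bucket: str
    # assumed shape: blob name -> GCS generation number
    generations: dict[str, int] = field(default_factory=dict)


@dataclass
class VersionManifest:
    version: str
    created_at: str
    hf: HFVersionInfo
    gcs: GCSVersionInfo
    special_operation: Optional[str] = None
    roll_back_version: Optional[str] = None

    @classmethod
    def from_dict(cls, d: dict) -> VersionManifest:
        return cls(
            version=d["version"],
            created_at=d["created_at"],
            hf=HFVersionInfo(**d["hf"]),
            gcs=GCSVersionInfo(**d["gcs"]),
            special_operation=d.get("special_operation"),
            roll_back_version=d.get("roll_back_version"),
        )


@dataclass
class VersionRegistry:
    current: str
    versions: list[VersionManifest] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> VersionRegistry:
        raw = json.loads(text)
        return cls(
            current=raw["current"],
            versions=[VersionManifest.from_dict(v) for v in raw["versions"]],
        )


# Placeholder values for illustration only.
example = VersionRegistry(
    current="1.0.0",
    versions=[
        VersionManifest(
            version="1.0.0",
            created_at="2025-01-01T00:00:00Z",
            hf=HFVersionInfo(repo="example-org/example-data", commit="aaa111"),
            gcs=GCSVersionInfo(bucket="example-bucket",
                               generations={"data.h5": 1001}),
        )
    ],
)
restored = VersionRegistry.from_json(example.to_json())
print(restored == example)
```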
Files
- policyengine_us_data/utils/gcs_version.py: core module (types, registry I/O, query functions, rollback, consumer API)
- policyengine_us_data/utils/data_upload.py: modified to return generations/commits and build manifests
- policyengine_us_data/__init__.py: exports get_data_version and get_data_manifest
- policyengine_us_data/tests/test_gcs_version.py: 39 tests covering all functionality