Add unified version registry for GCS and HF dataset versioning #600

@anth-volk

Problem

There is no unified way to track all dataset versions across GCS and Hugging Face. Version information is scattered — GCS uses blob metadata, HF uses git tags — and there's no single source of truth that maps a semver string to the specific GCS generation numbers and HF commit SHAs for that release. This makes rollback discovery difficult and prevents consumers from programmatically querying the current data version.

Solution

Introduce a version registry (version_manifest.json) that:

  • Lives on both GCS and HF as a single file containing all version entries
  • Maps each semver version to its GCS generation numbers and HF commit SHA
  • Provides a current pointer to the latest deployed version
  • Supports rollback by treating it as a new release with special_operation="roll-back" metadata
  • Exposes a public consumer API (get_data_version(), get_data_manifest()) that fetches the registry from HF without credentials
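
As a rough illustration of what a consumer lookup against the registry could look like, the sketch below parses a registry document and resolves the current entry. It is not the package's implementation: the helper names `current_version` and `manifest_for`, the keys inside `hf` and `gcs` (`repo`, `commit`, `bucket`, `generations`), and every value in the sample are assumptions, and the step that downloads the file from HF is left out.

```python
import json
from typing import Optional

# Placeholder registry. The repo, bucket, file name, SHAs, and generation
# numbers are invented for illustration only.
REGISTRY_JSON = """
{
  "current": "1.1.0",
  "versions": [
    {
      "version": "1.0.0",
      "created_at": "2025-01-01T00:00:00+00:00",
      "hf": {"repo": "example-org/example-data", "commit": "aaaaaaa"},
      "gcs": {"bucket": "example-bucket", "generations": {"example.h5": 1111}},
      "special_operation": null,
      "roll_back_version": null
    },
    {
      "version": "1.1.0",
      "created_at": "2025-02-01T00:00:00+00:00",
      "hf": {"repo": "example-org/example-data", "commit": "bbbbbbb"},
      "gcs": {"bucket": "example-bucket", "generations": {"example.h5": 2222}},
      "special_operation": null,
      "roll_back_version": null
    }
  ]
}
"""


def current_version(registry: dict) -> str:
    """Return the semver string the registry's `current` pointer names."""
    return registry["current"]


def manifest_for(registry: dict, version: Optional[str] = None) -> dict:
    """Return the entry for `version`, defaulting to the current version."""
    target = version if version is not None else registry["current"]
    for entry in registry["versions"]:
        if entry["version"] == target:
            return entry
    raise KeyError(f"version {target!r} not found in registry")


registry = json.loads(REGISTRY_JSON)
latest = manifest_for(registry)  # entry that `current` points to
```

Because the whole history lives in one document, a single fetch is enough to answer both "what is current?" and "which commit and generations belong to version X?".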

Key design decisions

  1. Single registry file: all versions in one version_manifest.json, not per-version blobs
  2. Backend separation: upload_manifest() orchestrates, delegating to _upload_registry_to_gcs() and _upload_registry_to_hf()
  3. Rollback-as-release: rolling back copies old data to new GCS generations and HF commits, then publishes a new version entry
  4. Consumer API at package level: from policyengine_us_data import get_data_version, get_data_manifest
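
The rollback-as-release decision can be sketched on plain dictionaries. This is a minimal illustration under stated assumptions, not the code in gcs_version.py: the function name `roll_back`, its parameters, and the nested keys `generations` and `commit` are hypothetical, and the data copy that produces the new GCS generations and HF commit is assumed to have already happened. It also assumes `roll_back_version` records the version being restored.

```python
import copy
from datetime import datetime, timezone


def roll_back(registry, target_version, new_version, new_generations, new_commit):
    """Return a new registry with a roll-back release appended.

    The input registry is not mutated. `new_generations` and `new_commit`
    are the identifiers produced by re-uploading the old data.
    """
    if any(v["version"] == new_version for v in registry["versions"]):
        raise ValueError(f"version {new_version!r} already exists")
    old = next(
        (v for v in registry["versions"] if v["version"] == target_version), None
    )
    if old is None:
        raise ValueError(f"version {target_version!r} not found")

    # Start from the old entry, then overwrite what the re-upload changed.
    entry = copy.deepcopy(old)
    entry["version"] = new_version
    entry["created_at"] = datetime.now(timezone.utc).isoformat()
    entry["gcs"]["generations"] = dict(new_generations)
    entry["hf"]["commit"] = new_commit
    entry["special_operation"] = "roll-back"
    entry["roll_back_version"] = target_version  # assumed meaning of the field

    updated = copy.deepcopy(registry)
    updated["versions"].append(entry)
    updated["current"] = new_version
    return updated


# Placeholder data; all identifiers below are invented.
example = {
    "current": "1.1.0",
    "versions": [
        {
            "version": "1.0.0",
            "created_at": "2025-01-01T00:00:00+00:00",
            "hf": {"repo": "example-org/example-data", "commit": "aaaaaaa"},
            "gcs": {"bucket": "example-bucket", "generations": {"example.h5": 1111}},
            "special_operation": None,
            "roll_back_version": None,
        },
        {
            "version": "1.1.0",
            "created_at": "2025-02-01T00:00:00+00:00",
            "hf": {"repo": "example-org/example-data", "commit": "bbbbbbb"},
            "gcs": {"bucket": "example-bucket", "generations": {"example.h5": 2222}},
            "special_operation": None,
            "roll_back_version": None,
        },
    ],
}

rolled = roll_back(example, "1.0.0", "1.2.0", {"example.h5": 3333}, "ccccccc")
```

Appending rather than rewriting keeps the history linear: the entry for 1.1.0 stays in the registry, and the roll-back is itself a version that can later be superseded or rolled back.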

Type hierarchy

VersionRegistry
  ├── current: str
  └── versions: list[VersionManifest]
        ├── version: str
        ├── created_at: str
        ├── hf: HFVersionInfo (repo + commit SHA)
        ├── gcs: GCSVersionInfo (bucket + generation map)
        ├── special_operation: str?
        └── roll_back_version: str?
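
One way to express this hierarchy in Python is with dataclasses, shown here as a sketch rather than the module's actual definitions. The class and top-level field names follow the tree above; the attribute names inside `HFVersionInfo` and `GCSVersionInfo` (`repo`, `commit`, `bucket`, `generations`) and the `to_json`/`from_json` helpers are assumptions.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class HFVersionInfo:
    repo: str
    commit: str  # HF commit SHA


@dataclass
class GCSVersionInfo:
    bucket: str
    generations: dict[str, int]  # blob name -> GCS generation number


@dataclass
class VersionManifest:
    version: str
    created_at: str
    hf: HFVersionInfo
    gcs: GCSVersionInfo
    special_operation: Optional[str] = None
    roll_back_version: Optional[str] = None


@dataclass
class VersionRegistry:
    current: str
    versions: list[VersionManifest] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> VersionRegistry:
        raw = json.loads(text)
        versions = [
            VersionManifest(
                version=v["version"],
                created_at=v["created_at"],
                hf=HFVersionInfo(**v["hf"]),
                gcs=GCSVersionInfo(**v["gcs"]),
                special_operation=v.get("special_operation"),
                roll_back_version=v.get("roll_back_version"),
            )
            for v in raw["versions"]
        ]
        return cls(current=raw["current"], versions=versions)


# Placeholder values, invented for illustration.
sample = VersionRegistry(
    current="1.0.0",
    versions=[
        VersionManifest(
            version="1.0.0",
            created_at="2025-01-01T00:00:00+00:00",
            hf=HFVersionInfo(repo="example-org/example-data", commit="aaaaaaa"),
            gcs=GCSVersionInfo(bucket="example-bucket", generations={"example.h5": 1111}),
        )
    ],
)
restored = VersionRegistry.from_json(sample.to_json())
```

Keeping the two optional fields defaulted to `None` lets ordinary releases omit them while roll-back entries set both.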

Files

  • policyengine_us_data/utils/gcs_version.py — core module (types, registry I/O, query functions, rollback, consumer API)
  • policyengine_us_data/utils/data_upload.py — modified to return generations/commits and build manifests
  • policyengine_us_data/__init__.py — exports get_data_version and get_data_manifest
  • policyengine_us_data/tests/test_gcs_version.py — 39 tests covering all functionality
