Add configurable data roots and asset download CLI#209
Conversation
- Make EMBODICHAIN_DEFAULT_DATA_ROOT configurable via EMBODICHAIN_DATA_ROOT environment variable, with fallback to ~/.cache/embodichain_data. - Make EMBODICHAIN_DEFAULT_DATASET_ROOT configurable via EMBODICHAIN_DATASET_ROOT environment variable. - Update get_data_path to resolve paths from the local data root before falling back to the data-class download mechanism. - Add embodichain.data.download CLI for listing and pre-downloading assets by name, category, or all at once. - Add __main__.py so the CLI can be invoked as python -m embodichain.data. - Add Data Assets documentation page under docs/source/resources/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds configurable data/dataset root directories via environment variables and introduces a CLI for listing/downloading EmbodiChain data assets ahead of runtime usage.
Changes:
- Make data roots configurable via
EMBODICHAIN_DATA_ROOTandEMBODICHAIN_DATASET_ROOT. - Add a
python -m embodichain.data ...CLI to list assets and pre-download them by name/category/all. - Add documentation describing asset storage, configuration, and CLI usage.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
embodichain/data/constants.py |
Read default data roots from env vars with fallbacks to ~/.cache/.... |
embodichain/data/dataset.py |
Update get_data_path() to try resolving locally before triggering download logic. |
embodichain/data/download.py |
New CLI for asset listing and pre-download; includes non-zip “extract” copying helper. |
embodichain/data/__main__.py |
Enables python -m embodichain.data entry point to the download CLI. |
embodichain/data/assets/robot_assets.py |
Adjust CartPole dataset prefix to align with class name. |
docs/source/resources/assets.md |
New docs page for asset roots and the download CLI. |
docs/source/index.rst |
Add the new assets doc page to the Resources TOC. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Try resolving under the user-configurable data root first | ||
| from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT | ||
|
|
||
| local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) | ||
| if os.path.exists(local_path): | ||
| return local_path |
There was a problem hiding this comment.
get_data_path's new local-root fast path checks <data_root>/<dataset>/<subpath>, but assets managed by EmbodiChainDataset are extracted under <data_root>/extract/<prefix>/... (see EmbodiChainDataset.check_zip() using path/extract/<prefix>). As written, locally staged extracted assets under the normal extract/ tree will not be found and the function will still fall back to instantiating the download class. Consider checking <data_root>/extract/<dataset_name>/<subpath> (and optionally also <data_root>/<dataset>/<subpath> for backwards/manual layouts) before downloading.
| # Try resolving under the user-configurable data root first | ||
| from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT | ||
|
|
||
| local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) | ||
| if os.path.exists(local_path): | ||
| return local_path |
There was a problem hiding this comment.
The new local-root resolution returns os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) if it exists, which allows .. path traversal (e.g., ../../etc/passwd) to escape the data root and be returned if present. If data_path_in_config can come from untrusted configs, normalize the candidate path (e.g., Path(...).resolve()) and enforce it remains within the intended data root before returning it.
| print("Use 'list' to see available assets.", file=sys.stderr) | ||
| sys.exit(1) | ||
| _category, cls = result | ||
| targets.append((args.name, cls)) |
There was a problem hiding this comment.
In the --name path you append (args.name, cls) to targets. Since find_asset_class() is case-insensitive, args.name may not match the actual class name/prefix casing used by Open3D, causing _ensure_extract() to look in the wrong <data_root>/download/<prefix> and skip copying non-zip assets. Use the canonical class name (e.g., cls.__name__) when building targets and when printing status.
| targets.append((args.name, cls)) | |
| targets.append((cls.__name__, cls)) |
| EMBODICHAIN_DEFAULT_DATA_ROOT = os.environ.get( | ||
| "EMBODICHAIN_DATA_ROOT", str(Path.home() / ".cache" / "embodichain_data") | ||
| ) | ||
| EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get( | ||
| "EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets") | ||
| ) |
There was a problem hiding this comment.
Paths from EMBODICHAIN_DATA_ROOT / EMBODICHAIN_DATASET_ROOT are used verbatim. If a user sets values like ~/embodichain_data, ~ won't be expanded and relative paths won't be normalized, which can lead to confusing behavior across platforms. Consider Path(env).expanduser() (and possibly resolve() when appropriate) before converting to str.
| EMBODICHAIN_DEFAULT_DATA_ROOT = os.environ.get( | |
| "EMBODICHAIN_DATA_ROOT", str(Path.home() / ".cache" / "embodichain_data") | |
| ) | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get( | |
| "EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets") | |
| ) | |
| _env_data_root = os.environ.get("EMBODICHAIN_DATA_ROOT") | |
| if _env_data_root is not None: | |
| EMBODICHAIN_DEFAULT_DATA_ROOT = str(Path(_env_data_root).expanduser().resolve()) | |
| else: | |
| EMBODICHAIN_DEFAULT_DATA_ROOT = str(Path.home() / ".cache" / "embodichain_data") | |
| _env_dataset_root = os.environ.get("EMBODICHAIN_DATASET_ROOT") | |
| if _env_dataset_root is not None: | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT = str( | |
| Path(_env_dataset_root).expanduser().resolve() | |
| ) | |
| else: | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT = str( | |
| Path.home() / ".cache" / "embodichain_datasets" | |
| ) |
| def get_data_path(data_path_in_config: str) -> str: | ||
| """Get the absolute path of the data file. | ||
|
|
||
| Resolution order: | ||
| 1. If ``data_path_in_config`` is an absolute path, return it directly. | ||
| 2. If a matching file/directory exists under ``EMBODICHAIN_DEFAULT_DATA_ROOT`` | ||
| (which can be overridden via the ``EMBODICHAIN_DATA_ROOT`` environment | ||
| variable), return that path. | ||
| 3. Otherwise, resolve via the registered data-class download mechanism. | ||
|
|
||
| Args: | ||
| data_path_in_config (str): The dataset path in the format "${dataset_name}/subpath". | ||
| data_path_in_config (str): The dataset path in the format | ||
| ``"dataset_name/subpath"``. | ||
|
|
||
| Returns: | ||
| str: The absolute path of the data file. | ||
| """ | ||
| if os.path.isabs(data_path_in_config): | ||
| return data_path_in_config | ||
|
|
||
| # Try resolving under the user-configurable data root first | ||
| from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT | ||
|
|
||
| local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) | ||
| if os.path.exists(local_path): | ||
| return local_path | ||
|
|
||
| # Fall back to the data-class download mechanism |
There was a problem hiding this comment.
New path-resolution behavior (local-root short-circuit + env-var-configurable roots) is not covered by tests. Adding a small unit test that sets EMBODICHAIN_DATA_ROOT to a temp dir and verifies get_data_path() returns an existing staged file without instantiating/downloading the dataset class would prevent regressions.
| def main() -> None: | ||
| parser = argparse.ArgumentParser( | ||
| prog="embodichain.data.download", | ||
| description="Pre-download EmbodiChain data assets from HuggingFace.", | ||
| ) | ||
| subparsers = parser.add_subparsers(dest="command") | ||
|
|
||
| # --- list --- | ||
| list_parser = subparsers.add_parser("list", help="List available assets.") | ||
| list_parser.add_argument( | ||
| "--category", | ||
| choices=sorted(CATEGORY_MODULES), | ||
| help="Show only assets in this category.", | ||
| ) | ||
|
|
||
| # --- download --- | ||
| dl_parser = subparsers.add_parser("download", help="Download assets.") | ||
| dl_group = dl_parser.add_mutually_exclusive_group(required=True) | ||
| dl_group.add_argument("--name", help="Download a single asset by class name.") | ||
| dl_group.add_argument( | ||
| "--category", | ||
| choices=sorted(CATEGORY_MODULES), | ||
| help="Download all assets in a category.", | ||
| ) | ||
| dl_group.add_argument( | ||
| "--all", action="store_true", help="Download every registered asset." | ||
| ) | ||
|
|
||
| args = parser.parse_args() | ||
| if args.command == "list": | ||
| cmd_list(args) | ||
| elif args.command == "download": | ||
| cmd_download(args) | ||
| else: | ||
| parser.print_help() | ||
| sys.exit(1) |
There was a problem hiding this comment.
The new download CLI module has no automated tests. Since the repo already uses pytest, consider adding CLI tests (e.g., using subprocess or calling main() with patched sys.argv) to cover list output for a known category and error handling for invalid/missing args, without requiring real network downloads (mock the dataset constructors).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) | ||
| if os.path.exists(local_path): | ||
| return local_path | ||
|
|
There was a problem hiding this comment.
get_data_path checks for a local staged file at <data_root>/<dataset>/<subpath>, but assets managed by EmbodiChainDataset are extracted under <data_root>/extract/<prefix>/.... As written, this local check will miss already-extracted assets and still instantiate the dataset class (potentially triggering download logic) on every call. Consider checking <data_root>/extract/<data_path_in_config> (and/or both locations) before falling back to the download mechanism.
| # Also check the standard extraction layout used by EmbodiChainDataset: | |
| # <data_root>/extract/<prefix>/... | |
| extracted_local_path = os.path.join( | |
| EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", data_path_in_config | |
| ) | |
| if os.path.exists(extracted_local_path): | |
| return extracted_local_path |
| # Try resolving under the user-configurable data root first | ||
| from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT | ||
|
|
||
| local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) | ||
| if os.path.exists(local_path): | ||
| return local_path | ||
|
|
There was a problem hiding this comment.
The new local-root resolution behavior in get_data_path is important to avoid unnecessary downloads. There are existing tests that call get_data_path, but it would be valuable to add a focused unit test that (1) sets EMBODICHAIN_DATA_ROOT to a temp dir, (2) creates a dummy file under the expected local location, and (3) asserts get_data_path returns it without instantiating a dataset class (no network / Open3D download side effects).
| EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get( | ||
| "EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets") | ||
| ) | ||
| EMBODICHAIN_DEFAULT_DATABASE_ROOT = str( | ||
| Path.home() / ".cache" / "embodichain" / "database" | ||
| ) |
There was a problem hiding this comment.
Now that EMBODICHAIN_DEFAULT_DATA_ROOT can point to an arbitrary path via EMBODICHAIN_DATA_ROOT, the cleanup logic in EmbodiChainDataset.check_zip() (dataset.py) becomes unreliable because its is_safe_path() hard-codes checks for substrings like embodichain_data/download and embodichain_data/extract. With a custom root, corrupted downloads/extracts may no longer be cleaned up. Consider updating the safety check to validate paths relative to the configured data root (e.g., Path(dir).resolve().is_relative_to(Path(path).resolve())) rather than matching a fixed string.
| EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get( | |
| "EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets") | |
| ) | |
| EMBODICHAIN_DEFAULT_DATABASE_ROOT = str( | |
| Path.home() / ".cache" / "embodichain" / "database" | |
| ) | |
| EMBODICHAIN_DEFAULT_DATA_ROOT_PATH = Path( | |
| EMBODICHAIN_DEFAULT_DATA_ROOT | |
| ).expanduser().resolve() | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get( | |
| "EMBODICHAIN_DATASET_ROOT", | |
| str(Path.home() / ".cache" / "embodichain_datasets"), | |
| ) | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT_PATH = Path( | |
| EMBODICHAIN_DEFAULT_DATASET_ROOT | |
| ).expanduser().resolve() | |
| EMBODICHAIN_DEFAULT_DATABASE_ROOT = str( | |
| Path.home() / ".cache" / "embodichain" / "database" | |
| ) | |
| EMBODICHAIN_DEFAULT_DATABASE_ROOT_PATH = Path( | |
| EMBODICHAIN_DEFAULT_DATABASE_ROOT | |
| ).expanduser().resolve() |
| extract_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", prefix) | ||
| if os.path.exists(extract_dir) and os.listdir(extract_dir): | ||
| return # already extracted | ||
|
|
||
| download_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "download", prefix) |
There was a problem hiding this comment.
_ensure_extract accepts data_obj but never uses it. Either drop the parameter or (preferably) use data_obj.download_dir / data_obj.extract_dir to compute the copy locations so this helper stays correct even if a dataset is instantiated with a non-default data root.
| extract_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", prefix) | |
| if os.path.exists(extract_dir) and os.listdir(extract_dir): | |
| return # already extracted | |
| download_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "download", prefix) | |
| # Infer the data root from the dataset object so we respect custom data roots. | |
| extract_base = getattr(data_obj, "extract_dir", None) | |
| if extract_base is not None: | |
| # Typical Open3D layout: <data_root>/extract/<something> | |
| data_root = os.path.dirname(os.path.dirname(extract_base)) | |
| else: | |
| # Fallback to the project default if the attribute is missing. | |
| data_root = EMBODICHAIN_DEFAULT_DATA_ROOT | |
| extract_dir = os.path.join(data_root, "extract", prefix) | |
| if os.path.exists(extract_dir) and os.listdir(extract_dir): | |
| return # already extracted | |
| download_dir = os.path.join(data_root, "download", prefix) |
| | Category | Description | Examples | | ||
| |-------------|------------------------------------------------|-------------------------------------------------| | ||
| | `robot` | Robot URDF models | CobotMagicArm, Franka, UniversalRobots, UnitreeH1 | | ||
| | `eef` | End-effector / gripper models | DH_PGC_140_50, Robotiq2F85, InspireHand | | ||
| | `obj` | Manipulable objects and furniture | ShopTableSimple, CoffeeCup, TableWare | | ||
| | `scene` | Full scene environments | SceneData, EmptyRoom | | ||
| | `materials` | Rendering materials, IBL, and backgrounds | SimResources, CocoBackground | | ||
| | `w1` | DexForce W1 humanoid robot and components | DexforceW1V021, DexforceW1ChassisV021 | | ||
| | `demo` | Demo scene data | ScoopIceNewEnv, MultiW1Data | | ||
|
|
There was a problem hiding this comment.
The markdown table under "Asset Categories" is written with double pipes (|| ...) which will render as an extra empty column in most markdown renderers. Use single-pipe table syntax (| Category | Description | Examples | etc.) so the table renders correctly in the docs site.
| _ensure_extract(data_obj, cls_name) | ||
| print(f" ✓ {cls_name} ready") | ||
| except Exception as exc: | ||
| print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr) |
There was a problem hiding this comment.
download_asset() swallows all exceptions and only prints an error, so the overall download command will still exit with status 0 even if one or more assets failed to download/extract. For CI/automation use, consider tracking failures and exiting non-zero (or re-raising) when any target fails.
| print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr) | |
| print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr) | |
| raise |
Description
Add environment-variable-configurable data roots and a CLI tool for pre-downloading EmbodiChain data assets from HuggingFace.
Changes:
EMBODICHAIN_DEFAULT_DATA_ROOTnow reads fromEMBODICHAIN_DATA_ROOTenv var (fallback:~/.cache/embodichain_data)EMBODICHAIN_DEFAULT_DATASET_ROOTnow reads fromEMBODICHAIN_DATASET_ROOTenv var (fallback:~/.cache/embodichain_datasets)get_data_path()now checks the local data root before falling back to the download mechanism, so users can stage assets locally without triggering downloadsembodichain.data.downloadCLI module (python -m embodichain.data) supporting:list— show all available assets with download status, filterable by categorydownload --name/--category/--all— pre-download assets by name, category, or everythingdocs/source/resources/assets.mdType of change
Checklist
black .command to format the code base.🤖 Generated with Claude Code