Skip to content

Add configurable data roots and asset download CLI#209

Merged
yuecideng merged 3 commits intomainfrom
feat/data-download-cli
Mar 31, 2026
Merged

Add configurable data roots and asset download CLI#209
yuecideng merged 3 commits intomainfrom
feat/data-download-cli

Conversation

@yuecideng
Copy link
Copy Markdown
Contributor

Description

Add environment-variable-configurable data roots and a CLI tool for pre-downloading EmbodiChain data assets from HuggingFace.

Changes:

  • EMBODICHAIN_DEFAULT_DATA_ROOT now reads from EMBODICHAIN_DATA_ROOT env var (fallback: ~/.cache/embodichain_data)
  • EMBODICHAIN_DEFAULT_DATASET_ROOT now reads from EMBODICHAIN_DATASET_ROOT env var (fallback: ~/.cache/embodichain_datasets)
  • get_data_path() now checks the local data root before falling back to the download mechanism, so users can stage assets locally without triggering downloads
  • New embodichain.data.download CLI module (python -m embodichain.data) supporting:
    • list — show all available assets with download status, filterable by category
    • download --name/--category/--all — pre-download assets by name, category, or everything
  • Documentation page added under docs/source/resources/assets.md

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (non-breaking change which improves an existing functionality)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (existing functionality will not work without user modification)
  • Documentation update

Checklist

  • I have run the black . command to format the code base.
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • Dependencies have been updated, if applicable.

🤖 Generated with Claude Code

- Make EMBODICHAIN_DEFAULT_DATA_ROOT configurable via EMBODICHAIN_DATA_ROOT
  environment variable, with fallback to ~/.cache/embodichain_data.
- Make EMBODICHAIN_DEFAULT_DATASET_ROOT configurable via
  EMBODICHAIN_DATASET_ROOT environment variable.
- Update get_data_path to resolve paths from the local data root before
  falling back to the data-class download mechanism.
- Add embodichain.data.download CLI for listing and pre-downloading
  assets by name, category, or all at once.
- Add __main__.py so the CLI can be invoked as python -m embodichain.data.
- Add Data Assets documentation page under docs/source/resources/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 31, 2026 03:55
@yuecideng yuecideng added the data Related to data module label Mar 31, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds configurable data/dataset root directories via environment variables and introduces a CLI for listing/downloading EmbodiChain data assets ahead of runtime usage.

Changes:

  • Make data roots configurable via EMBODICHAIN_DATA_ROOT and EMBODICHAIN_DATASET_ROOT.
  • Add a python -m embodichain.data ... CLI to list assets and pre-download them by name/category/all.
  • Add documentation describing asset storage, configuration, and CLI usage.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
embodichain/data/constants.py Read default data roots from env vars with fallbacks to ~/.cache/....
embodichain/data/dataset.py Update get_data_path() to try resolving locally before triggering download logic.
embodichain/data/download.py New CLI for asset listing and pre-download; includes non-zip “extract” copying helper.
embodichain/data/__main__.py Enables python -m embodichain.data entry point to the download CLI.
embodichain/data/assets/robot_assets.py Adjust CartPole dataset prefix to align with class name.
docs/source/resources/assets.md New docs page for asset roots and the download CLI.
docs/source/index.rst Add the new assets doc page to the Resources TOC.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +173 to +178
# Try resolving under the user-configurable data root first
from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT

local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config)
if os.path.exists(local_path):
return local_path
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_data_path's new local-root fast path checks <data_root>/<dataset>/<subpath>, but assets managed by EmbodiChainDataset are extracted under <data_root>/extract/<prefix>/... (see EmbodiChainDataset.check_zip() using path/extract/<prefix>). As written, locally staged extracted assets under the normal extract/ tree will not be found and the function will still fall back to instantiating the download class. Consider checking <data_root>/extract/<dataset_name>/<subpath> (and optionally also <data_root>/<dataset>/<subpath> for backwards/manual layouts) before downloading.

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +178
# Try resolving under the user-configurable data root first
from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT

local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config)
if os.path.exists(local_path):
return local_path
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new local-root resolution returns os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config) if it exists, which allows .. path traversal (e.g., ../../etc/passwd) to escape the data root and be returned if present. If data_path_in_config can come from untrusted configs, normalize the candidate path (e.g., Path(...).resolve()) and enforce it remains within the intended data root before returning it.

Copilot uses AI. Check for mistakes.
print("Use 'list' to see available assets.", file=sys.stderr)
sys.exit(1)
_category, cls = result
targets.append((args.name, cls))
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the --name path you append (args.name, cls) to targets. Since find_asset_class() is case-insensitive, args.name may not match the actual class name/prefix casing used by Open3D, causing _ensure_extract() to look in the wrong <data_root>/download/<prefix> and skip copying non-zip assets. Use the canonical class name (e.g., cls.__name__) when building targets and when printing status.

Suggested change
targets.append((args.name, cls))
targets.append((cls.__name__, cls))

Copilot uses AI. Check for mistakes.
Comment on lines +24 to +29
EMBODICHAIN_DEFAULT_DATA_ROOT = os.environ.get(
"EMBODICHAIN_DATA_ROOT", str(Path.home() / ".cache" / "embodichain_data")
)
EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get(
"EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets")
)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paths from EMBODICHAIN_DATA_ROOT / EMBODICHAIN_DATASET_ROOT are used verbatim. If a user sets values like ~/embodichain_data, ~ won't be expanded and relative paths won't be normalized, which can lead to confusing behavior across platforms. Consider Path(env).expanduser() (and possibly resolve() when appropriate) before converting to str.

Suggested change
EMBODICHAIN_DEFAULT_DATA_ROOT = os.environ.get(
"EMBODICHAIN_DATA_ROOT", str(Path.home() / ".cache" / "embodichain_data")
)
EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get(
"EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets")
)
_env_data_root = os.environ.get("EMBODICHAIN_DATA_ROOT")
if _env_data_root is not None:
EMBODICHAIN_DEFAULT_DATA_ROOT = str(Path(_env_data_root).expanduser().resolve())
else:
EMBODICHAIN_DEFAULT_DATA_ROOT = str(Path.home() / ".cache" / "embodichain_data")
_env_dataset_root = os.environ.get("EMBODICHAIN_DATASET_ROOT")
if _env_dataset_root is not None:
EMBODICHAIN_DEFAULT_DATASET_ROOT = str(
Path(_env_dataset_root).expanduser().resolve()
)
else:
EMBODICHAIN_DEFAULT_DATASET_ROOT = str(
Path.home() / ".cache" / "embodichain_datasets"
)

Copilot uses AI. Check for mistakes.
Comment on lines 153 to +180
def get_data_path(data_path_in_config: str) -> str:
"""Get the absolute path of the data file.

Resolution order:
1. If ``data_path_in_config`` is an absolute path, return it directly.
2. If a matching file/directory exists under ``EMBODICHAIN_DEFAULT_DATA_ROOT``
(which can be overridden via the ``EMBODICHAIN_DATA_ROOT`` environment
variable), return that path.
3. Otherwise, resolve via the registered data-class download mechanism.

Args:
data_path_in_config (str): The dataset path in the format "${dataset_name}/subpath".
data_path_in_config (str): The dataset path in the format
``"dataset_name/subpath"``.

Returns:
str: The absolute path of the data file.
"""
if os.path.isabs(data_path_in_config):
return data_path_in_config

# Try resolving under the user-configurable data root first
from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT

local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config)
if os.path.exists(local_path):
return local_path

# Fall back to the data-class download mechanism
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New path-resolution behavior (local-root short-circuit + env-var-configurable roots) is not covered by tests. Adding a small unit test that sets EMBODICHAIN_DATA_ROOT to a temp dir and verifies get_data_path() returns an existing staged file without instantiating/downloading the dataset class would prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines +213 to +248
def main() -> None:
parser = argparse.ArgumentParser(
prog="embodichain.data.download",
description="Pre-download EmbodiChain data assets from HuggingFace.",
)
subparsers = parser.add_subparsers(dest="command")

# --- list ---
list_parser = subparsers.add_parser("list", help="List available assets.")
list_parser.add_argument(
"--category",
choices=sorted(CATEGORY_MODULES),
help="Show only assets in this category.",
)

# --- download ---
dl_parser = subparsers.add_parser("download", help="Download assets.")
dl_group = dl_parser.add_mutually_exclusive_group(required=True)
dl_group.add_argument("--name", help="Download a single asset by class name.")
dl_group.add_argument(
"--category",
choices=sorted(CATEGORY_MODULES),
help="Download all assets in a category.",
)
dl_group.add_argument(
"--all", action="store_true", help="Download every registered asset."
)

args = parser.parse_args()
if args.command == "list":
cmd_list(args)
elif args.command == "download":
cmd_download(args)
else:
parser.print_help()
sys.exit(1)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new download CLI module has no automated tests. Since the repo already uses pytest, consider adding CLI tests (e.g., using subprocess or calling main() with patched sys.argv) to cover list output for a known category and error handling for invalid/missing args, without requiring real network downloads (mock the dataset constructors).

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings March 31, 2026 04:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config)
if os.path.exists(local_path):
return local_path

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_data_path checks for a local staged file at <data_root>/<dataset>/<subpath>, but assets managed by EmbodiChainDataset are extracted under <data_root>/extract/<prefix>/.... As written, this local check will miss already-extracted assets and still instantiate the dataset class (potentially triggering download logic) on every call. Consider checking <data_root>/extract/<data_path_in_config> (and/or both locations) before falling back to the download mechanism.

Suggested change
# Also check the standard extraction layout used by EmbodiChainDataset:
# <data_root>/extract/<prefix>/...
extracted_local_path = os.path.join(
EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", data_path_in_config
)
if os.path.exists(extracted_local_path):
return extracted_local_path

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +179
# Try resolving under the user-configurable data root first
from embodichain.data.constants import EMBODICHAIN_DEFAULT_DATA_ROOT

local_path = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, data_path_in_config)
if os.path.exists(local_path):
return local_path

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new local-root resolution behavior in get_data_path is important to avoid unnecessary downloads. There are existing tests that call get_data_path, but it would be valuable to add a focused unit test that (1) sets EMBODICHAIN_DATA_ROOT to a temp dir, (2) creates a dummy file under the expected local location, and (3) asserts get_data_path returns it without instantiating a dataset class (no network / Open3D download side effects).

Copilot uses AI. Check for mistakes.
Comment on lines +27 to 32
EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get(
"EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets")
)
EMBODICHAIN_DEFAULT_DATABASE_ROOT = str(
Path.home() / ".cache" / "embodichain" / "database"
)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that EMBODICHAIN_DEFAULT_DATA_ROOT can point to an arbitrary path via EMBODICHAIN_DATA_ROOT, the cleanup logic in EmbodiChainDataset.check_zip() (dataset.py) becomes unreliable because its is_safe_path() hard-codes checks for substrings like embodichain_data/download and embodichain_data/extract. With a custom root, corrupted downloads/extracts may no longer be cleaned up. Consider updating the safety check to validate paths relative to the configured data root (e.g., Path(dir).resolve().is_relative_to(Path(path).resolve())) rather than matching a fixed string.

Suggested change
EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get(
"EMBODICHAIN_DATASET_ROOT", str(Path.home() / ".cache" / "embodichain_datasets")
)
EMBODICHAIN_DEFAULT_DATABASE_ROOT = str(
Path.home() / ".cache" / "embodichain" / "database"
)
EMBODICHAIN_DEFAULT_DATA_ROOT_PATH = Path(
EMBODICHAIN_DEFAULT_DATA_ROOT
).expanduser().resolve()
EMBODICHAIN_DEFAULT_DATASET_ROOT = os.environ.get(
"EMBODICHAIN_DATASET_ROOT",
str(Path.home() / ".cache" / "embodichain_datasets"),
)
EMBODICHAIN_DEFAULT_DATASET_ROOT_PATH = Path(
EMBODICHAIN_DEFAULT_DATASET_ROOT
).expanduser().resolve()
EMBODICHAIN_DEFAULT_DATABASE_ROOT = str(
Path.home() / ".cache" / "embodichain" / "database"
)
EMBODICHAIN_DEFAULT_DATABASE_ROOT_PATH = Path(
EMBODICHAIN_DEFAULT_DATABASE_ROOT
).expanduser().resolve()

Copilot uses AI. Check for mistakes.
Comment on lines +110 to +114
extract_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", prefix)
if os.path.exists(extract_dir) and os.listdir(extract_dir):
return # already extracted

download_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "download", prefix)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ensure_extract accepts data_obj but never uses it. Either drop the parameter or (preferably) use data_obj.download_dir / data_obj.extract_dir to compute the copy locations so this helper stays correct even if a dataset is instantiated with a non-default data root.

Suggested change
extract_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "extract", prefix)
if os.path.exists(extract_dir) and os.listdir(extract_dir):
return # already extracted
download_dir = os.path.join(EMBODICHAIN_DEFAULT_DATA_ROOT, "download", prefix)
# Infer the data root from the dataset object so we respect custom data roots.
extract_base = getattr(data_obj, "extract_dir", None)
if extract_base is not None:
# Typical Open3D layout: <data_root>/extract/<something>
data_root = os.path.dirname(os.path.dirname(extract_base))
else:
# Fallback to the project default if the attribute is missing.
data_root = EMBODICHAIN_DEFAULT_DATA_ROOT
extract_dir = os.path.join(data_root, "extract", prefix)
if os.path.exists(extract_dir) and os.listdir(extract_dir):
return # already extracted
download_dir = os.path.join(data_root, "download", prefix)

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +73
| Category | Description | Examples |
|-------------|------------------------------------------------|-------------------------------------------------|
| `robot` | Robot URDF models | CobotMagicArm, Franka, UniversalRobots, UnitreeH1 |
| `eef` | End-effector / gripper models | DH_PGC_140_50, Robotiq2F85, InspireHand |
| `obj` | Manipulable objects and furniture | ShopTableSimple, CoffeeCup, TableWare |
| `scene` | Full scene environments | SceneData, EmptyRoom |
| `materials` | Rendering materials, IBL, and backgrounds | SimResources, CocoBackground |
| `w1` | DexForce W1 humanoid robot and components | DexforceW1V021, DexforceW1ChassisV021 |
| `demo` | Demo scene data | ScoopIceNewEnv, MultiW1Data |

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The markdown table under "Asset Categories" is written with double pipes (|| ...) which will render as an extra empty column in most markdown renderers. Use single-pipe table syntax (| Category | Description | Examples | etc.) so the table renders correctly in the docs site.

Copilot uses AI. Check for mistakes.
_ensure_extract(data_obj, cls_name)
print(f" ✓ {cls_name} ready")
except Exception as exc:
print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

download_asset() swallows all exceptions and only prints an error, so the overall download command will still exit with status 0 even if one or more assets failed to download/extract. For CI/automation use, consider tracking failures and exiting non-zero (or re-raising) when any target fails.

Suggested change
print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr)
print(f" ✗ {cls_name} failed: {exc}", file=sys.stderr)
raise

Copilot uses AI. Check for mistakes.
@yuecideng yuecideng merged commit 3ed083e into main Mar 31, 2026
13 of 14 checks passed
@yuecideng yuecideng deleted the feat/data-download-cli branch March 31, 2026 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Related to data module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants