Add TALES score analysis and HuggingFace dataset builder scripts #37

Open
MarcCote wants to merge 9 commits into main from macote/analysis

Conversation

Contributor

@MarcCote commented Apr 1, 2026

This pull request introduces a new script, analysis/build_hf_dataset.py, that streamlines the process of building a Hugging Face-compatible dataset from TALES wandb trajectories. The script automates downloading, filtering, deduplication, and enrichment of trajectory data, and supports incremental builds with caching. It outputs the final dataset in either Parquet or JSONL format, ready for use with datasets.load_dataset().

Key features and functionality:

Dataset construction and enrichment:

  • Downloads JSONL rollout files from wandb runs, enriches each step with run metadata (e.g., model, game, framework, seed), and saves the result as a Parquet or JSONL dataset.

Filtering, deduplication, and caching:

  • Supports flexible filtering by model, framework, game, max_steps, and seed, and deduplicates runs to keep only the longest per (model, game, seed) tuple.
  • Implements incremental builds via a cache file, avoiding re-downloading already-processed runs and merging new data with cached data.
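The deduplication rule above can be sketched as follows. This is a hypothetical illustration, not the script's actual code; the field names (`model`, `game`, `seed`, `num_steps`) are assumptions standing in for whatever schema the run records use.

```python
# Sketch of the deduplication rule: among runs sharing a (model, game, seed)
# tuple, keep only the one with the most steps. Field names are illustrative.

def dedup_longest(runs):
    best = {}
    for run in runs:
        key = (run["model"], run["game"], run["seed"])
        if key not in best or run["num_steps"] > best[key]["num_steps"]:
            best[key] = run
    return list(best.values())
```

A 50-step and a 100-step run for the same (model, game, seed) would collapse to just the 100-step one.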

Usability and output:

  • Provides a command-line interface with detailed usage instructions, dry-run support for listing matching runs, and informative output summaries.
  • Ensures the dataset columns are ordered with metadata first, followed by rollout data, and prints a summary of the resulting dataset.
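The metadata-first column ordering could look like the sketch below. The metadata column names are assumptions based on the description above, not the script's actual schema.

```python
import pandas as pd

# Illustrative sketch: put known metadata columns first, then everything else
# in its original order. META_COLS is an assumption, not the real schema.
META_COLS = ["model", "game", "framework", "seed"]

def order_columns(df: pd.DataFrame) -> pd.DataFrame:
    meta = [c for c in META_COLS if c in df.columns]
    rest = [c for c in df.columns if c not in meta]
    return df[meta + rest]
```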

MarcCote and others added 8 commits March 27, 2026 11:50
Enables extending experiments beyond their original step limit by replaying
a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- --continue-from <run_id>: replay a specific wandb run
- --continue-from (no value): auto-find the best matching run by game,
  LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls,
  preserving original token usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates trajectory when target steps < existing run steps
- Skips LLM loop when game was already won with max score
- Logs as a new wandb run referencing the original run ID in config

New files:
- tales/wandb_utils.py: fetch_run_trajectory() and find_matching_run()
- scripts/test_replay_determinism.py: validates deterministic replay
  across all 5 environment frameworks (Jericho, TextWorld, ALFWorld,
  TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

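The auto-matching behavior described above (filter by game, LLM model, agent type, and seed, then pick the run with the most steps) can be sketched like this. The `config`/`summary` dictionaries are illustrative stand-ins for wandb Run objects, not the actual `find_matching_run()` implementation.

```python
# Hedged sketch of auto-finding the best matching run: filter candidates by
# game, model, agent type, and seed, then take the one with the most steps.
# The run/config/summary shapes are stand-ins for wandb's Run objects.

def find_matching_run(runs, game, model, agent, seed):
    wanted = {"game": game, "model": model, "agent": agent, "seed": seed}
    candidates = [
        r for r in runs
        if all(r["config"].get(k) == v for k, v in wanted.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["summary"].get("nb_steps", 0))
```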
Extract shared helpers (_make_state, _step_env, _check_invalid,
_handle_done, _record_step) and standalone replay_trajectory() and
play_with_agent() functions from the monolithic evaluate() function.

This improves readability and maintainability of the replay/continue
logic while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

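Structurally, the refactor described above might look like the sketch below: `evaluate()` delegating to a no-LLM replay phase and an agent-driven phase. This is not the actual tales code; the env/agent interfaces and the step-record shape are illustrative assumptions.

```python
# Structural sketch (not the actual code) of the replay/continue split.

def replay_trajectory(env, recorded_actions):
    """Replay phase: feed recorded actions to the env; no LLM calls."""
    steps = []
    for action in recorded_actions:
        obs, score, done = env.step(action)
        steps.append({"action": action, "obs": obs, "score": score})
        if done:
            break
    return steps

def play_with_agent(env, agent, nb_steps):
    """Agent phase: the LLM agent takes over for the remaining budget."""
    steps = []
    for _ in range(nb_steps):
        action = agent.act()
        obs, score, done = env.step(action)
        steps.append({"action": action, "obs": obs, "score": score})
        if done:
            break
    return steps

def evaluate(env, agent, recorded_actions, nb_steps):
    # Truncate the recorded trajectory when the target budget is smaller.
    replayed = replay_trajectory(env, recorded_actions[:nb_steps])
    remaining = nb_steps - len(replayed)
    return replayed + play_with_agent(env, agent, remaining)
```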
Document that trajectories are truncated to --nb-steps when the
original run is longer, and that replay stops early when the game
was already completed (max score reached).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add gpt-5.3, gpt-5.4 to OPENAI_MODELS in reasoning agent
- Remove claude-haiku-4.5 from CLAUDE_MODELS (not a reasoning model)
- Simplify token counter: use startswith for gpt-5.x and gpt-4.1.x
- Fix wandb duplicate-run check: always check (not just when
  force_all is off), add project path, exclude 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

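The `startswith` simplification mentioned above can be illustrated as below. The family prefixes mirror the commit message; the function name is made up for this sketch.

```python
# Illustrative sketch: match whole model families (gpt-5, gpt-5.3, gpt-5.4,
# gpt-4.1, ...) with prefix checks instead of enumerating every point release.
# str.startswith accepts a tuple of prefixes, so one call covers both families.

def in_model_family(model_name: str, prefixes=("gpt-5", "gpt-4.1")) -> bool:
    return model_name.startswith(prefixes)
```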
Default is 'tales'; entity resolved by wandb from the logged-in user.
Removes hardcoded org/entity references from the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

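The default-project behavior described above could be resolved as in this sketch. The `"tales"` default comes from the commit message; the CLI/env-var override order is an assumption for illustration.

```python
import os

# Sketch: explicit CLI value wins, then an environment override, then the
# "tales" default. No entity is passed anywhere, so wandb resolves it from
# the logged-in user.

def resolve_project(cli_value=None):
    return cli_value or os.environ.get("WANDB_PROJECT") or "tales"
```

A call like `wandb.init(project=resolve_project())` would then omit the entity entirely.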
Fetches run data from wandb, computes average normalized score (TALES
metric) at configurable budget intervals, and generates line plots
with per-framework breakdowns.

Features:
- Per-step history caching (reusable across budget intervals)
- Incremental cache updates (only downloads new runs)
- Per-(model, budget) completeness filtering
- --max-steps: filter by one or more max_steps values
- --budget-step: configurable budget interval
- --continuous: smooth curves at every step (with merge_asof optimization)
- --log-x: logarithmic x-axis scale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

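The budget-interval scoring described above can be sketched with pandas. Column names (`run`, `tokens_used`, `norm_score`) are illustrative assumptions: for each budget point, take each run's latest step at or below that budget (which is exactly what `pandas.merge_asof`'s backward search does), then average the normalized scores across runs.

```python
import pandas as pd

# Hedged sketch of per-budget scoring using merge_asof's backward search.

def score_at_budgets(history: pd.DataFrame, budgets) -> pd.Series:
    grid = pd.DataFrame({"budget": sorted(budgets)})
    per_run = {}
    for run_id, steps in history.groupby("run"):
        steps = steps.sort_values("tokens_used")  # merge_asof needs sorted keys
        merged = pd.merge_asof(grid, steps, left_on="budget", right_on="tokens_used")
        per_run[run_id] = merged["norm_score"].to_numpy()
    # Runs with no step under a given budget contribute NaN and are skipped.
    return pd.DataFrame(per_run, index=grid["budget"]).mean(axis=1)
```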
Downloads JSONL rollout files from wandb runs, enriches each step
with run metadata, and saves as Parquet or JSONL for HuggingFace.

Features:
- Filtering by --models, --frameworks, --games, --max-steps, --seeds
- Deduplication (longest run per model/game/seed tuple)
- Incremental builds via --cache (only fetches new runs)
- --dry-run to preview matching runs
- --format parquet|jsonl

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

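The incremental `--cache` behavior described above amounts to remembering which wandb runs were already processed so only new ones are downloaded. A minimal sketch, assuming a simple JSON cache-file layout (an illustration, not the script's actual format):

```python
import json
from pathlib import Path

# Sketch of the incremental cache: skip run IDs already present in the cache,
# and merge newly fetched rows back into it.

def load_cache(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {"runs": {}}

def runs_to_fetch(all_run_ids, cache: dict) -> list:
    return [rid for rid in all_run_ids if rid not in cache["runs"]]

def update_cache(path: Path, cache: dict, new_rows_by_run: dict) -> None:
    cache["runs"].update(new_rows_by_run)
    path.write_text(json.dumps(cache))
```

The finished Parquet/JSONL file can then be loaded with `datasets.load_dataset("parquet", data_files=...)` as the PR description notes.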
Base automatically changed from macote/support-resume to main May 1, 2026 18:40