Add TALES score analysis and HuggingFace dataset builder scripts #37

Open
MarcCote wants to merge 9 commits into main from macote/analysis

Conversation

Contributor

@MarcCote commented Apr 1, 2026

This pull request introduces a new script, analysis/build_hf_dataset.py, that streamlines the process of building a Hugging Face-compatible dataset from TALES wandb trajectories. The script automates downloading, filtering, deduplication, and enrichment of trajectory data, and supports incremental builds with caching. It outputs the final dataset in either Parquet or JSONL format, ready for use with datasets.load_dataset().

Key features and functionality:

Dataset construction and enrichment:

  • Downloads JSONL rollout files from wandb runs, enriches each step with run metadata (e.g., model, game, framework, seed), and saves the result as a Parquet or JSONL dataset.

Filtering, deduplication, and caching:

  • Supports flexible filtering by model, framework, game, max_steps, and seed, and deduplicates runs to keep only the longest per (model, game, seed) tuple.
  • Implements incremental builds via a cache file, avoiding re-downloading already-processed runs and merging new data with cached data.
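The deduplication rule above can be sketched as follows. This is a hypothetical illustration, not the script's actual code; the field names (`model`, `game`, `seed`, `num_steps`) are assumptions standing in for whatever schema the run records use.

```python
# Sketch of the deduplication rule: among runs sharing a (model, game, seed)
# tuple, keep only the one with the most steps. Field names are illustrative.

def dedup_longest(runs):
    best = {}
    for run in runs:
        key = (run["model"], run["game"], run["seed"])
        if key not in best or run["num_steps"] > best[key]["num_steps"]:
            best[key] = run
    return list(best.values())
```

A 50-step and a 100-step run for the same (model, game, seed) would collapse to just the 100-step one.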

Usability and output:

  • Provides a command-line interface with detailed usage instructions, dry-run support for listing matching runs, and informative output summaries.
  • Ensures the dataset columns are ordered with metadata first, followed by rollout data, and prints a summary of the resulting dataset.
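The metadata-first column ordering could look like the sketch below. The metadata column names are assumptions based on the description above, not the script's actual schema.

```python
import pandas as pd

# Illustrative sketch: put known metadata columns first, then everything else
# in its original order. META_COLS is an assumption, not the real schema.
META_COLS = ["model", "game", "framework", "seed"]

def order_columns(df: pd.DataFrame) -> pd.DataFrame:
    meta = [c for c in META_COLS if c in df.columns]
    rest = [c for c in df.columns if c not in meta]
    return df[meta + rest]
```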

MarcCote and others added 8 commits March 27, 2026 11:50
Enables extending experiments beyond their original step limit by replaying
a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- --continue-from <run_id>: replay a specific wandb run
- --continue-from (no value): auto-find the best matching run by game,
  LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls,
  preserving original token usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates trajectory when target steps < existing run steps
- Skips LLM loop when game was already won with max score
- Logs as a new wandb run referencing the original run ID in config

New files:
- tales/wandb_utils.py: fetch_run_trajectory() and find_matching_run()
- scripts/test_replay_determinism.py: validates deterministic replay
  across all 5 environment frameworks (Jericho, TextWorld, ALFWorld,
  TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

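The auto-matching behavior described above (filter by game, LLM model, agent type, and seed, then pick the run with the most steps) can be sketched like this. The `config`/`summary` dictionaries are illustrative stand-ins for wandb Run objects, not the actual `find_matching_run()` implementation.

```python
# Hedged sketch of auto-finding the best matching run: filter candidates by
# game, model, agent type, and seed, then take the one with the most steps.
# The run/config/summary shapes are stand-ins for wandb's Run objects.

def find_matching_run(runs, game, model, agent, seed):
    wanted = {"game": game, "model": model, "agent": agent, "seed": seed}
    candidates = [
        r for r in runs
        if all(r["config"].get(k) == v for k, v in wanted.items())
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["summary"].get("nb_steps", 0))
```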
Extract shared helpers (_make_state, _step_env, _check_invalid,
_handle_done, _record_step) and standalone replay_trajectory() and
play_with_agent() functions from the monolithic evaluate() function.

This improves readability and maintainability of the replay/continue
logic while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

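Structurally, the refactor described above might look like the sketch below: `evaluate()` delegating to a no-LLM replay phase and an agent-driven phase. This is not the actual tales code; the env/agent interfaces and the step-record shape are illustrative assumptions.

```python
# Structural sketch (not the actual code) of the replay/continue split.

def replay_trajectory(env, recorded_actions):
    """Replay phase: feed recorded actions to the env; no LLM calls."""
    steps = []
    for action in recorded_actions:
        obs, score, done = env.step(action)
        steps.append({"action": action, "obs": obs, "score": score})
        if done:
            break
    return steps

def play_with_agent(env, agent, nb_steps):
    """Agent phase: the LLM agent takes over for the remaining budget."""
    steps = []
    for _ in range(nb_steps):
        action = agent.act()
        obs, score, done = env.step(action)
        steps.append({"action": action, "obs": obs, "score": score})
        if done:
            break
    return steps

def evaluate(env, agent, recorded_actions, nb_steps):
    # Truncate the recorded trajectory when the target budget is smaller.
    replayed = replay_trajectory(env, recorded_actions[:nb_steps])
    remaining = nb_steps - len(replayed)
    return replayed + play_with_agent(env, agent, remaining)
```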
Document that trajectories are truncated to --nb-steps when the
original run is longer, and that replay stops early when the game
was already completed (max score reached).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add gpt-5.3, gpt-5.4 to OPENAI_MODELS in reasoning agent
- Remove claude-haiku-4.5 from CLAUDE_MODELS (not a reasoning model)
- Simplify token counter: use startswith for gpt-5.x and gpt-4.1.x
- Fix wandb duplicate-run check: always check (not just when
  force_all is off), add project path, exclude 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

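The `startswith` simplification mentioned above can be illustrated as below. The family prefixes mirror the commit message; the function name is made up for this sketch.

```python
# Illustrative sketch: match whole model families (gpt-5, gpt-5.3, gpt-5.4,
# gpt-4.1, ...) with prefix checks instead of enumerating every point release.
# str.startswith accepts a tuple of prefixes, so one call covers both families.

def in_model_family(model_name: str, prefixes=("gpt-5", "gpt-4.1")) -> bool:
    return model_name.startswith(prefixes)
```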
Default is 'tales'; entity resolved by wandb from the logged-in user.
Removes hardcoded org/entity references from the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

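The default-project behavior described above could be resolved as in this sketch. The `"tales"` default comes from the commit message; the CLI/env-var override order is an assumption for illustration.

```python
import os

# Sketch: explicit CLI value wins, then an environment override, then the
# "tales" default. No entity is passed anywhere, so wandb resolves it from
# the logged-in user.

def resolve_project(cli_value=None):
    return cli_value or os.environ.get("WANDB_PROJECT") or "tales"
```

A call like `wandb.init(project=resolve_project())` would then omit the entity entirely.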
Fetches run data from wandb, computes average normalized score (TALES
metric) at configurable budget intervals, and generates line plots
with per-framework breakdowns.

Features:
- Per-step history caching (reusable across budget intervals)
- Incremental cache updates (only downloads new runs)
- Per-(model, budget) completeness filtering
- --max-steps: filter by one or more max_steps values
- --budget-step: configurable budget interval
- --continuous: smooth curves at every step (with merge_asof optimization)
- --log-x: logarithmic x-axis scale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

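The budget-interval scoring described above can be sketched with pandas. Column names (`run`, `tokens_used`, `norm_score`) are illustrative assumptions: for each budget point, take each run's latest step at or below that budget (which is exactly what `pandas.merge_asof`'s backward search does), then average the normalized scores across runs.

```python
import pandas as pd

# Hedged sketch of per-budget scoring using merge_asof's backward search.

def score_at_budgets(history: pd.DataFrame, budgets) -> pd.Series:
    grid = pd.DataFrame({"budget": sorted(budgets)})
    per_run = {}
    for run_id, steps in history.groupby("run"):
        steps = steps.sort_values("tokens_used")  # merge_asof needs sorted keys
        merged = pd.merge_asof(grid, steps, left_on="budget", right_on="tokens_used")
        per_run[run_id] = merged["norm_score"].to_numpy()
    # Runs with no step under a given budget contribute NaN and are skipped.
    return pd.DataFrame(per_run, index=grid["budget"]).mean(axis=1)
```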
Downloads JSONL rollout files from wandb runs, enriches each step
with run metadata, and saves as Parquet or JSONL for HuggingFace.

Features:
- Filtering by --models, --frameworks, --games, --max-steps, --seeds
- Deduplication (longest run per model/game/seed tuple)
- Incremental builds via --cache (only fetches new runs)
- --dry-run to preview matching runs
- --format parquet|jsonl

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

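The incremental `--cache` behavior described above amounts to remembering which wandb runs were already processed so only new ones are downloaded. A minimal sketch, assuming a simple JSON cache-file layout (an illustration, not the script's actual format):

```python
import json
from pathlib import Path

# Sketch of the incremental cache: skip run IDs already present in the cache,
# and merge newly fetched rows back into it.

def load_cache(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {"runs": {}}

def runs_to_fetch(all_run_ids, cache: dict) -> list:
    return [rid for rid in all_run_ids if rid not in cache["runs"]]

def update_cache(path: Path, cache: dict, new_rows_by_run: dict) -> None:
    cache["runs"].update(new_rows_by_run)
    path.write_text(json.dumps(cache))
```

The finished Parquet/JSONL file can then be loaded with `datasets.load_dataset("parquet", data_files=...)` as the PR description notes.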
Base automatically changed from macote/support-resume to main May 1, 2026 18:40