Add TALES score analysis and HuggingFace dataset builder scripts#37
Open
Conversation
Enables extending experiments beyond their original step limit by replaying a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- --continue-from <run_id>: replay a specific wandb run
- --continue-from (no value): auto-find the best matching run by game, LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls, preserving original token-usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates the trajectory when the target step count is below the existing run's step count
- Skips the LLM loop when the game was already won with the max score
- Logs as a new wandb run that references the original run ID in its config

New files:
- tales/wandb_utils.py: fetch_run_trajectory() and find_matching_run()
- scripts/test_replay_determinism.py: validates deterministic replay across all 5 environment frameworks (Jericho, TextWorld, ALFWorld, TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
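The auto-find behavior described above can be sketched as a pure function over run metadata. This is a hypothetical stand-in for `find_matching_run()` in `tales/wandb_utils.py` (which queries the wandb API); the dict keys used here are assumptions for illustration:

```python
def find_matching_run(runs, game, model, agent_type, seed):
    """Pick the best previous run to continue from: among runs whose
    game, LLM model, agent type, and seed all match, prefer the one
    with the most recorded steps. Returns None when nothing matches."""
    candidates = [
        r for r in runs
        if r["game"] == game
        and r["model"] == model
        and r["agent_type"] == agent_type
        and r["seed"] == seed
    ]
    return max(candidates, key=lambda r: r["steps"], default=None)
```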
Extract shared helpers (_make_state, _step_env, _check_invalid, _handle_done, _record_step) and standalone replay_trajectory() and play_with_agent() functions from the monolithic evaluate() function. This improves readability and maintainability of the replay/continue logic while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
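The replay phase reduces to a loop like the following. This is a self-contained sketch (with a toy deterministic environment for demonstration), not the extracted `replay_trajectory()` itself; the trajectory record shape is an assumption:

```python
class EchoEnv:
    """Toy deterministic environment: the observation echoes the last action."""

    def reset(self):
        return "start", {}

    def step(self, action):
        done = action == "quit"
        return f"you did: {action}", 1.0 if done else 0.0, done, {}


def replay_trajectory(env, trajectory, nb_steps):
    """Feed recorded actions back to the env with no LLM calls.
    Truncates at nb_steps, warns when the live observation diverges
    from the recording, and stops early if the episode ends."""
    obs, info = env.reset()
    score = 0.0
    for i, step in enumerate(trajectory[:nb_steps]):
        if step.get("obs") is not None and step["obs"] != obs:
            print(f"warning: replay diverged from recording at step {i}")
        obs, score, done, info = env.step(step["action"])
        if done:
            break
    return obs, score
```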
Document that trajectories are truncated to --nb-steps when the original run is longer, and that replay stops early when the game was already completed (max score reached).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add gpt-5.3 and gpt-5.4 to OPENAI_MODELS in the reasoning agent
- Remove claude-haiku-4.5 from CLAUDE_MODELS (not a reasoning model)
- Simplify the token counter: use startswith for gpt-5.x and gpt-4.1.x
- Fix the wandb duplicate-run check: always check (not just when force_all is off), add the project path, exclude the 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
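The startswith simplification can be illustrated like this. The commit only specifies the prefix matching; the encoding names below are an assumption in the tiktoken style, not the repository's actual mapping:

```python
def tokenizer_family(model_name):
    """Pick a tokenizer encoding by model-name prefix (sketch).
    startswith covers whole families (gpt-5.3, gpt-5.4, gpt-4.1-mini, ...)
    instead of enumerating every version in a lookup table."""
    if model_name.startswith(("gpt-5", "gpt-4.1")):
        return "o200k_base"  # assumed encoding for newer families
    return "cl100k_base"     # assumed fallback
```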
The default project is 'tales'; the entity is resolved by wandb from the logged-in user. Removes hardcoded org/entity references from the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fetches run data from wandb, computes the average normalized score (the TALES metric) at configurable budget intervals, and generates line plots with per-framework breakdowns.

Features:
- Per-step history caching (reusable across budget intervals)
- Incremental cache updates (only downloads new runs)
- Per-(model, budget) completeness filtering
- --max-steps: filter by one or more max_steps values
- --budget-step: configurable budget interval
- --continuous: smooth curves at every step (with a merge_asof optimization)
- --log-x: logarithmic x-axis scale

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
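The budget-interval metric can be sketched in pure Python. The real script pulls per-step histories from wandb (and uses pandas merge_asof for the --continuous path); `scores_at_budgets` is a hypothetical helper showing only the aggregation step:

```python
def scores_at_budgets(histories, budgets):
    """Average normalized score at each step budget. Each history is a
    step-sorted list of (step, normalized_score) pairs for one run; for
    a given budget we take the run's last score at or before that step,
    counting runs with no eligible entry as 0.0."""
    averages = {}
    for budget in budgets:
        scores = []
        for history in histories:
            eligible = [s for step, s in history if step <= budget]
            scores.append(eligible[-1] if eligible else 0.0)
        averages[budget] = sum(scores) / len(scores)
    return averages
```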
Downloads JSONL rollout files from wandb runs, enriches each step with run metadata, and saves the result as Parquet or JSONL for HuggingFace.

Features:
- Filtering by --models, --frameworks, --games, --max-steps, --seeds
- Deduplication (longest run per model/game/seed tuple)
- Incremental builds via --cache (only fetches new runs)
- --dry-run to preview matching runs
- --format parquet|jsonl

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
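The deduplication rule (keep the longest run per model/game/seed tuple) can be sketched as follows; this is an illustrative helper with an assumed metadata shape, not the script's actual code:

```python
def dedupe_longest(runs):
    """Keep only the run with the most steps for each
    (model, game, seed) tuple; order of the result is not significant."""
    best = {}
    for run in runs:
        key = (run["model"], run["game"], run["seed"])
        if key not in best or run["steps"] > best[key]["steps"]:
            best[key] = run
    return list(best.values())
```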
This pull request introduces a new script, analysis/build_hf_dataset.py, that streamlines building a Hugging Face-compatible dataset from TALES wandb trajectories. The script automates downloading, filtering, deduplication, and enrichment of trajectory data, and supports incremental builds with caching. It outputs the final dataset in either Parquet or JSONL format, ready for use with datasets.load_dataset().

Key features and functionality:
Dataset construction and enrichment:
Filtering, deduplication, and caching:
Usability and output: