Add --continue-from flag for replaying and extending experiments#36
Merged
Conversation
Enables extending experiments beyond their original step limit by replaying a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- `--continue-from <run_id>`: replay a specific wandb run
- `--continue-from` (no value): auto-find the best matching run by game, LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls, preserving original token usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates the trajectory when the target step count is lower than the existing run's step count
- Skips the LLM loop when the game was already won with the max score
- Logs as a new wandb run, referencing the original run ID in the config

New files:
- `tales/wandb_utils.py`: `fetch_run_trajectory()` and `find_matching_run()`
- `scripts/test_replay_determinism.py`: validates deterministic replay across all 5 environment frameworks (Jericho, TextWorld, ALFWorld, TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
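The replay-then-continue control flow described above can be sketched roughly as follows. This is an illustrative stub, not the PR's actual code: the environment interface, `agent.act()`, and the tuple layout of recorded steps are all assumptions.

```python
def replay_then_continue(env, recorded_actions, agent, nb_steps):
    """Replay recorded actions without LLM calls, then let the agent continue.

    recorded_actions: list of (action, expected_observation) pairs from the
    original wandb trajectory (hypothetical format).
    """
    obs, info = env.reset()
    trajectory = []

    # Phase 1: replay -- feed recorded actions to the env, no LLM calls.
    for step, (action, expected_obs) in enumerate(recorded_actions):
        obs, score, done, info = env.step(action)
        if obs != expected_obs:
            # Replay fidelity check: warn on divergence, keep going.
            print(f"Warning: replay diverged at step {step}")
        trajectory.append((action, obs, score))
        if done:
            return trajectory  # game finished during replay

    # Phase 2: hand control to the LLM agent for the remaining budget.
    for _ in range(nb_steps - len(recorded_actions)):
        action = agent.act(obs)
        obs, score, done, info = env.step(action)
        trajectory.append((action, obs, score))
        if done:
            break
    return trajectory
```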
Extract shared helpers (`_make_state`, `_step_env`, `_check_invalid`, `_handle_done`, `_record_step`) and standalone `replay_trajectory()` and `play_with_agent()` functions from the monolithic `evaluate()` function. This improves the readability and maintainability of the replay/continue logic while preserving identical behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document that trajectories are truncated to --nb-steps when the original run is longer, and that replay stops early when the game was already completed (max score reached). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
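The truncation and early-stop rules documented above could look like this sketch. The function name, the `"score"` field, and the two-value return are illustrative assumptions, not the PR's actual implementation.

```python
def prepare_replay(recorded_steps, nb_steps, max_score):
    """Truncate a recorded trajectory to nb_steps and decide whether the
    LLM loop should run after the replay.

    Returns (steps_to_replay, run_llm_phase).
    """
    # Truncate when the original run is longer than the requested budget.
    steps = recorded_steps[:nb_steps]

    # Stop replay early (and skip the LLM loop) once max score is reached:
    # the game was already won in the original run.
    for i, step in enumerate(steps):
        if step["score"] >= max_score:
            return steps[: i + 1], False

    # The LLM phase only runs if there is step budget left after replay.
    return steps, len(steps) < nb_steps
```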
- Add gpt-5.3 and gpt-5.4 to `OPENAI_MODELS` in the reasoning agent
- Remove claude-haiku-4.5 from `CLAUDE_MODELS` (not a reasoning model)
- Simplify the token counter: use `startswith` for gpt-5.x and gpt-4.1.x
- Fix the wandb duplicate-run check: always check (not just when `force_all` is off), add the project path, and exclude the 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default project is 'tales'; the entity is resolved by wandb from the logged-in user. Removes hardcoded org/entity references from the codebase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request introduces support for continuing a previous run from a specific point, enabling users to extend experiments without repeating LLM calls for already-completed steps. It also adds robust utilities for searching and retrieving previous runs and their trajectories from Weights & Biases (wandb), and includes a new script to test the determinism of environment replays. These changes improve experiment reproducibility, enable efficient benchmarking, and provide tooling to verify that replayed trajectories match the originals.
New functionality for continuing previous runs:
Updated `README.md` to document the new `--continue-from` flag in `benchmark.py`, which allows users to extend a previous run by replaying the original trajectory (without LLM calls) and then continuing with new steps. It explains both the explicit run ID and auto-find modes, and details the step-by-step process.

wandb integration and utilities:
Added `tales/wandb_utils.py`, which provides `fetch_run_trajectory()` and `find_matching_run()` for retrieving a previous run's trajectory and locating the best-matching prior run. These utilities ensure that experiments can be reliably resumed and compared.
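The auto-find behavior can be approximated by the sketch below, which operates on plain dicts rather than the wandb API (the real `tales/wandb_utils.py` queries wandb directly; the dict layout here is a hypothetical stand-in). It matches on game, LLM model, agent type, and seed, then picks the run with the most steps, as described in the PR summary.

```python
def find_matching_run(runs, game, llm, agent, seed):
    """Among candidate runs, pick the one whose config matches the target
    (game, llm, agent, seed) and that has the most recorded steps."""
    matches = [
        r for r in runs
        if r["config"]["game"] == game
        and r["config"]["llm"] == llm
        and r["config"]["agent"] == agent
        and r["config"]["seed"] == seed
    ]
    if not matches:
        return None
    # Prefer the longest matching run so the replay covers the most steps.
    return max(matches, key=lambda r: r["nb_steps"])
```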
Testing and validation:
Added `scripts/test_replay_determinism.py`, a script that verifies replay determinism by running a game with a random agent, recording the trajectory, and then replaying the same actions to ensure observations and scores match exactly. This helps confirm that the environment and replay logic are deterministic.