
Add --continue-from flag for replaying and extending experiments #36

Merged
MarcCote merged 6 commits into main from macote/support-resume
May 1, 2026

Conversation

@MarcCote
Contributor

This pull request adds support for continuing a previous run from a specific point, so users can extend experiments without repeating LLM calls for steps that are already complete. It also adds utilities for searching and retrieving previous runs and their trajectories from Weights & Biases (wandb), and a new script that tests the determinism of environment replays. Together, these changes improve experiment reproducibility, avoid redundant LLM calls when benchmarking, and provide tooling to verify that replayed trajectories match the originals.

New functionality for continuing previous runs:

  • Updated the README.md to document the new --continue-from flag in benchmark.py, which allows users to extend a previous run by replaying the original trajectory (without LLM calls) and then continuing with new steps. It explains both explicit run ID and auto-find modes, and details the step-by-step process.
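
The two modes described above (explicit run ID versus auto-find) map naturally onto an optional-value command-line flag. The following is a minimal sketch of how such a flag could be parsed with `argparse`; it is not the code from `benchmark.py`. The `AUTO_FIND` sentinel, the helper names, and the `--nb-steps` default of 100 are assumptions made for illustration.

```python
import argparse

# Hypothetical sentinel meaning "flag given without a run ID".
AUTO_FIND = "__auto__"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the --continue-from modes")
    parser.add_argument(
        "--continue-from",
        nargs="?",        # the run ID is optional
        const=AUTO_FIND,  # value used when the flag appears with no run ID
        default=None,     # value used when the flag is absent
        metavar="RUN_ID",
    )
    # Assumed default; the real default is not stated in this PR.
    parser.add_argument("--nb-steps", type=int, default=100)
    return parser


def describe_mode(args: argparse.Namespace) -> str:
    """Map the parsed flag onto the three cases: fresh, auto-find, explicit."""
    if args.continue_from is None:
        return "fresh run"
    if args.continue_from == AUTO_FIND:
        return "auto-find"
    return f"explicit run {args.continue_from}"
```

With `nargs="?"`, passing `--continue-from` followed by another option still selects auto-find, because `argparse` does not consume an option string as the flag's value.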

wandb integration and utilities:

  • Added tales/wandb_utils.py, which provides functions to:
    • Find a matching wandb run based on environment and agent parameters.
    • Fetch the run configuration and rollout trajectory from wandb.
    • Download and parse the rollout JSONL file as a pandas DataFrame.
    These utilities allow experiments to be resumed and compared.
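
As a rough illustration of the selection and parsing steps listed above, the sketch below works on plain dictionaries rather than the wandb API, and uses the standard `json` module instead of pandas so it stays self-contained. The function names and the record keys (`game`, `llm`, `agent_type`, `seed`, `nb_steps`) are hypothetical and do not reflect the actual contents of `tales/wandb_utils.py`.

```python
import json
from typing import Optional


def pick_matching_run(
    runs: list[dict], game: str, llm: str, agent_type: str, seed: int
) -> Optional[dict]:
    """Return the matching run with the most steps, or None if nothing matches."""
    candidates = [
        run
        for run in runs
        if run["game"] == game
        and run["llm"] == llm
        and run["agent_type"] == agent_type
        and run["seed"] == seed
    ]
    if not candidates:
        return None
    # Prefer the longest run so the replay covers as many steps as possible.
    return max(candidates, key=lambda run: run["nb_steps"])


def parse_rollout_jsonl(text: str) -> list[dict]:
    """Parse a JSONL rollout (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

If pandas is available, the list returned by `parse_rollout_jsonl` can be passed to `pandas.DataFrame(...)` to get the tabular form mentioned above.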

Testing and validation:

  • Introduced scripts/test_replay_determinism.py, a script that verifies replay determinism by running a game with a random agent, recording the trajectory, and then replaying the same actions to ensure observations and scores match exactly. This helps confirm that the environment and replay logic are deterministic.
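
The record-then-replay check described above can be sketched against a toy environment. This is an illustrative sketch only, not the contents of `scripts/test_replay_determinism.py`: `ToyEnv`, `record_random_rollout`, and `replay_matches` are hypothetical names, and the real environments expose richer interfaces.

```python
import random


class ToyEnv:
    """Hypothetical deterministic environment whose state is a running total."""

    ACTIONS = ("inc", "dec", "double")

    def __init__(self, seed: int):
        self.seed = seed
        self.total = seed

    def reset(self) -> str:
        # The initial state depends only on the seed.
        self.total = self.seed
        return f"total={self.total}"

    def step(self, action: str):
        if action == "inc":
            self.total += 1
        elif action == "dec":
            self.total -= 1
        else:
            self.total *= 2
        # Return (observation, score).
        return f"total={self.total}", self.total


def record_random_rollout(env, nb_steps: int, agent_seed: int):
    """Play random actions and record (action, observation, score) per step."""
    rng = random.Random(agent_seed)
    env.reset()
    trajectory = []
    for _ in range(nb_steps):
        action = rng.choice(env.ACTIONS)
        obs, score = env.step(action)
        trajectory.append((action, obs, score))
    return trajectory


def replay_matches(env, trajectory) -> bool:
    """Replay the recorded actions; True only if every observation and score match."""
    env.reset()
    for action, obs, score in trajectory:
        if env.step(action) != (obs, score):
            return False
    return True
```

A replay on an environment built with the same seed should match exactly, while one built with a different seed should be flagged as a mismatch.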

MarcCote and others added 6 commits March 27, 2026 11:50
Enables extending experiments beyond their original step limit by replaying
a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- --continue-from <run_id>: replay a specific wandb run
- --continue-from (no value): auto-find the best matching run by game,
  LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls,
  preserving original token usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates trajectory when target steps < existing run steps
- Skips LLM loop when game was already won with max score
- Logs as a new wandb run referencing the original run ID in config
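
The replay behaviour listed above (truncation, divergence warnings, and skipping the LLM loop once the game is finished) can be sketched as follows. This is a simplified illustration under assumed interfaces, not the PR's `replay_trajectory()`: the `CountdownEnv` class, the `replay_sketch` name, and the record keys are all hypothetical.

```python
import warnings


class CountdownEnv:
    """Hypothetical toy env: the game is won after `goal` 'go' actions."""

    def __init__(self, goal: int = 3):
        self.goal = goal
        self.progress = 0

    def reset(self) -> str:
        self.progress = 0
        return "progress=0"

    def step(self, action: str):
        if action == "go":
            self.progress += 1
        done = self.progress >= self.goal
        # Return (observation, score, done).
        return f"progress={self.progress}", self.progress, done


def replay_sketch(env, recorded: list[dict], nb_steps: int) -> dict:
    """Feed recorded actions to the env without calling any agent."""
    steps = recorded[:nb_steps]  # truncate when the target is shorter than the run
    env.reset()
    score, done, replayed, diverged_at = 0, False, 0, []
    for i, rec in enumerate(steps):
        obs, score, done = env.step(rec["action"])
        replayed += 1
        if obs != rec["observation"]:
            # Warn instead of failing, so a divergent replay can still be inspected.
            diverged_at.append(i)
            warnings.warn(f"replay diverged at step {i}")
        if done:
            break
    # Steps left for the agent; zero when the game already ended during replay.
    remaining = 0 if done else max(nb_steps - replayed, 0)
    return {
        "steps_replayed": replayed,
        "score": score,
        "done": done,
        "diverged_at": diverged_at,
        "remaining": remaining,
    }
```

The `remaining` field is what a caller would use to decide whether to hand control to the LLM agent after the replay phase.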

New files:
- tales/wandb_utils.py: fetch_run_trajectory() and find_matching_run()
- scripts/test_replay_determinism.py: validates deterministic replay
  across all 5 environment frameworks (Jericho, TextWorld, ALFWorld,
  TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Extract shared helpers (_make_state, _step_env, _check_invalid,
_handle_done, _record_step) and standalone replay_trajectory() and
play_with_agent() functions from the monolithic evaluate() function.

This improves readability and maintainability of the replay/continue
logic while preserving identical behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document that trajectories are truncated to --nb-steps when the
original run is longer, and that replay stops early when the game
was already completed (max score reached).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add gpt-5.3, gpt-5.4 to OPENAI_MODELS in reasoning agent
- Remove claude-haiku-4.5 from CLAUDE_MODELS (not a reasoning model)
- Simplify token counter: use startswith for gpt-5.x and gpt-4.1.x
- Fix wandb duplicate-run check: always check (not just when
  force_all is off), add project path, exclude 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Default is 'tales'; entity resolved by wandb from the logged-in user.
Removes hardcoded org/entity references from the codebase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@MarcCote merged commit bd346e2 into main on May 1, 2026
5 checks passed
@MarcCote deleted the macote/support-resume branch on May 1, 2026 18:40