Add --continue-from flag for replaying and extending experiments#36
Merged
Conversation
Enables extending experiments beyond their original step limit by replaying a previous trajectory from wandb, then letting the LLM agent take over.

Features:
- `--continue-from <run_id>`: replay a specific wandb run
- `--continue-from` (no value): auto-find the best matching run by game, LLM model, agent type, and seed (picks the run with the most steps)
- Replay phase feeds recorded actions to the env with no LLM calls, preserving original token usage stats from the wandb trajectory
- Verifies replay fidelity by comparing observations (warns on divergence)
- Truncates the trajectory when the target step count is lower than the existing run's step count
- Skips the LLM loop when the game was already won with the max score
- Logs as a new wandb run, referencing the original run ID in the config

New files:
- `tales/wandb_utils.py`: `fetch_run_trajectory()` and `find_matching_run()`
- `scripts/test_replay_determinism.py`: validates deterministic replay across all 5 environment frameworks (Jericho, TextWorld, ALFWorld, TextWorldExpress, ScienceWorld)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
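The replay-then-continue control flow described above can be sketched roughly as follows. This is an illustrative stub, not the PR's actual code: the environment interface, `agent.act()`, and the tuple layout of recorded steps are all assumptions.

```python
def replay_then_continue(env, recorded_actions, agent, nb_steps):
    """Replay recorded actions without LLM calls, then let the agent continue.

    recorded_actions: list of (action, expected_observation) pairs from the
    original wandb trajectory (hypothetical format).
    """
    obs, info = env.reset()
    trajectory = []

    # Phase 1: replay -- feed recorded actions to the env, no LLM calls.
    for step, (action, expected_obs) in enumerate(recorded_actions):
        obs, score, done, info = env.step(action)
        if obs != expected_obs:
            # Replay fidelity check: warn on divergence, keep going.
            print(f"Warning: replay diverged at step {step}")
        trajectory.append((action, obs, score))
        if done:
            return trajectory  # game finished during replay

    # Phase 2: hand control to the LLM agent for the remaining budget.
    for _ in range(nb_steps - len(recorded_actions)):
        action = agent.act(obs)
        obs, score, done, info = env.step(action)
        trajectory.append((action, obs, score))
        if done:
            break
    return trajectory
```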
Extract shared helpers (`_make_state`, `_step_env`, `_check_invalid`, `_handle_done`, `_record_step`) and standalone `replay_trajectory()` and `play_with_agent()` functions from the monolithic `evaluate()` function. This improves the readability and maintainability of the replay/continue logic while preserving identical behavior. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Document that trajectories are truncated to --nb-steps when the original run is longer, and that replay stops early when the game was already completed (max score reached). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
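The truncation and early-stop rules documented above could look like this sketch. The function name, the `"score"` field, and the two-value return are illustrative assumptions, not the PR's actual implementation.

```python
def prepare_replay(recorded_steps, nb_steps, max_score):
    """Truncate a recorded trajectory to nb_steps and decide whether the
    LLM loop should run after the replay.

    Returns (steps_to_replay, run_llm_phase).
    """
    # Truncate when the original run is longer than the requested budget.
    steps = recorded_steps[:nb_steps]

    # Stop replay early (and skip the LLM loop) once max score is reached:
    # the game was already won in the original run.
    for i, step in enumerate(steps):
        if step["score"] >= max_score:
            return steps[: i + 1], False

    # The LLM phase only runs if there is step budget left after replay.
    return steps, len(steps) < nb_steps
```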
- Add gpt-5.3 and gpt-5.4 to `OPENAI_MODELS` in the reasoning agent
- Remove claude-haiku-4.5 from `CLAUDE_MODELS` (not a reasoning model)
- Simplify the token counter: use `startswith` for gpt-5.x and gpt-4.1.x
- Fix the wandb duplicate-run check: always check (not just when `force_all` is off), add the project path, and exclude the 'without-help' tag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default project is 'tales'; the entity is resolved by wandb from the logged-in user. Removes hardcoded org/entity references from the codebase. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request introduces support for continuing a previous run from a specific point, enabling users to extend experiments without repeating LLM calls for already-completed steps. It also adds robust utilities for searching and retrieving previous runs and their trajectories from Weights & Biases (wandb), and includes a new script to test the determinism of environment replays. These changes improve experiment reproducibility, enable efficient benchmarking, and provide tooling to verify that replayed trajectories match the originals.
New functionality for continuing previous runs:
Updated `README.md` to document the new `--continue-from` flag in `benchmark.py`, which allows users to extend a previous run by replaying the original trajectory (without LLM calls) and then continuing with new steps. It explains both the explicit run ID and auto-find modes, and details the step-by-step process.

wandb integration and utilities:
Added `tales/wandb_utils.py`, which provides `fetch_run_trajectory()` and `find_matching_run()` for retrieving a previous run's trajectory and locating the best-matching prior run. These utilities ensure that experiments can be reliably resumed and compared.
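The auto-find behavior can be approximated by the sketch below, which operates on plain dicts rather than the wandb API (the real `tales/wandb_utils.py` queries wandb directly; the dict layout here is a hypothetical stand-in). It matches on game, LLM model, agent type, and seed, then picks the run with the most steps, as described in the PR summary.

```python
def find_matching_run(runs, game, llm, agent, seed):
    """Among candidate runs, pick the one whose config matches the target
    (game, llm, agent, seed) and that has the most recorded steps."""
    matches = [
        r for r in runs
        if r["config"]["game"] == game
        and r["config"]["llm"] == llm
        and r["config"]["agent"] == agent
        and r["config"]["seed"] == seed
    ]
    if not matches:
        return None
    # Prefer the longest matching run so the replay covers the most steps.
    return max(matches, key=lambda r: r["nb_steps"])
```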
Testing and validation:
Added `scripts/test_replay_determinism.py`, a script that verifies replay determinism by running a game with a random agent, recording the trajectory, and then replaying the same actions to ensure observations and scores match exactly. This helps confirm that the environment and replay logic are deterministic.