This repository contains the files needed to benchmark language agents on a curated list of text-based games from the following frameworks: Jericho, TextWorld, TextWorld-Express, ScienceWorld, ALFWorld).
[Technical Report] [Project Page]
It is recommended to create and activate a conda or virtual environment. tales requires Python>=3.12:
conda create -n tales python=3.12
conda activate tales
Then, install tales directly from PyPI:
pip install tale-suite
Warning
The name of the Python package on PyPI is tale-suite and not tales.
Alternatively, clone the repository and install locally:
git clone https://github.com/microsoft/tale-suite
cd tale-suite
pip install -e .
Warning
You will need Java 1.8+ installed to run the environments TextWorld-Express and ScienceWorld.
sudo apt update && apt install openjdk-8-jre-headless -y
Alternatively, if the above isn't working:
sudo apt-get update && apt-get install default-jre default-jdk
We provide a pre-built docker image at
docker pull czcui/twb:prebuilt
An example script can be found in the scripts folder.
-
Run benchmark evaluation on all the games for the specified random agent:
python benchmark.py --agent agents/random.py random
-
Run benchmark evaluation on a subset of the games:
python benchmark.py --agent agents/random.py random --env textworld
-
Run benchmark evaluation on specific games:
python benchmark.py --agent agents/random.py random --envs JerichoEnvZork1 JerichoEnvDetective
-
Run benchmark evaluation using as a HumanAgent:
python benchmark.py --agent agents/human.py human --envs TWCookingLevel1
-
Run benchmark evaluation where the ground-truth walkthrough is being followed:
python benchmark.py --agent agents/walkthrough.py walkthrough --envs JerichoEnvZork1
In order to benchmark a given LLM acting as language agent playing text-based games, you will need to first configure it. tales is leveraging the llm library to handle communication with different LLMs.
python benchmark.py --agent agents/llm.py zero-shot --envs TWCookingLevel1
Use the --continue-from flag to replay a previous wandb-logged trajectory. The replay is deterministic (no LLM calls) and preserves the original token usage stats. This is useful for:
- Extending a short run (e.g., 100 steps → 1000): replays the original trajectory, then lets the LLM take over for the remaining steps.
- Reproducing a previous run exactly: set
--nb-stepsequal to or less than the original run's length to replay without any new LLM calls.
When auto-finding, the longest matching run is always selected and truncated to --nb-steps if needed.
With an explicit run ID:
python benchmark.py reasoning --llm gpt-4o --conversation --continue-from <wandb_run_id> --nb-steps 1000 --envs JerichoEnvZork1 --wandb
Auto-find matching run (searches wandb for a run matching the current game, agent, and seed):
python benchmark.py reasoning --llm gpt-4o --conversation --continue-from --nb-steps 1000 --envs JerichoEnvZork1 --wandb
How it works:
- Fetches the original run's config and rollout from the wandb API (or auto-finds the longest matching run)
- Truncates the trajectory to
--nb-stepsif the original run is longer - Recreates the environment with the same seed for deterministic replay
- Replays recorded actions (no LLM calls, preserving original token usage stats)
- Verifies observations match the logged trajectory (warns on divergence)
- If the game completed during replay (max score reached), stops early
- Otherwise, hands off to the LLM agent for any remaining steps
- Logs as a new wandb run referencing the original run ID
Note
The --continue-from flag expects a wandb run ID (e.g., abc123de) from your wandb project, or no value to auto-find. The agent type and parameters must match the original run. When auto-finding, if no matching run is found, the game runs from scratch.
llm natively supports OpenAI models and self-hosted models that offer an OpenAI-compatible API (e.g. like vLLM does - more on this below).
llm offers different plugins to include other LLMs. E.g.
llm install llm-anthropic
See the llmplugins page for more information.
To serve a custom HugginFace model with vLLM, one can use the vllm docker image like this:
docker run --runtime nvidia --gpus all --restart unless-stopped --name vllm-Llama-3.1-8B-Instruct --env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 4 --host 0.0.0.0
Then, add the following entrypoint in ~/.config/io.datasette.llm/extra-openai-models.yaml
- model_id: meta-llama/Llama-3.1-8B-Instruct
model_name: meta-llama/Llama-3.1-8B-Instruct
api_base: "http://0.0.0.0:8000/v1"
You can check that everything is working properly with this simple command:
llm -m meta-llama/Llama-3.1-8B-Instruct "Hi. What's your name?"
To build a custom agent, you need to create a new file (e.g., custom.py) in the agents folder and implement the Agent class and implement the proper arguments parser.
from typing import Dict, Any
import tales
class CustomAgent(tales.Agent):
def act(self, obs: str, reward: float, done: bool, infos: Dict[str, Any]) -> str:
# ...
return "help"
def build_argparser(parser=None):
return parser or argparse.ArgumentParser()
register(
name="my-agent",
desc=(
"This is a custom agent that always output 'help' as a text action."
),
klass=CustomAgent,
add_arguments=build_argparser,
)You can then use this agent by specifying the path to the file and the class name in the --agent argument.
python benchmark.py --agent agents/custom.py my-agent
Note
See the agents folder for more concrete examples.
TALES offers both train splits and test splits, the latter of which make up the games all models in our technical report were evaluated on.
The following is an example of how to import desired environments and allow an agent to play through them.
Note that importing the relevant framework automatically registers all environments in that framework with gym. You can individually import the frameworks if you want to only evaluate on them one at a time. For now, we do not include a jericho train split.
import gymnasium as gym
from tales import *
# Training splits
train_envs = [env_spec.id for env_spec in gym.envs.registry.values() if "tales/" in env_spec.id and 'train' in env_spec.id]
# Testing splits
envs = [env_spec.id for env_spec in gym.envs.registry.values() if "tales/" in env_spec.id and 'train' not in env_spec.id]
train_env = gym.make(
train_envs[0],
disable_env_checker=True,
admissible_commands=True,
)
test_env = gym.make(
envs[0],
disable_env_checker=True,
admissible_commands=True,
)
@article{cui2025tales,
title={TALES: Text-Adventure Learning Environment Suite},
author={Christopher Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, Marc-Alexandre C\^ot\'e},
journal={arXiv preprint arXiv:2504.14128},
year={2025},
url={https://arxiv.org/abs/2504.14128}
}
If you use this benchmark, please consider citing the original frameworks as well.
@article{cote18textworld,
author = {Marc-Alexandre C\^ot\'e and \'Akos K\'ad\'ar and Xingdi Yuan and Ben Kybartas and Tavian Barnes and Emery Fine and James Moore and Ruo Yu Tao and Matthew Hausknecht and Layla El Asri and Mahmoud Adada and Wendy Tay and Adam Trischler},
title = {TextWorld: A Learning Environment for Text-based Games},
journal = {CoRR},
volume = {abs/1806.11532},
year = {2018}
}
@article{jansen2022textworldexpress,
url = {https://arxiv.org/abs/2208.01174},
author = {Jansen, Peter A. and Côté, Marc-Alexandre},
title = {TextWorldExpress: Simulating Text Games at One Million Steps Per Second},
journal = {arXiv},
year = {2022},
}
@inproceedings{hausknecht2020interactive,
title={Interactive fiction games: A colossal adventure},
author={Hausknecht, Matthew and Ammanabrolu, Prithviraj and C{\^o}t{\'e}, Marc-Alexandre and Yuan, Xingdi},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={34},
number={05},
year={2020}
}
@inproceedings{ALFWorld20,
title ={{ALFWorld: Aligning Text and Embodied Environments for Interactive Learning}},
author={Mohit Shridhar and Xingdi Yuan and Marc-Alexandre C\^ot\'e and Yonatan Bisk and Adam Trischler and Matthew Hausknecht},
booktitle = {Proceedings of the International
Conference on Learning Representations (ICLR)},
year = {2021},
url = {https://arxiv.org/abs/2010.03768}}
@misc{scienceworld2022,
title={ScienceWorld: Is your Agent Smarter than a 5th Grader?},
author={Ruoyao Wang and Peter Jansen and Marc-Alexandre C{\^o}t{\'e} and Prithviraj Ammanabrolu},
year={2022},
eprint={2203.07540},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2203.07540}
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
This framework does not collect user's personal data. For more information about Microsoft's privacy policies. Please see Microsoft Privacy Statement.
Please see our Responsible AI Statement.