A collaborative alternative to self-play. Teach Large Language Models to reason without any external training data through a Coach that proposes targeted instructions and a Player that learns by solving them.
Training strong reasoning LLMs traditionally depends on massive sets of human-curated tasks and labels, consumed through SFT or RL on reasoning-specific data. This dependence is increasingly unsustainable, and the diminishing returns of supervision-heavy paradigms are already visible in practice.
CPMöbius breaks this dependence with a collaborative Coach–Player paradigm for data-free reinforcement learning of reasoning models. Unlike adversarial self-play (e.g., R-Zero, AZR), CPMöbius — inspired by real-world sports coaching and multi-agent collaboration — treats the Coach and Player as independent but cooperative roles:
- The Coach 🧭 proposes instructions targeted at the Player's current capability and is rewarded based on changes in the Player's performance, not on beating the Player.
- The Player ⚽ is rewarded for solving the increasingly instructive tasks generated by the Coach.
This forms a Möbius-like cooperative optimization loop: the Coach learns what the Player needs next, and the Player learns how to solve it — with no external problem set, no human labels, and no adversarial pressure.
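As a rough mental model, one round of this loop might look like the Python pseudocode below. The objects and functions (coach, player, probe_set, probe_accuracy, propose_tasks, solve, update) are illustrative placeholders, not the repository's actual API; only the reward assignment mirrors the description above.

```python
# Illustrative pseudocode for one Coach-Player round (placeholder interfaces,
# not the repository's actual API).
def coach_player_round(coach, player, probe_set, probe_accuracy, n_tasks=64):
    # Coach proposes instructions aimed at the Player's current capability.
    tasks = coach.propose_tasks(n=n_tasks)

    # Measure the Player before and after it trains on the proposed tasks.
    acc_before = probe_accuracy(player, probe_set)
    rollouts = [player.solve(task) for task in tasks]
    player.update(rollouts)            # RL step: Player rewarded for solving the tasks
    acc_after = probe_accuracy(player, probe_set)

    # Coach is rewarded by the *change* in Player performance (cooperative),
    # not for making the Player fail (adversarial).
    coach.update(tasks, reward=acc_after - acc_before)
    return acc_after
```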
- Fully Data-Free. No external problems, no annotated solutions, no SFT warm-up on reasoning data.
- Collaborative. The Coach is rewarded for the Player's improvement, sidestepping the instability and reward hacking that plague competitive self-play.
- Adaptive Curriculum. Tasks generated by the Coach are continuously calibrated to the Player's evolving frontier, keeping problems neither too easy nor too hard (see the sketch after this list).
- Flexible Environment Feedback. Environment feedback is not tied to AMC as the held-out validation set; other validation sets such as AIME, Minerva, and OlympiadBench can be used instead.
- Outperforms Unsupervised Baselines. CPMöbius outperforms RENT on overall accuracy and R-Zero on OOD accuracy under matched settings.
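To make the Adaptive Curriculum point concrete, here is a purely illustrative sketch of frontier calibration: keep only proposed tasks whose estimated Player success rate falls in a middle band. This is not the repository's actual selection rule, and estimate_success_rate is a hypothetical helper (e.g., the pass rate over a few Player rollouts per task).

```python
# Illustrative only: keep Coach proposals near the Player's current frontier.
def frontier_filter(tasks, estimate_success_rate, low=0.2, high=0.8):
    """Drop tasks the Player already solves reliably or cannot solve at all.

    estimate_success_rate(task) -> empirical pass rate in [0, 1] (hypothetical helper).
    """
    return [t for t in tasks if low <= estimate_success_rate(t) <= high]
```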
git clone https://github.com/thunlp/CPMobius.git
cd CPMobius
conda env create -f env.yml
conda activate cpmobius

Pick the launch script for your base model and edit the placeholders in it:
| Base model | Script |
|---|---|
| Qwen2.5-Math-1.5B | scripts/run_qwen2.5_math_1.5b.sh |
| Qwen2.5-Math-7B-Instruct | scripts/run_qwen2.5_math_7b_instruct.sh |
| OpenMath-Nemotron-1.5B | scripts/run_openmath_nemotron_1.5B.sh |
| OctoThinker-3B-Hybrid-0 | scripts/run_octothinker_3B_hybrid_zero.sh |
# To use wandb instead of swanlab: export WANDB_API_KEY='Your Wandb API Key', change trainer.logger=['console','swanlab'] to trainer.logger=['console','wandb'] in the script, and replace SWANLAB_LOG_DIR with WANDB_LOG_DIR.
export SWANLAB_API_KEY='Your Swanlab API Key'
SWANLAB_LOG_DIR='Your Swanlab Log Directory'
VAL_FILES="Your path to validation parquet file"
PLAYER_MODEL_PATH="Your path to Qwen2.5-Math-1.5B"
COACH_MODEL_PATH="Your path to coach model"
CKPT_DIR="Your Checkpoint Directory"

bash scripts/run_qwen2.5_math_1.5b.sh

The script runs the Coach–Player training loop and saves checkpoints under CKPT_DIR.
bash utils/convert.sh <path_to_your_checkpoint1> <path_to_your_checkpoint2> ...

evaluation/run_math_all.sh evaluates a checkpoint on six math benchmarks through a single chat-style pipeline (utils/chat_eval.py + utils/parquet_loader.py). The parquet files in evaluation/data/ are pre-replicated (suffix _xN), so one greedy sample per row already gives self-consistency over N rollouts — to change N, regenerate the parquet.
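If you need a different N, regenerating a replicated parquet is straightforward. Below is a minimal pandas sketch; the file names follow the _xN convention, but the paths and the assumption that a one-row-per-problem source parquet exists are illustrative, so inspect the files in evaluation/data/ first.

```python
# Sketch: rebuild an evaluation parquet with a different self-consistency N.
# Paths and schema are assumptions; check evaluation/data/ before running.
import pandas as pd

def replicate_parquet(src, dst, n):
    df = pd.read_parquet(src)                                  # one row per problem
    out = df.loc[df.index.repeat(n)].reset_index(drop=True)    # n copies of each row
    out.to_parquet(dst, index=False)

replicate_parquet("evaluation/data/math500.parquet",
                  "evaluation/data/math500_x16.parquet", n=16)
```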
cd evaluation
conda create -n cpmobius_eval python=3.10 -y
conda activate cpmobius_eval
pip install -r requirements_prime.txt  # pulls vllm, transformers, pandas, pyarrow

# all six benchmarks
bash run_math_all.sh --model /path/to/checkpoint
# subset + custom sampling + output / GPU overrides
OUTPUT_ROOT=./my_results CUDA_VISIBLE_DEVICES=0,1 \
bash run_math_all.sh --model /path/to/ckpt \
--tasks "math500,aime,aime2025" \
--temperature 0.7 --top-p 0.9 --repetition-penalty 1.05

| Flag / env | Default | Notes |
|---|---|---|
| --model | (required) | HuggingFace path or local checkpoint |
| --tasks | all | comma-separated subset |
| --run-name | basename($MODEL) | sub-directory under $OUTPUT_ROOT |
| --temperature | 0.7 | sampling temperature |
| --top-p | 0.9 | nucleus top-p |
| --repetition-penalty | 1.05 | vLLM repetition penalty |
| --max-tokens | 4096 | per-sample max new tokens |
| DATA_ROOT | evaluation/data | parquet root |
| OUTPUT_ROOT | evaluation/results | output root |
| CUDA_VISIBLE_DEVICES | inherits shell | GPUs forwarded to vLLM |
Each task writes to $OUTPUT_ROOT/<run-name>/<task>/<sampling-tag>/, where <sampling-tag> is <temp>-<top_p>-<rep> with dots replaced by p (e.g. 0p7-0p9-1p05):
completions.jsonl # raw model completions
results.jsonl # extracted answer + correctness per row
results_summary.json # {benchmark, total, correct, accuracy}
results_summary.txt # human-readable one-liner
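A small helper for collecting a run's per-task numbers: the results_summary.json schema is the one listed above, and the glob pattern follows the $OUTPUT_ROOT/<run-name>/<task>/<sampling-tag>/ layout. The default arguments below (output root, run name) are just examples.

```python
# Sketch: gather per-task accuracy from one run's results_summary.json files.
import glob, json, os

def summarize_run(output_root="evaluation/results", run_name="my_ckpt"):
    pattern = os.path.join(output_root, run_name, "*", "*", "results_summary.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            s = json.load(f)                 # {benchmark, total, correct, accuracy}
        print(f'{s["benchmark"]:20s} {s["correct"]}/{s["total"]}  acc={s["accuracy"]}')

summarize_run()
```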
CPMöbius is evaluated on a suite of mathematical reasoning benchmarks, with both in-distribution (ID) and out-of-distribution (OOD) averages reported. We select four base models for our training experiments, representing the three main stages of a typical LLM training lifecycle: pre-training, supervised fine-tuning (SFT), and reinforcement learning.
Performance comparison between CPMöbius and baseline methods on mathematical reasoning benchmarks. Overall Average is the mean over all benchmarks. OOD Average is the mean over all benchmarks except AMC, because RENT was trained on AMC and CPMöbius validation also used AMC — separating it gives a fair in-distribution (AMC) vs. out-of-distribution comparison. Bold values indicate the best performance for each metric.
| Models | Average | OOD Average | AMC | AIME 2024 | AIME 2025 | Minerva | MATH | Olympiad |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ||||||||
| Base Model | 23.3 | 19.8 | 34.6 | 6.2 | 2.8 | 16.3 | 56.2 | 23.4 |
| R-Zero (Iter 3) | 27.1 | 24.7 | 39.2 | 9.8 | 5.0 | 19.3 | 62.4 | 26.8 |
| RENT | 27.1 | 24.7 | 39.3 | 10.0 | 5.0 | 19.0 | 62.2 | 27.1 |
| CPMöbius | 28.8 | 26.8 | 39.4 | 9.8 | 5.4 | 28.0 | 63.1 | 26.9 |
| OpenMath-Nemotron-1.5B | ||||||||
| Base Model | 59.5 | 54.9 | 82.3 | 55.6 | 43.3 | 25.1 | 89.4 | 61.0 |
| R-Zero (Iter 3) | – | – | – | – | – | – | – | – |
| RENT | 61.7 | 56.5 | 87.7 | 55.0 | 46.0 | 24.2 | 90.7 | 66.7 |
| CPMöbius | 62.1 | 57.0 | 87.5 | 54.9 | 46.9 | 24.3 | 91.2 | 67.9 |
| OctoThinker-3B-Hybrid-Zero | ||||||||
| Base Model | 21.3 | 20.6 | 24.6 | 3.9 | 1.7 | 16.3 | 57.9 | 23.4 |
| R-Zero (Iter 3) | 20.5 | 19.5 | 25.9 | 2.0 | 0.3 | 14.6 | 58.1 | 22.3 |
| RENT | 23.0 | 21.7 | 29.2 | 7.3 | 2.1 | 15.0 | 60.2 | 24.1 |
| CPMöbius | 23.6 | 22.0 | 28.0 | 4.8 | 1.7 | 22.1 | 60.4 | 24.7 |
| Qwen2.5-Math-7B-Instruct | ||||||||
| Base Model | 35.8 | 33.0 | 49.2 | 9.0 | 6.3 | 34.6 | 78.0 | 37.4 |
| R-Zero (Iter 3) | 36.9 | 34.2 | 50.5 | 9.5 | 7.4 | 32.7 | 83.3 | 38.1 |
| RENT | 39.2 | 37.6 | 53.1 | 10.8 | 9.9 | 38.8 | 83.8 | 38.8 |
| CPMöbius | 40.7 | 38.4 | 55.6 | 11.8 | 9.6 | 44.9 | 84.2 | 38.3 |
Headline takeaways:
- CPMöbius improves overall accuracy by +4.9 on Qwen2.5-Math-7B-Instruct without any external data.
- On OOD benchmarks, CPMöbius gains +5.4 over the base model, exceeding R-Zero by +4.2.
- CPMöbius exceeds RENT by +1.5 on overall accuracy, demonstrating that targeted task generation outperforms pure entropy minimization.
Full per-benchmark numbers (MATH, AIME, AMC, OlympiadBench, College-Math, GaoKao, etc.) are in Section 5 of the paper.
Q: How is CPMöbius different from adversarial self-play methods like R-Zero and AZR?

A: R-Zero and AZR are both excellent frameworks, but they cast the question generator and the solver as adversaries, i.e., the generator is rewarded for finding problems the solver fails on. In contrast, CPMöbius is collaborative: the Coach is rewarded by the Player's improvement (a change-in-performance signal), so the Coach has no incentive to push the Player off a cliff. Empirically this yields a more stable curriculum, especially on OOD benchmarks.
Q: What does "data-free" mean here?

A: It means CPMöbius does not consume any human-written reasoning problems or human-annotated solutions during the Coach–Player co-evolution loop. The base model and tokenizer are pretrained as usual, and the evaluation benchmarks remain held-out. No external math problem set (e.g., MATH train, NuminaMath, GSM8K) is used during training.
Q: How is the Coach rewarded without any human labels?

A: The Coach is rewarded based on changes in the Player's performance on a held-out probe set evaluated under self-consistency / majority-voting style pseudo-labels. This avoids the need for human labels while still giving the Coach a directional signal toward "instructive" task distributions. See Section 3 of the paper.
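As a simplified illustration of this label-free signal, the sketch below scores a probe set by majority-vote agreement and rewards the Coach with the change in that score. The player.sample_answers interface and the use of raw agreement as the score are assumptions for illustration, not the paper's exact procedure (see Section 3 for that).

```python
# Simplified illustration of a label-free probe signal (not the paper's exact method).
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled final answers agreeing with the majority vote."""
    _, votes = Counter(answers).most_common(1)[0]
    return votes / len(answers)

def probe_score(player, probe_problems, k=8):
    """player.sample_answers(problem, k) -> k extracted answers (hypothetical interface)."""
    scores = [self_consistency(player.sample_answers(p, k)) for p in probe_problems]
    return sum(scores) / len(scores)

# Coach reward = change in the Player's probe score across one Player update:
#   coach_reward = probe_score(player_after, probe) - probe_score(player_before, probe)
```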
Q: Why the name "Möbius"?

A: The Coach–Player loop has no fixed "stronger" or "weaker" side; the two roles co-evolve along a single continuous optimization trajectory, like a Möbius strip with no separate inside and outside.
Our RL training stack is built on veRL. We use vLLM for rollouts and scripts from PRIME for evaluation. We thank all of the authors at THUNLP for their excellent work.
If our work is useful for you, please consider citing the paper:
@article{li2026cpmobius,
title={CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning},
author={Li, Ran and Liu, Zeyuan and Chen, Yinghao and He, Bingxiang and Yuan, Jiarui and Fu, Zixuan and Chen, Weize and Hu, Jinyi and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2602.02979},
year={2026}
}

