A collaborative alternative to self-play. Teach Large Language Models to reason without any external training data through a Coach that proposes targeted instructions and a Player that learns by solving them.
Training strong reasoning LLMs traditionally depends on massive sets of human-curated tasks and labels, consumed through SFT or RL on reasoning-specific data. This dependence is increasingly unsustainable, and the diminishing returns of supervision-heavy paradigms are already visible in practice.
CPMöbius breaks this dependence with a collaborative Coach–Player paradigm for data-free reinforcement learning of reasoning models. Unlike adversarial self-play (e.g., R-Zero, AZR), CPMöbius — inspired by real-world sports coaching and multi-agent collaboration — treats the Coach and Player as independent but cooperative roles:
- The Coach 🧭 proposes instructions targeted at the Player's current capability and is rewarded based on changes in the Player's performance, not on beating the Player.
- The Player ⚽ is rewarded for solving the increasingly instructive tasks generated by the Coach.
This forms a Möbius-like cooperative optimization loop: the Coach learns what the Player needs next, and the Player learns how to solve it — with no external problem set, no human labels, and no adversarial pressure.
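As a rough mental model, one round of this loop might look like the Python pseudocode below. The objects and functions (coach, player, probe_set, probe_accuracy, propose_tasks, solve, update) are illustrative placeholders, not the repository's actual API; only the reward assignment mirrors the description above.

```python
# Illustrative pseudocode for one Coach-Player round (placeholder interfaces,
# not the repository's actual API).
def coach_player_round(coach, player, probe_set, probe_accuracy, n_tasks=64):
    # Coach proposes instructions aimed at the Player's current capability.
    tasks = coach.propose_tasks(n=n_tasks)

    # Measure the Player before and after it trains on the proposed tasks.
    acc_before = probe_accuracy(player, probe_set)
    rollouts = [player.solve(task) for task in tasks]
    player.update(rollouts)            # RL step: Player rewarded for solving the tasks
    acc_after = probe_accuracy(player, probe_set)

    # Coach is rewarded by the *change* in Player performance (cooperative),
    # not for making the Player fail (adversarial).
    coach.update(tasks, reward=acc_after - acc_before)
    return acc_after
```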
- Fully Data-Free. No external problems, no annotated solutions, no SFT warm-up on reasoning data.
- Collaborative. The Coach is rewarded for the Player's improvement, sidestepping the instability and reward hacking that plague competitive self-play.
- Adaptive Curriculum. Tasks generated by the Coach are continuously calibrated to the Player's evolving frontier, keeping problems neither too easy nor too hard (see the sketch after this list).
- Flexible Environment Feedback. Environment feedback is not tied to AMC as the held-out validation set; other validation sets such as AIME, Minerva, and OlympiadBench can be used instead.
- Outperforms Unsupervised Baselines. CPMöbius outperforms RENT on overall accuracy and R-Zero on OOD accuracy under matched settings.
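To make the Adaptive Curriculum point concrete, here is a purely illustrative sketch of frontier calibration: keep only proposed tasks whose estimated Player success rate falls in a middle band. This is not the repository's actual selection rule, and estimate_success_rate is a hypothetical helper (e.g., the pass rate over a few Player rollouts per task).

```python
# Illustrative only: keep Coach proposals near the Player's current frontier.
def frontier_filter(tasks, estimate_success_rate, low=0.2, high=0.8):
    """Drop tasks the Player already solves reliably or cannot solve at all.

    estimate_success_rate(task) -> empirical pass rate in [0, 1] (hypothetical helper).
    """
    return [t for t in tasks if low <= estimate_success_rate(t) <= high]
```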
git clone https://github.com/thunlp/CPMobius.git
cd CPMobius
conda env create -f env.yml
conda activate cpmobius

Pick the launch script for your base model and edit the placeholders in it:
| Base model | Script |
|---|---|
| Qwen2.5-Math-1.5B | scripts/run_qwen2.5_math_1.5b.sh |
| Qwen2.5-Math-7B-Instruct | scripts/run_qwen2.5_math_7b_instruct.sh |
| OpenMath-Nemotron-1.5B | scripts/run_openmath_nemotron_1.5B.sh |
| OctoThinker-3B-Hybrid-0 | scripts/run_octothinker_3B_hybrid_zero.sh |
# To use wandb instead of swanlab: export WANDB_API_KEY='Your Wandb API Key', change trainer.logger=['console','swanlab'] to trainer.logger=['console','wandb'] in the script, and replace SWANLAB_LOG_DIR with WANDB_LOG_DIR.
export SWANLAB_API_KEY='Your Swanlab API Key'
SWANLAB_LOG_DIR='Your Swanlab Log Directory'
VAL_FILES="Your path to validation parquet file"
PLAYER_MODEL_PATH="Your path to Qwen2.5-Math-1.5B"
COACH_MODEL_PATH="Your path to coach model"
CKPT_DIR="Your Checkpoint Directory"

bash scripts/run_qwen2.5_math_1.5b.sh

The script runs the Coach–Player training loop and saves checkpoints under CKPT_DIR.
bash utils/convert.sh <path_to_your_checkpoint1> <path_to_your_checkpoint2> ...

evaluation/run_math_all.sh evaluates a checkpoint on six math benchmarks through a single chat-style pipeline (utils/chat_eval.py + utils/parquet_loader.py). The parquet files in evaluation/data/ are pre-replicated (suffix _xN), so one greedy sample per row already gives self-consistency over N rollouts — to change N, regenerate the parquet.
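If you need a different N, regenerating a replicated parquet is straightforward. Below is a minimal pandas sketch; the file names follow the _xN convention, but the paths and the assumption that a one-row-per-problem source parquet exists are illustrative, so inspect the files in evaluation/data/ first.

```python
# Sketch: rebuild an evaluation parquet with a different self-consistency N.
# Paths and schema are assumptions; check evaluation/data/ before running.
import pandas as pd

def replicate_parquet(src, dst, n):
    df = pd.read_parquet(src)                                  # one row per problem
    out = df.loc[df.index.repeat(n)].reset_index(drop=True)    # n copies of each row
    out.to_parquet(dst, index=False)

replicate_parquet("evaluation/data/math500.parquet",
                  "evaluation/data/math500_x16.parquet", n=16)
```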
cd evaluation
conda create -n cpmobius_eval python=3.10 -y
conda activate cpmobius_eval
pip install -r requirements_prime.txt  # pulls vllm, transformers, pandas, pyarrow

# all six benchmarks
bash run_math_all.sh --model /path/to/checkpoint
# subset + custom sampling + output / GPU overrides
OUTPUT_ROOT=./my_results CUDA_VISIBLE_DEVICES=0,1 \
bash run_math_all.sh --model /path/to/ckpt \
--tasks "math500,aime,aime2025" \
--temperature 0.7 --top-p 0.9 --repetition-penalty 1.05

| Flag / env | Default | Notes |
|---|---|---|
| --model | (required) | HuggingFace path or local checkpoint |
| --tasks | all | comma-separated subset |
| --run-name | basename($MODEL) | sub-directory under $OUTPUT_ROOT |
| --temperature | 0.7 | sampling temperature |
| --top-p | 0.9 | nucleus top-p |
| --repetition-penalty | 1.05 | vLLM repetition penalty |
| --max-tokens | 4096 | per-sample max new tokens |
| DATA_ROOT | evaluation/data | parquet root |
| OUTPUT_ROOT | evaluation/results | output root |
| CUDA_VISIBLE_DEVICES | inherits shell | GPUs forwarded to vLLM |
Each task writes to $OUTPUT_ROOT/<run-name>/<task>/<sampling-tag>/, where <sampling-tag> is <temp>-<top_p>-<rep> with dots replaced by p (e.g. 0p7-0p9-1p05):
completions.jsonl # raw model completions
results.jsonl # extracted answer + correctness per row
results_summary.json # {benchmark, total, correct, accuracy}
results_summary.txt # human-readable one-liner
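A small helper for collecting a run's per-task numbers: the results_summary.json schema is the one listed above, and the glob pattern follows the $OUTPUT_ROOT/<run-name>/<task>/<sampling-tag>/ layout. The default arguments below (output root, run name) are just examples.

```python
# Sketch: gather per-task accuracy from one run's results_summary.json files.
import glob, json, os

def summarize_run(output_root="evaluation/results", run_name="my_ckpt"):
    pattern = os.path.join(output_root, run_name, "*", "*", "results_summary.json")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            s = json.load(f)                 # {benchmark, total, correct, accuracy}
        print(f'{s["benchmark"]:20s} {s["correct"]}/{s["total"]}  acc={s["accuracy"]}')

summarize_run()
```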
CPMöbius is evaluated on a suite of mathematical reasoning benchmarks, with both in-distribution (ID) and out-of-distribution (OOD) averages reported. We select four base models for our training experiments, representing the three main stages of a typical LLM training lifecycle: pre-training, supervised fine-tuning (SFT), and reinforcement learning.
Performance comparison between CPMöbius and baseline methods on mathematical reasoning benchmarks. Overall Average is the mean over all benchmarks. OOD Average is the mean over all benchmarks except AMC, because RENT was trained on AMC and CPMöbius validation also used AMC — separating it gives a fair in-distribution (AMC) vs. out-of-distribution comparison. Bold values indicate the best performance for each metric.
| Models | Average | OOD Average | AMC | AIME 2024 | AIME 2025 | Minerva | MATH | Olympiad |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | ||||||||
| Base Model | 23.3 | 19.8 | 34.6 | 6.2 | 2.8 | 16.3 | 56.2 | 23.4 |
| R-Zero (Iter 3) | 27.1 | 24.7 | 39.2 | 9.8 | 5.0 | 19.3 | 62.4 | 26.8 |
| RENT | 27.1 | 24.7 | 39.3 | 10.0 | 5.0 | 19.0 | 62.2 | 27.1 |
| CPMöbius | 28.8 | 26.8 | 39.4 | 9.8 | 5.4 | 28.0 | 63.1 | 26.9 |
| OpenMath-Nemotron-1.5B | ||||||||
| Base Model | 59.5 | 54.9 | 82.3 | 55.6 | 43.3 | 25.1 | 89.4 | 61.0 |
| R-Zero (Iter 3) | – | – | – | – | – | – | – | – |
| RENT | 61.7 | 56.5 | 87.7 | 55.0 | 46.0 | 24.2 | 90.7 | 66.7 |
| CPMöbius | 62.1 | 57.0 | 87.5 | 54.9 | 46.9 | 24.3 | 91.2 | 67.9 |
| OctoThinker-3B-Hybrid-Zero | ||||||||
| Base Model | 21.3 | 20.6 | 24.6 | 3.9 | 1.7 | 16.3 | 57.9 | 23.4 |
| R-Zero (Iter 3) | 20.5 | 19.5 | 25.9 | 2.0 | 0.3 | 14.6 | 58.1 | 22.3 |
| RENT | 23.0 | 21.7 | 29.2 | 7.3 | 2.1 | 15.0 | 60.2 | 24.1 |
| CPMöbius | 23.6 | 22.0 | 28.0 | 4.8 | 1.7 | 22.1 | 60.4 | 24.7 |
| Qwen2.5-Math-7B-Instruct | ||||||||
| Base Model | 35.8 | 33.0 | 49.2 | 9.0 | 6.3 | 34.6 | 78.0 | 37.4 |
| R-Zero (Iter 3) | 36.9 | 34.2 | 50.5 | 9.5 | 7.4 | 32.7 | 83.3 | 38.1 |
| RENT | 39.2 | 37.6 | 53.1 | 10.8 | 9.9 | 38.8 | 83.8 | 38.8 |
| CPMöbius | 40.7 | 38.4 | 55.6 | 11.8 | 9.6 | 44.9 | 84.2 | 38.3 |
Headline takeaways:
- CPMöbius improves overall accuracy by +4.9 on Qwen2.5-Math-7B-Instruct without any external data.
- On OOD benchmarks, CPMöbius gains +5.4 over the base model, exceeding R-Zero by +4.2.
- CPMöbius exceeds RENT by +1.5 on overall accuracy, demonstrating that targeted task generation outperforms pure entropy minimization.
Full per-benchmark numbers (MATH, AIME, AMC, OlympiadBench, College-Math, GaoKao, etc.) are in Section 5 of the paper.
Q: How is CPMöbius different from adversarial self-play methods like R-Zero and AZR?

A: R-Zero and AZR are both excellent frameworks, but they cast the question generator and the solver as adversaries, i.e., the generator is rewarded for finding problems the solver fails on. In contrast, CPMöbius is collaborative: the Coach is rewarded by the Player's improvement (a change-in-performance signal), so the Coach has no incentive to push the Player off a cliff. Empirically this yields a more stable curriculum, especially on OOD benchmarks.
Q: What does "data-free" mean here?

A: It means CPMöbius does not consume any human-written reasoning problems or human-annotated solutions during the Coach–Player co-evolution loop. The base model and tokenizer are pretrained as usual, and the evaluation benchmarks remain held-out. No external math problem set (e.g., MATH train, NuminaMath, GSM8K) is used during training.
Q: How is the Coach rewarded without any human labels?

A: The Coach is rewarded based on changes in the Player's performance on a held-out probe set evaluated under self-consistency / majority-voting style pseudo-labels. This avoids the need for human labels while still giving the Coach a directional signal toward "instructive" task distributions. See Section 3 of the paper.
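As a simplified illustration of this label-free signal, the sketch below scores a probe set by majority-vote agreement and rewards the Coach with the change in that score. The player.sample_answers interface and the use of raw agreement as the score are assumptions for illustration, not the paper's exact procedure (see Section 3 for that).

```python
# Simplified illustration of a label-free probe signal (not the paper's exact method).
from collections import Counter

def self_consistency(answers):
    """Fraction of sampled final answers agreeing with the majority vote."""
    _, votes = Counter(answers).most_common(1)[0]
    return votes / len(answers)

def probe_score(player, probe_problems, k=8):
    """player.sample_answers(problem, k) -> k extracted answers (hypothetical interface)."""
    scores = [self_consistency(player.sample_answers(p, k)) for p in probe_problems]
    return sum(scores) / len(scores)

# Coach reward = change in the Player's probe score across one Player update:
#   coach_reward = probe_score(player_after, probe) - probe_score(player_before, probe)
```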
Q: Why the name "Möbius"?

A: The Coach–Player loop has no fixed "stronger" or "weaker" side; the two roles co-evolve along a single continuous optimization trajectory, like a Möbius strip with no separate inside and outside.
Our RL training stack is built on veRL. We use vLLM for rollouts and scripts from PRIME for evaluation. We thank all of the authors at THUNLP for their excellent work.
If our work is useful for you, please consider citing the paper:
@article{li2026cpmobius,
title={CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning},
author={Li, Ran and Liu, Zeyuan and Chen, Yinghao and He, Bingxiang and Yuan, Jiarui and Fu, Zixuan and Chen, Weize and Hu, Jinyi and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2602.02979},
year={2026}
}

