Context
We use verl-agent/VAGEN for multi-turn VLM GRPO training because TRL (Hugging Face) cannot handle multi-turn VLM rollouts: chat template flattening destroys multimodal data before it reaches `rollout_func`.
Decision doc: docs/verl_agent_decision.md (PR #84)
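The failure mode above can be illustrated with a minimal sketch (the message structure and `flatten_with_chat_template` helper are hypothetical stand-ins, not TRL's actual API): a text-only chat template walks structured messages and keeps only text parts, so image payloads silently disappear before rollout.

```python
# Hypothetical illustration of the failure mode: applying a text-only chat
# template to structured multimodal messages collapses them to a single
# string, so image payloads never reach the rollout function.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<PIL.Image of the desktop>"},
            {"type": "text", "text": "Click the Save button."},
        ],
    }
]

def flatten_with_chat_template(msgs):
    """Mimics text-only chat templating: keeps text entries, drops images."""
    parts = []
    for msg in msgs:
        for item in msg["content"]:
            if item["type"] == "text":
                parts.append(f"{msg['role']}: {item['text']}")
    return "\n".join(parts)

flat = flatten_with_chat_template(messages)
print(flat)  # the image entry is gone; only the text survives
```

This is why the rollout layer must receive the structured messages themselves (what TRL #5120 proposes), not a pre-flattened prompt string.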
Upstream Dependency
- TRL #5120: Preserve structured multimodal messages through rollout and generation pipeline (opened Feb 18, 2026, OPEN)
- TRL #5119: Decouple inference backend from rollout & agent logic (OPEN)
When to Revisit
Check quarterly (June, September, December 2026). If any of the following holds:
- TRL #5120 is resolved or has a merged fix
- TRL's GRPOTrainer passes multi-turn VLM E2E tests
- TRL release notes announce multi-turn VLM GRPO support
Then:
- Test TRL against our WAA RL environment (RLEnvironment/WAADesktopEnv)
- Benchmark: verl-agent vs TRL on the same task (wall time, VRAM, convergence)
- If TRL matches verl-agent AND adds per-step credit assignment (GiGPO equivalent), consider switching
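The benchmark step could be scripted with a small harness like the one below (a sketch; `train_fn` is a hypothetical stand-in for the real verl-agent or TRL training entry point). Wall time comes from `time.perf_counter`; peak VRAM from `torch.cuda.max_memory_allocated` when CUDA is available.

```python
import time

def benchmark(train_fn, label):
    """Run one training function and record wall time and peak VRAM."""
    torch = None
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
    except ImportError:
        pass
    start = time.perf_counter()
    metrics = train_fn()  # expected to return a dict of training metrics
    wall = time.perf_counter() - start
    peak_vram_gb = None
    if torch is not None and torch.cuda.is_available():
        peak_vram_gb = torch.cuda.max_memory_allocated() / 1e9
    return {"label": label, "wall_time_s": wall,
            "peak_vram_gb": peak_vram_gb, **metrics}

# Usage with a dummy stand-in for a real training run:
row = benchmark(lambda: {"final_reward": 0.8}, "verl-agent")
```

Running the same harness over both frameworks on the same task gives directly comparable rows for the convergence/wall-time/VRAM comparison.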
Why We'd Want to Switch
verl-agent is excellent but adds Ray/vLLM complexity. TRL has broader adoption and simpler deployment. Switching would reduce the dependency footprint. But only if TRL also adds per-step credit assignment — without GiGPO-equivalent step-level advantages, training on 15+ step desktop tasks is significantly less sample-efficient.
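To make the credit-assignment gap concrete, here is a hedged sketch (numbers and function names are illustrative, not GiGPO's actual implementation): trajectory-level GRPO assigns one group-normalized advantage to an entire rollout, while step-level grouping normalizes returns-to-go within each step across the group, so individual good and bad steps inside a long rollout get distinct signals.

```python
import statistics

# Hypothetical returns-to-go for 3 rollouts of a 4-step task.
returns_to_go = [
    [4.0, 3.0, 2.0, 1.0],  # rollout A
    [2.0, 2.0, 1.0, 0.0],  # rollout B
    [3.0, 1.0, 1.0, 1.0],  # rollout C
]

def trajectory_advantages(rtg):
    """One advantage per rollout: total return normalized over the group."""
    totals = [r[0] for r in rtg]
    mu, sigma = statistics.mean(totals), statistics.pstdev(totals) or 1.0
    return [(t - mu) / sigma for t in totals]

def step_advantages(rtg):
    """One advantage per (step, rollout): normalize within each step's group."""
    out = []
    for step in range(len(rtg[0])):
        col = [r[step] for r in rtg]
        mu, sigma = statistics.mean(col), statistics.pstdev(col) or 1.0
        out.append([(v - mu) / sigma for v in col])
    return out

adv_traj = trajectory_advantages(returns_to_go)  # 3 values, one per rollout
adv_step = step_advantages(returns_to_go)        # 4 x 3 values, one per step
```

With only `adv_traj`, every step of rollout B is penalized equally; with `adv_step`, the step where B matched the group average gets a near-zero advantage, which is the sample-efficiency win on 15+ step tasks.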
Related
- PR #84 (feat: add VAGEN/verl-agent environment adapter for VLM RL training): verl-agent spike (WAADesktopEnv adapter)
- openadapt_ml/training/grpo/trainer.py: inline comment tracking this issue