monitor: track TRL #5120 for future migration from verl-agent #85

@abrichr

Context

We use verl-agent/VAGEN for multi-turn VLM GRPO training because TRL (Hugging Face) cannot handle multi-turn VLM rollouts: chat template flattening destroys multimodal data before it reaches rollout_func.

Decision doc: docs/verl_agent_decision.md (PR #84)

Upstream Dependency

When to Revisit

Check quarterly (June, September, and December 2026). If any of the following is true:

  1. TRL #5120 is resolved or has a merged fix
  2. TRL's GRPOTrainer passes multi-turn VLM E2E tests
  3. TRL release notes announce multi-turn VLM GRPO support

Then:

  • Test TRL against our WAA RL environment (RLEnvironment / WAADesktopEnv)
  • Benchmark: verl-agent vs TRL on same task (wall time, VRAM, convergence)
  • If TRL matches verl-agent AND adds per-step credit assignment (GiGPO equivalent), consider switching
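The revisit rule above can be written down as a small check, which may help keep the quarterly review consistent. This is only a sketch: the class and function names, the choice of "steps to converge" as the convergence metric, and the 10% tolerance for "matches" are all hypothetical and not taken from the decision doc.

```python
from dataclasses import dataclass


@dataclass
class UpstreamStatus:
    issue_resolved: bool          # TRL #5120 resolved or has a merged fix
    e2e_tests_pass: bool          # GRPOTrainer passes multi-turn VLM E2E tests
    release_notes_announce: bool  # release notes announce multi-turn VLM GRPO


@dataclass
class BenchmarkResult:
    wall_time_s: float      # wall time on the shared task
    peak_vram_gb: float     # peak VRAM
    steps_to_converge: int  # assumed proxy for "convergence"


def should_evaluate(status: UpstreamStatus) -> bool:
    """Any single trigger is enough to start the evaluation."""
    return any(
        [status.issue_resolved, status.e2e_tests_pass, status.release_notes_announce]
    )


def matches(candidate: BenchmarkResult, baseline: BenchmarkResult,
            tolerance: float = 0.10) -> bool:
    """True if candidate is within `tolerance` of baseline on every metric.

    All metrics are lower-is-better. The 10% default is an assumption.
    """
    limit = 1.0 + tolerance
    return (
        candidate.wall_time_s <= baseline.wall_time_s * limit
        and candidate.peak_vram_gb <= baseline.peak_vram_gb * limit
        and candidate.steps_to_converge <= baseline.steps_to_converge * limit
    )


def consider_switching(status: UpstreamStatus, trl: BenchmarkResult,
                       verl_agent: BenchmarkResult,
                       trl_has_step_credit: bool) -> bool:
    """Switch only if a trigger fired, TRL matches, AND it has step-level credit."""
    return (
        should_evaluate(status)
        and matches(trl, verl_agent)
        and trl_has_step_credit
    )
```

Note that parity on the benchmark alone is not sufficient here: `trl_has_step_credit` is a hard requirement, mirroring the last bullet.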

Why We'd Want to Switch

verl-agent works well but adds Ray/vLLM complexity, whereas TRL has broader adoption and simpler deployment, so switching would reduce our dependency footprint. The switch only makes sense if TRL also adds per-step credit assignment: without GiGPO-equivalent step-level advantages, training on desktop tasks of 15+ steps is significantly less sample-efficient.
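To illustrate why step-level advantages matter, here is a minimal sketch contrasting episode-level group-relative advantages (one scalar per trajectory, shared by every step) with a simplified step-level variant that groups steps reaching the same state across trajectories. It is loosely inspired by the idea behind GiGPO and is not verl-agent's implementation; the function names, the `state_key` representation, and the use of return-to-go are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def episode_advantages(returns, eps=1e-6):
    """One advantage per trajectory, normalized within the rollout group.

    Every step of a trajectory shares this single value, so a 15-step
    episode gets no signal about which step helped or hurt.
    """
    mu, sigma = mean(returns), pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def step_advantages(trajectories, eps=1e-6):
    """Simplified step-level advantages.

    `trajectories` is a list of trajectories; each trajectory is a list of
    (state_key, return_to_go) pairs. Steps that share a state_key across
    trajectories form a group, and return-to-go is normalized within it.
    States seen only once get an advantage of 0.
    """
    groups = defaultdict(list)
    for ti, traj in enumerate(trajectories):
        for si, (state_key, ret_to_go) in enumerate(traj):
            groups[state_key].append((ti, si, ret_to_go))

    adv = [[0.0] * len(traj) for traj in trajectories]
    for members in groups.values():
        values = [v for _, _, v in members]
        mu, sigma = mean(values), pstdev(values)
        for ti, si, v in members:
            adv[ti][si] = (v - mu) / (sigma + eps)
    return adv


# Two rollouts that start in the same state and then diverge.
rollouts = [
    [("s0", 1.0), ("s1", 1.0)],  # succeeded
    [("s0", 0.0), ("s2", 0.0)],  # failed
]
print(episode_advantages([1.0, 0.0]))
print(step_advantages(rollouts))
```

In this toy case the step-level variant assigns a nonzero advantage only at the shared state `s0`, where the two rollouts actually made different choices, rather than spreading the same value over every step.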

Related
