Imitation-Guided World Models for Multi-Agent Train Rescheduling

Max Bourgeat¹·², Antoine Legrain¹·²·³, Quentin Cappart¹·²·⁴

¹ Polytechnique Montréal, Montreal, Canada
² CIRRELT, Montreal, Canada
³ GERAD, Montreal, Canada
⁴ UCLouvain, Louvain-la-Neuve, Belgium

🏠 About

Managing railway disruptions is a complex multi-agent routing problem where a single train failure can propagate delays across the network. Traditional approaches rely on heuristic optimization solvers, which are effective but assume access to a global system view and require substantial expert design, limiting their applicability and generalization. Reinforcement learning (RL) offers an alternative by learning adaptive strategies from interactions with the environment. In this paper, we show that none of the available paradigms is sufficient in isolation: (i) heuristic solvers encode valuable global expertise but cannot be deployed directly, (ii) world models improve sample efficiency but struggle to leverage expert knowledge, and (iii) pure RL can adapt policies but often lacks stability without strong guidance. We propose a hybrid framework that integrates the strengths of these approaches. First, imitation learning transfers knowledge from a global expert solver to initialize a neural policy. Then, model-based RL fine-tunes this policy using the DreamerV2 world model to enhance generalization and responsiveness to local perturbations. Our method builds on the Multi-Agent Model-Based Architecture (MAMBA) to model agent interactions and addresses the challenge of transferring expertise from global solvers to decentralized agents operating on local latent observations. Experiments on a train rescheduling problem using the Flatland environment show that our method outperforms MAMBA, improving performance by up to 23% on difficult instances. This highlights the benefit of combining imitation learning with world-model-based multi-agent RL for complex transportation networks.

🔍 Overview

Framework

Instructions

This code accompanies the paper "Imitation-Guided World Models for Multi-Agent Train Rescheduling".

Installation

Python 3.9 is required.
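
One way to satisfy this is a dedicated virtual environment, assuming a python3.9 binary is on your PATH (the environment name below is arbitrary):

python3.9 -m venv .venv
source .venv/bin/activate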

pip install wheel
pip install flatland-2.2.2/
pip install -r requirements.txt 

Installing the solver

To install the solver, follow the procedure described here:
https://github.com/Jiaoyang-Li/Flatland
Once it is installed, move the libPythonCBS.so file into agent/solver/.
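
For example, assuming the compiled library ends up in the solver repository's build directory (the exact path depends on your build):

cp Flatland/build/libPythonCBS.so agent/solver/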

Usage

All configuration files are already set up to replicate the main results of the paper. To replicate the ablation studies, modify configs/dreamer/DreamerControllerConfig.py and configs/dreamer/DreamerLearnerConfig.py as follows (a sketch of the edited files appears after this list):

  • To switch from the solver expert to the random or greedy expert, replace self.EXPERT_TO_FOLLOW = "solver" with self.EXPERT_TO_FOLLOW = "random" or self.EXPERT_TO_FOLLOW = "greedy" in configs/dreamer/DreamerControllerConfig.py.
  • To remove the world model (DreamerV2), set self.ABLATION_WM to True in both configs/dreamer/DreamerControllerConfig.py and configs/dreamer/DreamerLearnerConfig.py.
  • To train in a pure imitation learning fashion, without reinforcement learning, set self.STEPS_SEQUENTIAL_RL_PHASE to 0 in configs/dreamer/DreamerLearnerConfig.py.
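
For illustration, the edited attributes might look like this. The class and attribute names come from this repo, but the surrounding code is abbreviated and the values shown are only examples:

# configs/dreamer/DreamerControllerConfig.py (excerpt, illustrative)
class DreamerControllerConfig:
    def __init__(self):
        ...
        self.EXPERT_TO_FOLLOW = "greedy"  # expert to imitate: "solver", "random", or "greedy"
        self.ABLATION_WM = False          # True removes the DreamerV2 world model

# configs/dreamer/DreamerLearnerConfig.py (excerpt, illustrative)
class DreamerLearnerConfig:
    def __init__(self):
        ...
        self.ABLATION_WM = False                 # keep in sync with DreamerControllerConfig
        self.STEPS_SEQUENTIAL_RL_PHASE = 50000   # 0 disables the RL phase (pure imitation learning)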

To train the model from scratch:

python3 train.py --n_workers 2 --env flatland --env_name 5_agents --time_limit <time_limit_in_seconds> --steps <steps_in_the_env>
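
For example, with two workers on the 5-agent instances (the time limit and step budget below are illustrative values, not the paper's settings):

python3 train.py --n_workers 2 --env flatland --env_name 5_agents --time_limit 36000 --steps 1000000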

In this repo but not in the paper

You'll find some options not mentioned in the paper; they correspond to methods we tried that did not perform well. These include the DAgger algorithm, which failed because of the expert solver's sensitivity to random actions: the solver does not perform well when we periodically follow a policy that differs from it. We also tried feeding the expert action, coupled with the observation, as an input to the world model. Finally, we tried a variant of DAgger that replaces the sequential approach (imitation learning, then reinforcement learning) with a hybrid approach that optimizes both the imitation and reinforcement losses at the same time, with a decreasing weight on the imitation loss; this, too, performed poorly. A sketch of this hybrid objective follows.
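
A minimal sketch of that hybrid objective, assuming discrete actions and a linearly decaying imitation weight (the function and variable names below are illustrative, not this repo's API):

import torch
import torch.nn.functional as F

def hybrid_loss(policy_logits, expert_actions, rl_loss, step, decay_steps=100_000):
    # Imitation term: cross-entropy between the policy's action logits
    # and the actions chosen by the expert.
    imitation_loss = F.cross_entropy(policy_logits, expert_actions)
    # Linearly anneal the imitation weight from 1 to 0 over decay_steps.
    w = max(0.0, 1.0 - step / decay_steps)
    # Both losses are optimized jointly; only the imitation term decays.
    return w * imitation_loss + rl_loss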

Code Structure

  • agent contains the implementation of MAMBA
    • controllers contains logic for inference
    • learners contains logic for training the agent
    • memory contains the buffer implementation
    • models contains the MAMBA architecture
    • optim contains logic for optimizing loss functions
    • runners contains logic for running multiple workers
    • solver contains logic for using the solver
    • utils contains helper functions
    • workers contains logic for interacting with the environment
  • env contains the environment logic
  • networks contains the neural network architectures

👏 Acknowledgements

  • MAMBA: Our codebase is built upon MAMBA.
  • MAPF Solver: Our codebase uses the winning solver of the Flatland challenge.
