RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
Reinforcement Learning (RL) with rule-based rewards has shown promise in enhancing reasoning capabilities of large language models (LLMs). However, existing approaches have primarily focused on static, single-turn tasks like math reasoning and coding. Extending these methods to agent scenarios introduces two fundamental challenges:
- Multi-turn Interactions: Agents must perform sequential decision-making and react to environment feedback
- Stochastic Environments: Uncertainty where identical actions can lead to different outcomes
RAGEN addresses these challenges through:
- A Markov Decision Process (MDP) formulation for agent tasks
- Reasoning-Interaction Chain Optimization (RICO), an algorithm that optimizes entire trajectory distributions
- Progressive reward normalization strategies to handle diverse, complex environments
The Reasoning-Interaction Chain Optimization (RICO) framework interleaves two stages: a rollout stage and an update stage. The LLM iteratively generates reasoning-guided actions to interact with the environment and obtain trajectory-level rewards, which are then normalized for the LLM update so that reasoning and action strategies are optimized jointly.
RAGEN introduces a reinforcement learning framework to train reasoning-capable LLM agents that can operate in interactive, stochastic environments. The framework consists of two key components:
We formulate agent-environment interactions as Markov Decision Processes (MDPs) in which states and actions are token sequences, allowing LLMs to reason over environment dynamics. At time t, the state s_t transitions to the next state s_{t+1} through the action a_t according to the environment's dynamics.
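To make this formulation concrete, the sketch below shows one way a text-based MDP interface could look. `TextEnv`, its methods, and the grid strings are illustrative assumptions, not RAGEN's actual API.

```python
# Minimal sketch (assumed interface, not RAGEN's actual API): states and
# actions are plain token sequences (here, strings), and the environment
# exposes a gym-like reset()/step() loop for the LLM agent.

class TextEnv:
    """Hypothetical text-rendered environment (e.g., a Sokoban-like grid)."""

    def reset(self) -> str:
        # s_0: the initial observation, serialized as text for the LLM
        return "#####\n#P__#\n#_O_#\n#####"

    def step(self, action: str):
        # Apply the token-sequence action a_t; return s_{t+1}, reward, done
        next_state = "#####\n#_P_#\n#_O_#\n#####"
        reward, done = 0.0, False
        return next_state, reward, done


env = TextEnv()
state = env.reset()  # s_t is a token sequence
action = "<think>move right toward the box</think><ans>Right</ans>"  # a_t is too
state, reward, done = env.step(action)
```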
RICO enables LLMs to jointly optimize reasoning and action strategies over entire trajectories. The algorithm alternates between two phases:
Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action: `<think>...</think><ans> action </ans>`. The environment receives the action and returns feedback (a reward and the next state).
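As a concrete illustration of this rollout format, here is a small sketch of how a reasoning-guided action could be parsed before being passed to the environment. The function name and regular expression are our own choices for illustration, not necessarily what RAGEN uses.

```python
import re

def parse_action(llm_output: str):
    """Extract the action from a '<think>...</think><ans> action </ans>' reply.

    Returns None when the reply does not follow the expected format, which a
    rollout loop might treat as an invalid action.
    """
    match = re.search(r"<ans>(.*?)</ans>", llm_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None

reply = "<think>The box is below me, so I should push it down.</think><ans>Down</ans>"
print(parse_action(reply))  # -> "Down"
```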
After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, RICO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency.
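The snippet below sketches what such a trajectory-level update might look like in PyTorch, using a PPO-style clipped importance-sampling surrogate. The function, tensor shapes, and clipping constant are illustrative assumptions, not the exact RICO objective.

```python
# Illustrative sketch of a trajectory-level, importance-sampled policy update
# (assumed form, not the official RICO implementation).
import torch

def trajectory_loss(logp_new, logp_old, reward, clip_eps=0.2):
    """logp_new, logp_old: (T,) token log-probs of one trajectory under the
    current and rollout policies; reward: scalar trajectory-level return,
    already normalized (see the strategies below)."""
    ratio = torch.exp(logp_new - logp_old)                         # importance ratio
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * reward
    return -torch.min(unclipped, clipped).mean()                   # maximize expected reward

# Usage with dummy tensors
T = 16
logp_old = torch.randn(T)
logp_new = (logp_old + 0.1 * torch.randn(T)).requires_grad_()
loss = trajectory_loss(logp_new, logp_old, reward=torch.tensor(1.5))
loss.backward()
```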
We implement three progressive normalization strategies to stabilize training:
- ARPO: Preserves raw rewards directly
- BRPO: Normalizes rewards across each training batch using batch statistics
- GRPO: Normalizes within prompt groups to balance learning across varying task difficulties
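To make the three strategies concrete, the NumPy sketch below shows how such normalizations could be computed over a batch of trajectory rewards; the exact statistics and epsilon handling used in RAGEN may differ.

```python
# Illustrative sketch of the three reward-normalization strategies above
# (assumed formulas; RAGEN's implementation details may differ).
import numpy as np

def arpo(rewards, groups=None):
    # ARPO: preserve raw trajectory rewards
    return rewards

def brpo(rewards, groups=None):
    # BRPO: normalize with batch-level mean and std
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo(rewards, groups):
    # GRPO: normalize within each prompt group
    out = np.empty_like(rewards, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = (rewards[mask] - rewards[mask].mean()) / (rewards[mask].std() + 1e-8)
    return out

rewards = np.array([1.0, 0.0, 2.0, 0.5])
groups = np.array([0, 0, 1, 1])  # trajectories 0-1 share one prompt, 2-3 another
print(brpo(rewards), grpo(rewards, groups))
```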
For detailed setup instructions, please check our documentation. Here's a quick start guide:
```bash
# Setup environment and download data (7MB)
bash scripts/setup_ragen.sh
python scripts/download_data.py
```
If this fails, you can follow the manual setup instructions in `scripts/setup_ragen.md`.
Here's how to train models with RAGEN:
We provide 10k first-round observations for both the Sokoban and FrozenLake tasks.
```bash
# Basic data creation
bash scripts/create_data.sh

# Or, for research purposes, create more comprehensive data
bash scripts/create_data_full.sh
```
We provide a default configuration in `verl/trainer/config/ppo_trainer.yaml`. To train:
```bash
bash train.sh sokoban \
    model.experiment_name=new_test

# Override config parameters as needed
bash train.sh sokoban \
    model.experiment_name=new_test_debug \
    training.train_batch_size=128 \
    training.ppo_batch_size=64
```
For supervised finetuning with LoRA:
- Create supervised finetuning data:
  ```bash
  bash sft/generate_data.sh <env_type>
  ```
- Finetune the model:
  ```bash
  bash sft/finetune_lora.sh <env_type> <num_gpus> <save_path>
  ```
- Merge LoRA weights with the base model:
  ```bash
  python sft/utils/merge_lora.py \
      --base_model_name <base_model_name> \
      --lora_model_path <lora_model_path> \
      --output_path <output_path>
  ```
To visualize agent trajectories:
- Set visualization parameters in `train.sh`:
  ```bash
  logging.log_images=True
  logging.log_image_dir=log/trajectory
  logging.log_image_step_size=4
  logging.log_n_image_per_batch=32
  ```
- View the visualizations:
  ```bash
  cd log/trajectory
  python -m http.server 8000
  # Access at http://localhost:8000/[EXP_NAME]/step_[STEP_NUM]/trajectory_data_[ID].html
  ```
- For proper font rendering:
  ```bash
  sudo apt-get install fonts-noto-cjk
  ```
- Download visualization data from wandb:
  ```python
  from ragen.utils.wandb import download_wandb

  download_wandb("RUN_ID")  # e.g., 9o465jqj
  ```
We evaluate RAGEN across multiple model sizes and configurations. Below are results from our Sokoban experiments using Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B.
Key observations:
- Instruct-finetuned models show early advantages but the gap narrows as training progresses
- Larger models (3B) generally outperform smaller models (0.5B), though the advantage is not dramatic
- The R1-distilled 1.5B model initially underperforms compared to 0.5B models
- Training has not yet converged in these experiments
Our analysis reveals two key aspects of LLM agent training with RL:
- Prompt diversity: Balancing observation variety and effective response comparison
- Online rollout frequency: Mediating between training stability and data recency
Visualization of agent reasoning on the Sokoban task:
The visualizations show how the agent reasons through sequential steps to solve the puzzle.
We provide several case studies showing the model's behavior:
More case studies will be added to showcase both successful reasoning patterns and failure modes.
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*: Equal contribution.
We thank DeepSeek for providing the DeepSeek-R1 model and ideas. We thank the veRL team for their infrastructure. We thank the TinyZero team for their discoveries that inspired our early exploration. We thank Han Liu, Xinyu Xing, Li Erran Li, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Weiyi Liu for insightful discussions.
```bibtex
@misc{RAGEN,
  author       = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  title        = {Training Agents by Reinforcing Reasoning},
  year         = {2025},
  organization = {GitHub},
  url          = {https://github.com/ZihanWang314/ragen},
}
```