RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.
We strongly believe in the future of RL + LLM + Agents. The release is a minimally viable leap forward.
Reinforcement Learning (RL) with rule-based rewards has shown promise in enhancing reasoning capabilities of large language models (LLMs). However, existing approaches have primarily focused on static, single-turn tasks like math reasoning and coding. Extending these methods to agent scenarios introduces two fundamental challenges:
- Multi-turn Interactions: Agents must perform sequential decision-making and react to environment feedback
- Stochastic Environments: Uncertainty where identical actions can lead to different outcomes
RAGEN addresses these challenges through:
- A Markov Decision Process (MDP) formulation for agent tasks
- Reasoning-Interaction Chain Optimization (RICO), an algorithm that optimizes entire trajectory distributions
- Progressive reward normalization strategies to handle diverse, complex environments
The Reasoning-Interaction Chain Optimization (RICO) framework interleaves two stages: a rollout stage and an update stage. The LLM iteratively generates reasoning-guided actions to interact with the environment and obtain trajectory-level rewards, which are then normalized for the LLM update so that reasoning and action strategies are optimized jointly.
RAGEN introduces a reinforcement learning framework to train reasoning-capable LLM agents that can operate in interactive, stochastic environments. The framework consists of two key components:
We formulate agent-environment interactions as Markov Decision Processes (MDPs) in which states and actions are token sequences, allowing LLMs to reason over environment dynamics. At time t, the state s_t transitions to the next state s_{t+1} through the action a_t according to the environment's dynamics.
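To make this formulation concrete, the sketch below shows one way a text-based MDP interface could look. `TextEnv`, its methods, and the grid strings are illustrative assumptions, not RAGEN's actual API.

```python
# Minimal sketch (assumed interface, not RAGEN's actual API): states and
# actions are plain token sequences (here, strings), and the environment
# exposes a gym-like reset()/step() loop for the LLM agent.

class TextEnv:
    """Hypothetical text-rendered environment (e.g., a Sokoban-like grid)."""

    def reset(self) -> str:
        # s_0: the initial observation, serialized as text for the LLM
        return "#####\n#P__#\n#_O_#\n#####"

    def step(self, action: str):
        # Apply the token-sequence action a_t; return s_{t+1}, reward, done
        next_state = "#####\n#_P_#\n#_O_#\n#####"
        reward, done = 0.0, False
        return next_state, reward, done


env = TextEnv()
state = env.reset()  # s_t is a token sequence
action = "<think>move right toward the box</think><ans>Right</ans>"  # a_t is too
state, reward, done = env.step(action)
```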
RICO enables LLMs to jointly optimize reasoning and action strategies over entire trajectories. The algorithm alternates between two phases:
Given an initial state, the LLM generates multiple trajectories. At each step, the model receives the trajectory history and generates a reasoning-guided action: `<think>...</think><ans> action </ans>`. The environment receives the action and returns feedback (a reward and the next state).
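As a concrete illustration of this rollout format, here is a small sketch of how a reasoning-guided action could be parsed before being passed to the environment. The function name and regular expression are our own choices for illustration, not necessarily what RAGEN uses.

```python
import re

def parse_action(llm_output: str):
    """Extract the action from a '<think>...</think><ans> action </ans>' reply.

    Returns None when the reply does not follow the expected format, which a
    rollout loop might treat as an invalid action.
    """
    match = re.search(r"<ans>(.*?)</ans>", llm_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None

reply = "<think>The box is below me, so I should push it down.</think><ans>Down</ans>"
print(parse_action(reply))  # -> "Down"
```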
After generating trajectories, we train LLMs to optimize expected rewards. Instead of step-by-step optimization, RICO optimizes entire trajectories using importance sampling. This approach enables long-horizon reasoning while maintaining computational efficiency.
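The snippet below sketches what such a trajectory-level update might look like in PyTorch, using a PPO-style clipped importance-sampling surrogate. The function, tensor shapes, and clipping constant are illustrative assumptions, not the exact RICO objective.

```python
# Illustrative sketch of a trajectory-level, importance-sampled policy update
# (assumed form, not the official RICO implementation).
import torch

def trajectory_loss(logp_new, logp_old, reward, clip_eps=0.2):
    """logp_new, logp_old: (T,) token log-probs of one trajectory under the
    current and rollout policies; reward: scalar trajectory-level return,
    already normalized (see the strategies below)."""
    ratio = torch.exp(logp_new - logp_old)                         # importance ratio
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * reward
    return -torch.min(unclipped, clipped).mean()                   # maximize expected reward

# Usage with dummy tensors
T = 16
logp_old = torch.randn(T)
logp_new = (logp_old + 0.1 * torch.randn(T)).requires_grad_()
loss = trajectory_loss(logp_new, logp_old, reward=torch.tensor(1.5))
loss.backward()
```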
We implement three progressive normalization strategies to stabilize training:
- ARPO: Preserves raw rewards directly
- BRPO: Normalizes rewards across each training batch using batch statistics
- GRPO: Normalizes within prompt groups to balance learning across varying task difficulties
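To make the three strategies concrete, the NumPy sketch below shows how such normalizations could be computed over a batch of trajectory rewards; the exact statistics and epsilon handling used in RAGEN may differ.

```python
# Illustrative sketch of the three reward-normalization strategies above
# (assumed formulas; RAGEN's implementation details may differ).
import numpy as np

def arpo(rewards, groups=None):
    # ARPO: preserve raw trajectory rewards
    return rewards

def brpo(rewards, groups=None):
    # BRPO: normalize with batch-level mean and std
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo(rewards, groups):
    # GRPO: normalize within each prompt group
    out = np.empty_like(rewards, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        out[mask] = (rewards[mask] - rewards[mask].mean()) / (rewards[mask].std() + 1e-8)
    return out

rewards = np.array([1.0, 0.0, 2.0, 0.5])
groups = np.array([0, 0, 1, 1])  # trajectories 0-1 share one prompt, 2-3 another
print(brpo(rewards), grpo(rewards, groups))
```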
For detailed setup instructions, please check our documentation. Here's a quick start guide:
```bash
# Setup environment and download data (7MB)
bash scripts/setup_ragen.sh
python scripts/download_data.py
```
If this fails, you can follow the manual setup instructions in `scripts/setup_ragen.md`.
Here's how to train models with RAGEN:
We provide 10k first-round observations for both the Sokoban and FrozenLake tasks.
```bash
# Basic data creation
bash scripts/create_data.sh

# Or, for research purposes, create more comprehensive data
bash scripts/create_data_full.sh
```
We provide a default configuration in `verl/trainer/config/ppo_trainer.yaml`. To train:
```bash
bash train.sh sokoban \
    model.experiment_name=new_test

# Override config parameters as needed
bash train.sh sokoban \
    model.experiment_name=new_test_debug \
    training.train_batch_size=128 \
    training.ppo_batch_size=64
```
For supervised finetuning with LoRA:
- Create supervised finetuning data:
  ```bash
  bash sft/generate_data.sh <env_type>
  ```
- Finetune the model:
  ```bash
  bash sft/finetune_lora.sh <env_type> <num_gpus> <save_path>
  ```
- Merge LoRA weights with the base model:
  ```bash
  python sft/utils/merge_lora.py \
      --base_model_name <base_model_name> \
      --lora_model_path <lora_model_path> \
      --output_path <output_path>
  ```
To visualize agent trajectories:
- Set visualization parameters in `train.sh`:
  ```bash
  logging.log_images=True
  logging.log_image_dir=log/trajectory
  logging.log_image_step_size=4
  logging.log_n_image_per_batch=32
  ```
- View the visualizations:
  ```bash
  cd log/trajectory
  python -m http.server 8000
  # Access at http://localhost:8000/[EXP_NAME]/step_[STEP_NUM]/trajectory_data_[ID].html
  ```
- For proper font rendering:
  ```bash
  sudo apt-get install fonts-noto-cjk
  ```
- Download visualization data from wandb:
  ```python
  from ragen.utils.wandb import download_wandb

  download_wandb("RUN_ID")  # e.g., 9o465jqj
  ```
We evaluate RAGEN across multiple model sizes and configurations. Below are results from our Sokoban experiments using Qwen-2.5-{0.5B, 3B}-{Instruct, None} and DeepSeek-R1-Distill-Qwen-1.5B.
Key observations:
- Instruct-finetuned models show early advantages but the gap narrows as training progresses
- Larger models (3B) generally outperform smaller models (0.5B), though the advantage is not dramatic
- The R1-distilled 1.5B model initially underperforms compared to 0.5B models
- Training has not yet converged in these experiments
Our analysis reveals two key aspects of LLM agent training with RL:
- Prompt diversity: Balancing observation variety and effective response comparison
- Online rollout frequency: Mediating between training stability and data recency
Visualization of agent reasoning on the Sokoban task:
The visualizations show how the agent reasons through sequential steps to solve the puzzle.
We provide several case studies showing the model's behavior:
More case studies will be added to showcase both successful reasoning patterns and failure modes.
We welcome all forms of feedback! Please raise an issue for bugs, questions, or suggestions. This helps our team address common problems efficiently and builds a more productive community.
Zihan Wang*, Kangrui Wang*, Qineng Wang*, Pingyue Zhang*, Linjie Li*, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
*: Equal contribution.
We thank DeepSeek for providing the DeepSeek-R1 model and ideas. We thank the veRL team for their infrastructure. We thank the TinyZero team for their discoveries that inspired our early exploration. We thank Han Liu, Xinyu Xing, Li Erran Li, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Weiyi Liu for insightful discussions.
```bibtex
@misc{RAGEN,
  author       = {Zihan Wang* and Kangrui Wang* and Qineng Wang* and Pingyue Zhang* and Linjie Li* and Zhengyuan Yang and Kefan Yu and Minh Nhat Nguyen and Monica Lam and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},
  title        = {Training Agents by Reinforcing Reasoning},
  year         = {2025},
  organization = {GitHub},
  url          = {https://github.com/ZihanWang314/ragen},
}
```