
FlowRL

Matching Reward Distributions via Flow Balance

📄 arXiv Paper | 🤗 #1 Paper of the Day

𝕏 Post 1 | 𝕏 Post 2 | 𝕏 Post 3 | 𝕏 Post 4

(Figure: FlowRL overview)


FlowRL Objective

$$ \mathcal{L}_{\text{FlowRL}} = w \cdot \left( \log Z_{\phi}(x) + \frac{1}{|y|} \log \pi_{\theta}(y \mid x) - \beta \hat{r}(x, y) - \frac{1}{|y|} \log \pi_{\text{ref}}(y \mid x) \right)^2 $$

FlowRL is a flow-balanced reinforcement learning method that matches the full reward distribution instead of maximizing reward, promoting diverse exploration and generalizable reasoning trajectories in LLMs. In the objective above, $Z_{\phi}(x)$ is a learned partition function, $\pi_{\theta}$ is the policy being trained, $\pi_{\text{ref}}$ is the reference policy, $\hat{r}(x, y)$ is the reward, $\beta$ is a reward-scaling coefficient, $1/|y|$ applies length normalization, and $w$ is a weighting coefficient.
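
Below is a minimal PyTorch sketch of this objective, not the repository's actual implementation; the function name, tensor shapes, and the per-sequence scalar reward are illustrative assumptions:

import torch

def flowrl_loss(log_z, logp_theta, logp_ref, reward, resp_len, w, beta):
    """Squared flow-balance residual from the objective above.

    log_z:      (B,) learned log-partition estimates log Z_phi(x)
    logp_theta: (B,) summed token log-probs log pi_theta(y | x)
    logp_ref:   (B,) summed token log-probs log pi_ref(y | x)
    reward:     (B,) scalar rewards r_hat(x, y)
    resp_len:   (B,) response lengths |y| (float), for length normalization
    w:          scalar or (B,) weighting coefficient
    beta:       reward-scaling coefficient
    """
    residual = (
        log_z
        + logp_theta / resp_len  # (1/|y|) log pi_theta(y | x)
        - beta * reward
        - logp_ref / resp_len    # (1/|y|) log pi_ref(y | x)
    )
    return (w * residual.pow(2)).mean()

# Toy usage with a batch of 2 sequences (all values illustrative):
loss = flowrl_loss(
    log_z=torch.zeros(2),
    logp_theta=torch.tensor([-40.0, -55.0]),
    logp_ref=torch.tensor([-42.0, -50.0]),
    reward=torch.tensor([1.0, 0.0]),
    resp_len=torch.tensor([64.0, 80.0]),
    w=1.0,
    beta=1.0,
)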

Trained Models & Experiment Logs

| Base Model | Domain | WandB Logs | Hugging Face Model |
| --- | --- | --- | --- |
| Qwen2.5-7B | Math | 🔗 View Run | 🤗 Model |
| DeepSeek-7B | Code | 🔗 View Run | 🤗 Model |
| Qwen2.5-32B | Math | - | 🤗 Model |

Quick Start

There are three ways to use FlowRL:

1. Original paper reproduction with verl 0.4.0 (recommended)
2. FlowRL recipe on the latest verl
3. Implementing FlowRL in your own codebase

⭐ We recommend Option 1 as the default choice. Since verl is updated frequently, the newest versions may introduce instabilities such as training/inference mismatches. Option 1 uses verl 0.4.0, which is stable, thoroughly tested, and reproduces the results in our paper.


Option 1: Original Paper Reproduction (verl 0.4.0) ⭐ Recommended

For exact reproduction of results from the paper, use the original repository with verl 0.4.0:

👉 Original Code: https://github.com/Xuekai-Zhu/FlowRL

Step 1: Installation

Install verl 0.4.0 first; the FlowRL scripts below run on top of it.

Step 2: Data Preparation

# Option A: Download our pre-processed datasets directly
bash preprocess/down_load_dataset.sh
# Move data to default directory
mv data/xuekai/flowrl-data-collection/math_data data/math_data
mv data/xuekai/flowrl-data-collection/code_data data/code_data
# Option B: Process data from original sources
# For detailed processing instructions, see data/README.md

Step 3: Model Preparation

For math tasks: Qwen/Qwen2.5-7B (default in the script) or Qwen/Qwen2.5-32B

For code tasks: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

# Download default model (Qwen2.5-7B for math)
bash preprocess/down_load_model.sh

# For other models, modify MODEL_NAME in the script before running
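
If you prefer not to edit the script, a sketch of one alternative is to pull a checkpoint directly with the huggingface_hub package (assuming it is installed; the local_dir path below is illustrative, not the script's default):

from huggingface_hub import snapshot_download

# Download the Qwen2.5-7B checkpoint; point local_dir wherever your
# training script expects the base model to live.
snapshot_download(repo_id="Qwen/Qwen2.5-7B", local_dir="models/Qwen2.5-7B")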

Step 4: Training Scripts

cd verl_FlowRL

# For 7B math training
bash command/training/math/flowrl_7B_math.sh

# For 32B math training
bash command/training/math/flowrl_32B_math.sh

# For 7B code training
bash command/training/code/flowrl_7B_code.sh

Option 2: FlowRL Recipe on the Latest verl

To run FlowRL on the latest verl framework (https://github.com/volcengine/verl):

Step 1: Prepare Data and Model

# Prepare dataset
bash recipe/flowrl/prepare/prepare_data.sh

# Prepare model
bash recipe/flowrl/prepare/prepare_model.sh

Step 2: Run Training

# Train FlowRL with Qwen2.5-7B
bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh

Option 3: Implement FlowRL Yourself

If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:

📖 FlowRL Implementation Guide

This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
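
To give a flavor of what the guide covers, here is a minimal, hypothetical sketch of one key component: the learned partition function $\log Z_{\phi}(x)$ as a small MLP head over pooled prompt hidden states. The class name, pooling choice, and layer sizes are illustrative assumptions, not the repository's actual code:

import torch
import torch.nn as nn

class LogZHead(nn.Module):
    """Hypothetical sketch of the learned partition function log Z_phi(x):
    a small MLP mapping a pooled prompt representation to one scalar."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, prompt_hidden: torch.Tensor) -> torch.Tensor:
        # prompt_hidden: (B, T, H) hidden states of the prompt tokens.
        pooled = prompt_hidden.mean(dim=1)   # (B, H) mean-pool over tokens
        return self.mlp(pooled).squeeze(-1)  # (B,) estimates of log Z_phi(x)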

Testing

After training your FlowRL models, you can evaluate them using the following commands:

cd verl_Test

# First merge the model
bash command/eval/merge_model.sh

# For math testing
bash command/eval/math/flowrl_math_test.sh

# For code testing
bash command/eval/code/flowrl_code_test.sh

Reference: for the verl v0.5.0.dev model-merge script, see merge_model.sh

Citation

If you find this repo helpful, please consider citing our paper:

@article{zhu2025flowrl,
  title={FlowRL: Matching Reward Distributions for LLM Reasoning},
  author={Zhu, Xuekai and Cheng, Daixuan and Zhang, Dinghuai and Li, Hengli and Zhang, Kaiyan and Jiang, Che and Sun, Youbang and Hua, Ermo and Zuo, Yuxin and Lv, Xingtai and others},
  journal={arXiv preprint arXiv:2509.15207},
  year={2025}
}
