Matching Reward Distributions via Flow Balance
📄 arXiv Paper | 🤗 #1 Paper of the Day
𝕏 Post 1 | 𝕏 Post 2 | 𝕏 Post 3 | 𝕏 Post 4
FlowRL is a flow-balanced reinforcement learning method that matches full reward distributions instead of maximizing rewards, promoting diverse exploration and generalizable reasoning trajectories in LLMs.
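To make the distinction concrete, here is a toy, repo-independent illustration (the exact target in the paper involves a learned partition function and further terms; the beta value and numbers below are made up):

```python
# Toy illustration only (not code from this repo): reward maximization puts all
# probability mass on the single best response, while distribution matching
# spreads mass roughly in proportion to exp(beta * reward).
import torch

rewards = torch.tensor([1.0, 0.9, 0.2])        # rewards of 3 sampled responses
greedy = torch.zeros(3)
greedy[rewards.argmax()] = 1.0                  # reward maximization: [1, 0, 0]
beta = 1.0                                      # assumed temperature
target = torch.softmax(beta * rewards, dim=0)   # matching: ~[0.42, 0.38, 0.19]
print(greedy, target)
```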
| Base Model | Domain | WandB Logs | Hugging Face Model |
|---|---|---|---|
| Qwen2.5-7B | Math | 🔗 View Run | 🤗 Model |
| DeepSeek-7B | Code | 🔗 View Run | 🤗 Model |
| Qwen2.5-32B | Math | - | 🤗 Model |
There are three ways to use FlowRL:
⭐ We recommend Option 1 as the default choice. Since verl updates frequently, the newest versions may introduce instabilities such as training/inference mismatches. Option 1 uses verl 0.4.0, which is stable and has been thoroughly tested against the results reported in our paper.
For exact reproduction of results from the paper, use the original repository with verl 0.4.0:
👉 Original Code: https://github.com/Xuekai-Zhu/FlowRL
Install verl before using FlowRL.
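One way to set this up (a sketch, assuming you install verl from source at its v0.4.0 tag; follow verl's own installation docs for the full dependency stack such as vLLM and flash-attention):

```bash
# Sketch: pin verl to 0.4.0 from source (tag name and extra dependencies
# should be checked against verl's installation docs).
git clone https://github.com/volcengine/verl.git
cd verl
git checkout v0.4.0
pip install -e .
```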
```bash
# Option A: Download our pre-processed datasets directly
bash preprocess/down_load_dataset.sh

# Move data to the default directory
mv data/xuekai/flowrl-data-collection/math_data data/math_data
mv data/xuekai/flowrl-data-collection/code_data data/code_data
```

```bash
# Option B: Process data from original sources
# For detailed processing instructions, see data/README.md
```

For Math Tasks: Qwen/Qwen2.5-7B (default in script); Qwen/Qwen2.5-32B
For Code Tasks: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```bash
# Download the default model (Qwen2.5-7B for math)
bash preprocess/down_load_model.sh

# For other models, modify MODEL_NAME in the script before running
```
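For example, a hypothetical edit inside preprocess/down_load_model.sh could look like this (the script itself is the source of truth for how MODEL_NAME is used):

```bash
# Hypothetical edit in preprocess/down_load_model.sh: point MODEL_NAME at a
# different base model before running the script.
MODEL_NAME="Qwen/Qwen2.5-32B"                          # math, 32B
# MODEL_NAME="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # code
```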
```bash
cd verl_FlowRL

# For 7B math training
bash command/training/math/flowrl_7B_math.sh

# For 32B math training
bash command/training/math/flowrl_32B_math.sh

# For 7B code training
bash command/training/code/flowrl_7B_code.sh
```

For running FlowRL using the latest verl framework:
```bash
# Prepare dataset
bash recipe/flowrl/prepare/prepare_data.sh

# Prepare model
bash recipe/flowrl/prepare/prepare_model.sh
```

```bash
# Train FlowRL with Qwen2.5-7B
bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh
```

If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:
This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
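As a starting point, the sketch below shows a trajectory-balance-style loss in the spirit of FlowRL. It is a simplified illustration, not the repository's implementation: the tensor names, the beta temperature, and the way log Z is estimated are assumptions, and the actual objective in the paper includes further details such as length normalization and importance weighting.

```python
# Simplified sketch (PyTorch) of a flow-balance objective in the spirit of
# FlowRL. Shapes, names, and hyperparameters are illustrative assumptions;
# see the paper and implementation guide for the exact formulation.
import torch


def flowrl_loss(
    logp_policy: torch.Tensor,  # (B,) sum of log pi_theta(y|x) over response tokens
    logp_ref: torch.Tensor,     # (B,) sum of log pi_ref(y|x) under a frozen reference model
    reward: torch.Tensor,       # (B,) scalar reward r(x, y) per sampled trajectory
    log_z: torch.Tensor,        # (B,) learned estimate of the log partition function log Z(x)
    beta: float = 1.0,          # reward temperature (assumed hyperparameter)
) -> torch.Tensor:
    # Flow balance: push log Z(x) + log pi_theta(y|x) toward
    # beta * r(x, y) + log pi_ref(y|x), so the policy matches a distribution
    # proportional to pi_ref * exp(beta * r) instead of collapsing onto the
    # single highest-reward trajectory.
    residual = log_z + logp_policy - beta * reward - logp_ref
    return (residual ** 2).mean()


# Tiny usage example with random placeholders for a batch of 4 trajectories.
if __name__ == "__main__":
    B = 4
    loss = flowrl_loss(
        logp_policy=torch.randn(B),
        logp_ref=torch.randn(B),
        reward=torch.rand(B),
        log_z=torch.zeros(B, requires_grad=True),
    )
    loss.backward()
    print(loss.item())
```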
After training your FlowRL models, you can evaluate them using the following commands:
```bash
cd verl_Test

# First, merge the trained model
bash command/eval/merge_model.sh

# For math testing
bash command/eval/math/flowrl_math_test.sh

# For code testing
bash command/eval/code/flowrl_code_test.sh
```

Reference: for the verl v0.5.0.dev merge model script, see merge_model.sh
If you find this repo helpful, please consider citing our paper:
```bibtex
@article{zhu2025flowrl,
  title={FlowRL: Matching Reward Distributions for LLM Reasoning},
  author={Zhu, Xuekai and Cheng, Daixuan and Zhang, Dinghuai and Li, Hengli and Zhang, Kaiyan and Jiang, Che and Sun, Youbang and Hua, Ermo and Zuo, Yuxin and Lv, Xingtai and others},
  journal={arXiv preprint arXiv:2509.15207},
  year={2025}
}
```