Official implementation of our NeurIPS-W paper:
You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models
Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, and Golnoosh Samei
NeurIPS 2025 Workshop on Mathematical Reasoning and AI
In this work, we systematically investigate the performance of label-free RL methods across model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective label-free RL method that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples of predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models.
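For intuition, here is a minimal sketch of the two ingredients. This is our illustrative pseudocode, not the training code in this repo; the function names, the 0.5 majority threshold, and the linear curriculum schedule are assumptions:

```python
from collections import Counter

def majority_mask(rollout_answers, threshold=0.5):
    """Keep a prompt only if the most common extracted answer across its
    rollouts reaches `threshold`; no-majority prompts are masked out of
    the policy loss."""
    counts = Counter(a for a in rollout_answers if a is not None)
    if not counts:
        return False  # no parseable answer at all -> mask
    _, top_count = counts.most_common(1)[0]
    return top_count / len(rollout_answers) >= threshold

def curriculum_pool(samples_easy_to_hard, step, total_steps, start_frac=0.2):
    """Linearly widen the training pool from the easiest `start_frac`
    of problems to the full dataset as training progresses."""
    frac = min(1.0, start_frac + (1.0 - start_frac) * step / total_steps)
    return samples_easy_to_hard[: max(1, int(frac * len(samples_easy_to_hard)))]

# 8 rollouts for one prompt; answers extracted from each chain of thought
answers = ["42", "42", "41", "42", None, "42", "40", "42"]
print(majority_mask(answers))  # True: 5/8 rollouts agree on "42"
```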
The project is built upon Open-R1. Please follow the Open-R1 installation guidelines; we summarize the main steps below.
> [!CAUTION]
> Libraries rely on CUDA 12.4. If you see errors related to segmentation faults, double check the version your system is running with `nvcc --version`.
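A quick way to check:

```shell
nvcc --version | grep release   # expect something like "release 12.4"
```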
To run the code in this project, first create a Python virtual environment, e.g. with uv.
To install uv, follow the UV Installation Guide.
> [!NOTE]
> As a shortcut, run `make install` to set up the development libraries (spelled out below). Afterwards, if everything is set up correctly, you can try out the Open-R1 models.
```shell
uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --upgrade pip
```

> [!TIP]
> For Hugging Face cluster users, add `export UV_LINK_MODE=copy` to your `.bashrc` to suppress cache warnings from `uv`.
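For example:

```shell
echo 'export UV_LINK_MODE=copy' >> ~/.bashrc && source ~/.bashrc
```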
Next, install vLLM and FlashAttention (use Flash Attention v2.7.4.post1 to avoid ABI mismatches):
```shell
uv pip install vllm==0.8.4
uv pip install setuptools && uv pip install flash-attn==2.7.4.post1 --no-build-isolation
```

This will also install PyTorch v2.6.0, and it is very important to use this version, since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
```shell
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e ".[dev]"
```
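If you want to confirm that the pinned versions landed correctly (an optional sanity check):

```shell
python -c "import torch, vllm; print(torch.__version__, vllm.__version__)"
# expect: 2.6.0 0.8.4
```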
To run evaluations, also install lighteval:

```shell
pip install lighteval
```

Launching a training run takes two steps: serve rollouts with vLLM on one GPU, then run training on the remaining GPUs. For the Intuitor baseline:

```shell
export ACCELERATE_LOG_LEVEL=info
MODEL=Qwen/Qwen2.5-1.5B
RECIPE=recipes/Qwen2.5-1.5B/intuitor/config_demo.yaml
FILE_NAME=intuitor.py
DIST=fsdp
# Run vllm-serve in the background with nohup
nohup env CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model $MODEL > vllm-serve.log 2>&1 &
VLLM_PID=$!
echo "vLLM server started with PID: $VLLM_PID"
# Run accelerate launch in the background with nohup as well;
# $! only captures the PID of a background job
nohup env CUDA_VISIBLE_DEVICES=1,2,3 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/$DIST.yaml --num_processes=3 \
    src/open_r1/$FILE_NAME --config $RECIPE > train.log 2>&1 &
TRAINING_PID=$!
echo "Training process started with PID: $TRAINING_PID"
```
To train with our proposed method, swap in the corresponding recipe and entry point:

```shell
export ACCELERATE_LOG_LEVEL=info
MODEL=Qwen/Qwen2.5-1.5B
RECIPE=recipes/Qwen2.5-1.5B/cuma/config.yaml
FILE_NAME=cuma.py
DIST=fsdp
# Run vllm-serve in the background with nohup
nohup env CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model $MODEL > vllm-serve.log 2>&1 &
VLLM_PID=$!
echo "vLLM server started with PID: $VLLM_PID"
# Run accelerate launch in the background with nohup
nohup env CUDA_VISIBLE_DEVICES=1,2,3 ACCELERATE_LOG_LEVEL=info \
    accelerate launch --config_file recipes/accelerate_configs/$DIST.yaml --num_processes=3 \
    src/open_r1/$FILE_NAME --config $RECIPE > train.log 2>&1 &
TRAINING_PID=$!
echo "Training process started with PID: $TRAINING_PID"
```
Once training finishes, evaluate the resulting checkpoint with lighteval:

```shell
MODEL=TRAINED-CHECKPOINT-PATH
MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
OUTPUT_DIR=evals/$MODEL
TASK=math_500
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lighteval.__main__ vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
--use-chat-template \
--output-dir $OUTPUT_DIR
TASK=gsm8k
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lighteval.__main__ vllm $MODEL_ARGS "leaderboard|$TASK|5|0" \
--use-chat-template \
--output-dir $OUTPUT_DIR
TASK=aime24
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lighteval.__main__ vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
--use-chat-template \
--output-dir $OUTPUT_DIR
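# LiveCodeBench code generation (task name is passed inline)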
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lighteval.__main__ vllm $MODEL_ARGS "extended|lcb:codegeneration|0|0" \
--use-chat-template \
--output-dir $OUTPUT_DIR
TASK=gpqa:diamond
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m lighteval.__main__ vllm $MODEL_ARGS "lighteval|$TASK|0|0" \
--use-chat-template \
--output-dir $OUTPUT_DIR
```
We provide a script for curating additional training data at a predefined difficulty, which we found important for good performance; see `data_curation.py`.
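The actual pipeline lives in `data_curation.py`; as a rough illustration of the idea (our own simplification, not the script's API; names and thresholds are assumptions), difficulty can be estimated from the base model's empirical solve rate over several rollouts, keeping only problems inside a target band:

```python
def solve_rate(sampled_answers, reference):
    """Fraction of the base model's sampled answers matching the
    reference; a lower rate indicates a harder problem."""
    return sum(a == reference for a in sampled_answers) / len(sampled_answers)

def filter_by_difficulty(samples, lo=0.2, hi=0.8):
    """Keep problems that are neither trivial nor hopeless for the
    base model, i.e. with a solve rate within [lo, hi]."""
    return [s for s in samples if lo <= s["solve_rate"] <= hi]

pool = [
    {"problem": "2 + 2 = ?",              "solve_rate": 1.0},  # too easy
    {"problem": "mid-level algebra",      "solve_rate": 0.5},  # kept
    {"problem": "olympiad combinatorics", "solve_rate": 0.0},  # too hard
]
print([s["problem"] for s in filter_by_difficulty(pool)])
# ['mid-level algebra']
```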
We thank the authors of Intuitor for releasing their code.
```bibtex
@inproceedings{roy2025,
  title={You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models},
  author={Roy, Shuvendu and Hajimirsadeghi, Hossein and Zhai, Mengyao and Samei, Golnoosh},
  booktitle={NeurIPS 2025 Workshop on Mathematical Reasoning and AI},
  year={2025}
}
```