✨ Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
[December 5, 2025] 🔍 We propose entropy ratio clipping (ERC) to impose a global constraint on the output distribution of the policy model. Experiments demonstrate that ERC can significantly improve the stability of off-policy training. 📄 The paper is available on arXiv.
[September 26, 2025] 🔍 We further explored GPPO in depth and proposed CE-GPPO, focusing on the impact of ppo-clip tokens on entropy. 📄 The paper is available on arXiv and HuggingFace Daily.
[September 15, 2025] GPPO brings benefits in external industrial scenarios. 💼✨ Check the Xiaohongshu link. 🔗
[August 12, 2025] 🚀 We released the checkpoint for KlearReasoner-8B, along with the training data.
[August 11, 2025] 🔬 We conducted a preliminary exploration of GPPO with KlearReasoner-8B.
[August 11, 2025] 🏆 We released KlearReasoner-8B, achieving SOTA performance among small-scale 7/8B models.
[August 11, 2025] 📢 KlearReasoner is available on arXiv and HuggingFace Daily.
We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. We investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens.
Benchmark accuracy of Klear-Reasoner-8B on AIME 2024/2025 (avg@64), LiveCodeBench V5 (2024/08/01-2025/02/01, avg@8), and V6 (2025/02/01-2025/05/01, avg@8).
Klear-Reasoner is an 8-billion-parameter reasoning model that achieves SOTA performance on challenging math and coding benchmarks:
| Benchmark | AIME 2024 | AIME 2025 | LiveCodeBench V5 | LiveCodeBench V6 |
|---|---|---|---|---|
| Score | 90.5 % | 83.2 % | 66.0 % | 58.1 % |
The model combines:
- Quality-centric long CoT SFT – distilled from DeepSeek-R1-0528.
- Gradient-Preserving Clipping Policy Optimization (GPPO) – a novel RL method that keeps gradients from clipped tokens to boost exploration & convergence.
GPPO is a plug-and-play replacement for PPO/GRPO that keeps the clipped tokens in the computational graph and lets their gradients flow in a bounded, controlled way.
Classic importance-ratio clipping (PPO/GRPO) drops the gradient of every token whose ratio $r$ falls outside the clipping interval. Two side-effects appear:

- High-entropy exploratory tokens (large $r$, positive advantage) are killed → less exploration.
- Negative trajectories (small $r$, negative advantage) are ignored → slower correction.
Let

- $\delta = r_t^{(j)}(\theta) = \pi_\theta / \pi_{\text{old}}$ (importance ratio),
- $\tilde A^{(j)}$ = group-relative advantage,
- $\text{sg}(\cdot)$ = stop-gradient (detach from back-propagation).

The GPPO objective is (shown per token; tokens are averaged over the group as in GRPO/DAPO):

$$\mathcal{J}_{\text{GPPO}}(\theta)=\mathbb{E}\!\left[\min\!\left(\delta\,\tilde A^{(j)},\ \operatorname{clip}\!\left(\delta,\ \tfrac{1-\varepsilon_{\text{low}}}{\operatorname{sg}(\delta)}\,\delta,\ \tfrac{1+\varepsilon_{\text{high}}}{\operatorname{sg}(\delta)}\,\delta\right)\tilde A^{(j)}\right)\right]$$
- Forward: behaves exactly like Clip-Higher, since $\tfrac{1\pm\varepsilon}{\text{sg}(\delta)}\,\delta$ evaluates to $1\pm\varepsilon$.
- Backward: the factor $\tfrac{1\pm\varepsilon}{\text{sg}(\delta)}$ keeps the clipped magnitude but still propagates a mild gradient through $\delta$.
Let $\varepsilon_{\text{low}}$ and $\varepsilon_{\text{high}}$ denote the lower and upper clip thresholds. The per-token gradient is

$$\nabla_\theta \mathcal{J}_{\text{GPPO}} \;\propto\; g_t\,\tilde A^{(j)}\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}), \qquad
g_t =
\begin{cases}
1-\varepsilon_{\text{low}}, & \delta < 1-\varepsilon_{\text{low}} \text{ and } \tilde A^{(j)} < 0,\\
1+\varepsilon_{\text{high}}, & \delta > 1+\varepsilon_{\text{high}} \text{ and } \tilde A^{(j)} > 0,\\
\delta, & \text{otherwise},
\end{cases}$$

where $g_t$ is the gradient coefficient contributed by the (possibly clipped) importance ratio.

- Never zero → every token contributes to learning.
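A minimal, self-contained PyTorch sketch (illustrative only, not our training code) of what this means in practice: the gradient-preserving bound has the same forward value as the standard clip bound, but its backward pass still carries the bounded coefficient.

```python
import torch

eps_high = 0.28
logp_old = torch.tensor(-2.0)
logp_new = torch.tensor(-1.0, requires_grad=True)
ratio = torch.exp(logp_new - logp_old)  # ≈ 2.72, well above 1 + eps_high

# Standard PPO/GRPO clipping: the clipped value is a constant, so no gradient flows back.
standard = torch.clamp(ratio, 1 - eps_high, 1 + eps_high)
standard.backward(retain_graph=True)
print(standard.item(), logp_new.grad.item())  # 1.28, 0.0

logp_new.grad = None

# GPPO clipping: identical forward value, but the bound is built from `ratio` itself,
# so the clipped token still receives the bounded gradient coefficient (1 + eps_high).
gppo = torch.clamp(ratio, (1 - eps_high) / ratio.detach() * ratio, (1 + eps_high) / ratio.detach() * ratio)
gppo.backward()
print(gppo.item(), logp_new.grad.item())  # 1.28, 1.28
```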
For finer-grained control, GPPO also admits a general form with coefficients $\beta_1$ and $\beta_2$ that scale the gradient passed back through the low and high clip bounds, respectively (see "General Form of GPPO" below). Empirically we set $\beta_1 = \beta_2 = 1$ (native GPPO) for the results reported here.
Implementing the GPPO loss only requires modifying one line of code relative to the PPO/GRPO loss:

```python
-advantages * torch.clamp(ratio, (1 - cliprange_low) / ratio.detach() * ratio, (1 + cliprange_high) / ratio.detach() * ratio)
```
The complete loss implementation is as follows:
```python
import torch
# verl_F and agg_loss come from the verl framework, e.g.
#   import verl.utils.torch_functional as verl_F
#   from verl.trainer.ppo.core_algos import agg_loss


def compute_gppo_loss(
    old_log_prob,
    log_prob,
    advantages,
    response_mask,
    cliprange=None,
    cliprange_low=None,
    cliprange_high=None,
    clip_ratio_c=3.0,
    loss_agg_mode="token-mean",
):
    """Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

    Args:
        old_log_prob: `(torch.Tensor)`
            shape: (bs, response_length)
        log_prob: `(torch.Tensor)`
            shape: (bs, response_length)
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)
        cliprange: (float)
            The clip range used in PPO. See https://arxiv.org/abs/1707.06347
        cliprange_low: (float)
            The lower clip range used in PPO.
        cliprange_high: (float)
            The higher clip range used in PPO.
        clip_ratio_c: (float) default: 3.0
            The lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729
        loss_agg_mode: (str) choices:
            "token-mean" / "seq-mean-token-sum" / "seq-mean-token-mean" / "seq-mean-token-sum-norm";
            "token-mean" is the default behavior

    Returns:
        pg_loss: `a scalar torch.Tensor`
            policy gradient loss computed via GPPO
        pg_clipfrac: (float)
            the fraction of policy gradient loss being clipped
        ppo_kl: (float)
            the estimated KL divergence between the latest updating policy and the old sampling policy
        pg_clipfrac_lower: (float)
            the fraction of policy gradient loss being clipped when the advantage is negative
    """
    assert clip_ratio_c > 1.0, (
        "The lower bound of the clip_ratio_c for dual-clip PPO should be greater than 1.0,"
        + f" but get the value: {clip_ratio_c}."
    )

    negative_approx_kl = log_prob - old_log_prob
    ratio = torch.exp(negative_approx_kl)
    ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)

    pg_losses1 = -advantages * ratio
    if cliprange_low is None:
        cliprange_low = cliprange
    if cliprange_high is None:
        cliprange_high = cliprange
    # GPPO clipping: the forward values equal the usual (1 - eps_low) / (1 + eps_high) bounds,
    # but gradients still flow through `ratio` for clipped tokens.
    pg_losses2 = -advantages * torch.clamp(
        ratio, (1 - cliprange_low) / ratio.detach() * ratio, (1 + cliprange_high) / ratio.detach() * ratio
    )
    clip_pg_losses1 = torch.maximum(pg_losses1, pg_losses2)
    pg_clipfrac = verl_F.masked_mean(torch.gt(pg_losses2, pg_losses1).float(), response_mask)

    # Dual-clip PPO lower bound, applied only where the advantage is negative.
    pg_losses3 = -advantages * clip_ratio_c
    clip_pg_losses2 = torch.min(pg_losses3, clip_pg_losses1)
    pg_clipfrac_lower = verl_F.masked_mean(
        torch.gt(clip_pg_losses1, pg_losses3) * (advantages < 0).float(), response_mask
    )

    pg_losses = torch.where(advantages < 0, clip_pg_losses2, clip_pg_losses1)
    pg_loss = agg_loss(loss_mat=pg_losses, loss_mask=response_mask, loss_agg_mode=loss_agg_mode)

    return pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower
```
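A hedged usage sketch (assuming verl is installed so that `verl_F.masked_mean` and `agg_loss` resolve, and that `compute_gppo_loss` above is importable from your training code; the shapes and clip values below are illustrative only):

```python
import torch

bs, resp_len = 4, 16
old_log_prob = torch.randn(bs, resp_len)
log_prob = (old_log_prob + 0.1 * torch.randn(bs, resp_len)).requires_grad_(True)
advantages = torch.randn(bs, resp_len)
response_mask = torch.ones(bs, resp_len)

pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = compute_gppo_loss(
    old_log_prob, log_prob, advantages, response_mask,
    cliprange_low=0.2, cliprange_high=0.28,  # Clip-Higher style asymmetric ranges (illustrative)
)
pg_loss.backward()  # gradients also flow through clipped tokens
print(pg_loss.item(), pg_clipfrac.item(), ppo_kl.item(), pg_clipfrac_lower.item())
```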
Comparison of GPPO, GRPO w/ Clip Higher, and CISPO in mathematical RL training. All methods are trained from an earlier long-CoT SFT checkpoint with a sequence length of 32K tokens. For GRPO, we use the Clip-Higher strategy from DAPO with its recommended clipping thresholds.
| Model | AIME2024 avg@64 | AIME2025 avg@64 | HMMT2025 avg@64 | LCB V5 avg@8 | LCB V6 avg@8 |
|---|---|---|---|---|---|
| AReal-boba-RL-7B | 61.9 | 48.3 | 29.4 | 34.3 | 31.0† |
| MiMo-7B-RL | 68.2 | 55.4 | 35.7 | 57.8 | 49.3 |
| Skywork-OR1-7B | 70.2 | 54.6 | 35.7 | 47.6 | 42.7 |
| AceReason-Nemotron-1.1-7B | 72.6 | 64.8 | 42.9 | 57.2 | 52.1 |
| POLARIS-4B-Preview | 81.2 | 79.4 | 58.7 | 58.5† | 53.0† |
| Qwen3-8B | 76.0 | 67.3 | 44.7† | 57.5 | 48.4† |
| Deepseek-R1-0528-Distill-8B | 86.0 | 76.3 | 61.5 | 61.0† | 51.6† |
| OpenReasoning-Nemotron-7B | 84.7 | 78.2 | 63.5 | 65.6† | 56.3† |
| Klear-Reasoner-8B-SFT | 75.6 | 70.1 | 57.6 | 58.5 | 49.6 |
| Klear-Reasoner-8B | 83.2 | 75.6 | 60.3 | 61.6 | 53.1 |
| w/ 64K Inference Budget | 90.5 | 83.2 | 70.8 | 66.0 | 58.1 |
We report the average pass@1 results (avg@n), with all other evaluation settings following the DeepSeek-R1 assessment framework (temperature=0.6, top_p=0.95).
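For clarity, avg@n means: sample n completions per problem, score each as correct or incorrect, and average. A tiny illustrative sketch (not the official evaluation script):

```python
def avg_at_n(correct: list[list[bool]]) -> float:
    """correct[i][j]: whether the j-th of n samples for problem i is correct."""
    per_problem = [sum(c) / len(c) for c in correct]  # pass@1 estimate per problem
    return sum(per_problem) / len(per_problem)        # average over the benchmark

# e.g. 2 problems, n = 4 samples each -> (0.75 + 0.25) / 2 = 0.5
print(avg_at_n([[True, True, False, True], [False, False, True, False]]))
```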
```bash
git clone https://github.com/suu990901/Klear_Reasoner
cd Klear_Reasoner
pip install -e .
pip install -r requirements.txt
```

For the code verifier, we use Firejail as the sandbox environment. Additionally, we implement multi-process control based on Pebble, enabling automatic resource reclamation when a task times out. For mathematics, we use math_verify for judging.
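As a rough illustration of how these pieces can fit together, here is a hedged sketch of a timeout-guarded math judge built on math_verify and Pebble. The function names and parameters are made up for the example; the repo's actual verifier code may differ.

```python
from concurrent.futures import TimeoutError

from math_verify import parse, verify
from pebble import ProcessPool


def math_reward(model_answer: str, ground_truth: str) -> float:
    """1.0 if math_verify judges the answer equivalent to the ground truth, else 0.0."""
    return 1.0 if verify(parse(ground_truth), parse(model_answer)) else 0.0


def judge_with_timeout(pairs, timeout_s=10):
    """Score (answer, ground_truth) pairs; Pebble reclaims the worker process on timeout."""
    rewards = []
    with ProcessPool(max_workers=8) as pool:
        futures = [pool.schedule(math_reward, args=pair, timeout=timeout_s) for pair in pairs]
        for future in futures:
            try:
                rewards.append(future.result())
            except TimeoutError:
                rewards.append(0.0)  # treat a timed-out verification as incorrect
    return rewards


if __name__ == "__main__":
    print(judge_with_timeout([("\\boxed{4}", "4")]))  # [1.0]
```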
Please refer to the format of the two provided datasets, Math RL and Code RL, for the training data. The format for a single math entry is as follows:
{"data_source": "math_longcot_math_verify", "prompt": [{"content": "Let $n=9867$. If you calculated $n^{3}-n^{2}$, what would be the unit digit found?\n(a) 0\n(b) 2\n(c) 4\n(d) 6\n(e) 8", "role": "user"}], "ability": "math", "reward_model": {"ground_truth": "4", "style": "rule"}, "__index_level_0__": "29999"} Here, the data_source field is set to "math_longcot_math_verify".
The format for a single code entry is as follows:
{"hash": "47c43857280be8a7557cc36b998b3012", "ability": "code", "data_source": "coder1_longcot", "prompt": [{"content": "You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests.\n\nTakahashi is planning to eat N dishes.\nThe i-th dish he plans to eat is sweet if S_i = sweet, and salty if S_i = salty.\nIf he eats two sweet dishes consecutively, he will feel sick and be unable to eat any more dishes.\nDetermine whether he can eat all the dishes...", "role": "user"}], "reward_model": {"ground_truth": "...", "style": "rule"}} Here, the data_source field is set to "coder1_longcot".
The data_source field affects the choice of verifier.
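A hedged sketch of such a dispatch (names are illustrative, not the repo's actual reward code): `math_reward` refers to the math_verify sketch above, and `run_code_tests` is a hypothetical placeholder for the Firejail-sandboxed code judge.

```python
def run_code_tests(solution: str, ground_truth: str) -> float:
    raise NotImplementedError("hypothetical placeholder for the sandboxed code verifier")


def compute_reward(sample: dict, solution: str) -> float:
    """Route a training sample to the verifier selected by its data_source field."""
    ground_truth = sample["reward_model"]["ground_truth"]
    if sample["data_source"] == "math_longcot_math_verify":
        return math_reward(solution, ground_truth)     # math_verify-based judging
    if sample["data_source"] == "coder1_longcot":
        return run_code_tests(solution, ground_truth)  # unit tests inside Firejail
    raise ValueError(f"Unknown data_source: {sample['data_source']}")
```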
For multi-node training, ensure all nodes are started and connected via Ray before executing the training script. Below is a brief setup guide for Ray across multiple machines:
On the first node (typically called node0), run:
```bash
ray start --head --dashboard-host=0.0.0.0
```

Get the IP address of the master node:

```bash
MASTER_IP=$(hostname -I | awk '{print $1}')
```

On each additional worker node (e.g., node1), run the following, replacing the IP with that of your head node:

```bash
ray start --address="$MASTER_IP:6379"
```

Run the following script on the master node to start the training task.
```bash
bash recipe/dapo/perf_run_dapo_ours_math.sh # For Math RL
bash recipe/dapo/perf_run_dapo_ours_code.sh # For Code RL
```

In the startup script, you need to set the following variables:
```bash
YOUR_MODEL_PATH="<your_model_path>"
CKPTS_SAVE_DIR="<ckpts_save_path>"
YOUR_TRAIN_FILE="<train_data_path>"
YOUR_TEST_FILE="<test_data_path>"
```

For training stability, if you train with a sequence length shorter than 32K, we recommend enabling `actor_rollout_ref.actor.overlong_filter=True`, which filters out rollout samples that exceed the maximum sequence length.

We observed that when training with a 32K sequence length, the model still optimizes stably even with `actor_rollout_ref.actor.overlong_filter=False`. However, if the maximum sequence length is reduced to 16K, training becomes highly unstable regardless of whether GPPO or GRPO is used.
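Conceptually, the filter simply drops rollout samples whose responses were cut off at the maximum length; a tiny illustrative sketch (not verl's actual implementation):

```python
import torch

def overlong_sample_mask(response_lengths: torch.Tensor, max_response_length: int) -> torch.Tensor:
    """1.0 for samples that finished naturally, 0.0 for samples truncated at the length cap."""
    return (response_lengths < max_response_length).float()

print(overlong_sample_mask(torch.tensor([100, 16384, 512]), 16384))  # tensor([1., 0., 1.])
```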
Our exploration of GPPO is still ongoing, so stay tuned. Although native GPPO training remains stable on KlearReasoner-8B, we later found in other internal business models that an overly large gradient from the low side of the PPO clip range can restrict exploration and lead to entropy collapse. To address this issue, we propose two solutions:
• General Form of GPPO: Reducing the hyperparameter beta1 can decrease the gradient backpropagated from the low side of the clip range. Based on our preliminary experiments, setting beta1 to 0.25 or 0.5 and beta2 to 1 yields good performance. Example script:
```bash
bash recipe/dapo/perf_run_dapo_ours_math_general_gppo.sh # For Math RL
```

• Retaining only the gradient from the high side of the clip range: This approach significantly alleviates entropy collapse and encourages the model to explore (a sketch of both variants follows the example script below). Example script:
```bash
bash recipe/dapo/perf_run_dapo_ours_math_only_high.sh # For Math RL
```
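The sketch below shows one possible way to realize both variants in PyTorch. The exact parameterization used in our scripts may differ; `beta_low`/`beta_high` are assumed here to play the role of the scripts' beta1/beta2.

```python
import torch

def gppo_bound(ratio: torch.Tensor, clip_value: float, beta: float) -> torch.Tensor:
    """Clip bound with forward value `clip_value`; `beta` scales the gradient passed back through it
    (beta = 1 -> gradient-preserving GPPO bound, beta = 0 -> standard detached bound)."""
    preserved = clip_value / ratio.detach() * ratio   # forward: clip_value, gradient ∝ clip_value
    detached = torch.full_like(ratio, clip_value)     # forward: clip_value, no gradient
    return beta * preserved + (1.0 - beta) * detached


def gppo_clip(ratio, cliprange_low, cliprange_high, beta_low=1.0, beta_high=1.0):
    lo = gppo_bound(ratio, 1.0 - cliprange_low, beta_low)
    hi = gppo_bound(ratio, 1.0 + cliprange_high, beta_high)
    return torch.clamp(ratio, lo, hi)

# General form: beta_low < 1 weakens the gradient from the low side of the clip range.
# Only-high-side variant: beta_low = 0.0, beta_high = 1.0.
```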
When expanding the inference budget to 64K, we adopt the YaRN method with a scaling factor of 2.5. The evaluation data for AIME24, AIME25, and HMMT2025 are available in our GitHub repository under the benchmarks directory. For LiveCodeBench, please download the data from the official website.
You can run the following commands to perform inference and evaluation:
```bash
git clone https://github.com/suu990901/KlearReasoner
cd KlearReasoner/benchmarks
python inference.py --model "<KlearReasoner-8B_path>" --n 64 --dataset_path ./benchmarks/aime24.qs.jsonl
python judge_math.py "<path_to_inference_results>"
```

If you find this work helpful, please cite our papers:
```bibtex
@article{su2025entropy,
  title={Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning},
  author={Su, Zhenpeng and Pan, Leiyu and Lv, Minxuan and Mei, Tiehua and Lin, Zijia and Li, Yuntao and Hu, Wenping and Tang, Ruiming and Gai, Kun and Zhou, Guorui},
  journal={arXiv preprint arXiv:2512.05591},
  year={2025}
}

@misc{su2025cegppocontrollingentropygradientpreserving,
  title={CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning},
  author={Zhenpeng Su and Leiyu Pan and Minxuan Lv and Yuntao Li and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
  year={2025},
  eprint={2509.20712},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2509.20712}
}

@article{DBLP:journals/corr/abs-2508-07629,
  author={Zhenpeng Su and Leiyu Pan and Xue Bai and Dening Liu and Guanting Dong and Jiaming Huang and Wenping Hu and Fuzheng Zhang and Kun Gai and Guorui Zhou},
  title={Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization},
  journal={CoRR},
  volume={abs/2508.07629},
  year={2025},
  url={https://doi.org/10.48550/arXiv.2508.07629},
  doi={10.48550/ARXIV.2508.07629},
  eprinttype={arXiv},
  eprint={2508.07629},
  timestamp={Sat, 13 Sep 2025 14:46:27 +0200},
  biburl={https://dblp.org/rec/journals/corr/abs-2508-07629.bib},
  bibsource={dblp computer science bibliography, https://dblp.org}
}
```