docs/source/grpo_trainer.md (10 additions, 1 deletion)
@@ -80,6 +80,12 @@

</Tip>

<Tip>

[Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting `scale_rewards="batch"` in [`GRPOConfig`].

</Tip>
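A minimal sketch of enabling this strategy (assuming the remaining [`GRPOConfig`] arguments can stay at their defaults):

```python
from trl import GRPOConfig

# Group-level mean, batch-level std for reward scaling (Lite PPO recommendation)
training_args = GRPOConfig(scale_rewards="batch")
```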
### Estimating the KL divergence
KL divergence is estimated using the approximator introduced by [Schulman et al. (2020)](http://joschu.net/blog/kl-approx.html). The approximator is defined as follows:
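A sketch of the k3 estimator from that post, written per token with $\pi_\theta$ as the current policy and $\pi_{\mathrm{ref}}$ as the reference policy (the notation here is illustrative, not a verbatim reproduction of the documented equation):

$$
D_{\mathrm{KL}}\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \approx \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1
$$

This estimator has the form $r - \log r - 1$ with $r = \pi_{\mathrm{ref}} / \pi_\theta$, so it is always non-negative and vanishes when the two policies assign the same probability to the sampled token.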
To use this batch-level token aggregation (BNPO), set `loss_type="bnpo"` in [`GRPOConfig`]. Note that we do not reproduce the DAPO formulation exactly: when using gradient accumulation, the loss is computed over the total number of tokens in each batch, not over the accumulated batches. `loss_type="bnpo"` is equivalent to the original DAPO formulation only when `gradient_accumulation_steps=1`.
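For example, a sketch of a configuration that matches the original DAPO formulation (the explicit `gradient_accumulation_steps=1` is only there to satisfy the equivalence noted above):

```python
from trl import GRPOConfig

# BNPO loss: aggregated over all tokens in the batch; equivalent to DAPO
# only because no gradient accumulation is used here.
training_args = GRPOConfig(loss_type="bnpo", gradient_accumulation_steps=1)
```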
Furthermore, it was demonstrated in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that the initial GRPO formulation introduces a response length bias. They show that while the DAPO formulation reduces this bias, it does not eliminate it completely. To fully remove this bias, they propose dividing by a constant instead of the sequence length, resulting in the following formulation:
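A sketch of that constant-normalized objective, writing $G$ for the number of completions per prompt, $l_{i,t}$ for the per-token loss term, and $L$ for a fixed constant (e.g. the maximum completion length); the exact notation in the paper may differ:

$$
\mathcal{L}(\theta) = -\frac{1}{L\,G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} l_{i,t}
$$

Because the normalizer no longer depends on $|o_i|$, each completion contributes in proportion to its token count, which removes the length bias described above. In TRL this variant corresponds to `loss_type="dr_grpo"` in [`GRPOConfig`].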
@@ -162,7 +169,9 @@ While training and evaluating, we record the following reward metrics:

- `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
- `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
- `reward`: The overall average reward after applying reward weights.
- `reward_std`: The standard deviation of rewards after applying reward weights (see the sketch after this list).
  - If `scale_rewards` is `"group"` or `"none"`, this is the average of the per-group standard deviations.
  - If `scale_rewards` is `"batch"`, this is the standard deviation computed over all rewards in the batch (ignoring groups).
- `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or all are incorrect).
- `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, tokens from masked sequences are excluded.)
- `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
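A minimal sketch of the difference between the two `reward_std` variants (tensor shapes and values are hypothetical, not TRL internals):

```python
import torch

# Hypothetical rewards: 4 prompts (groups) x 8 completions per prompt
rewards = torch.randn(4, 8)

# scale_rewards="group" or "none": average of the per-group standard deviations
reward_std_group = rewards.std(dim=1).mean()

# scale_rewards="batch": standard deviation over all rewards, ignoring groups
reward_std_batch = rewards.flatten().std()
```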
docs/source/logging.md (7 additions, 3 deletions)
@@ -63,7 +63,8 @@

* `num_tokens`: Total number of input tokens processed during training so far.

#### Completions

* `completions/mean_length`: Mean length of all generated completions (including those not ending with an EOS token).
* `completions/min_length`: Minimum length among all generated completions.
* `completions/max_length`: Maximum length among all generated completions.
@@ -72,13 +73,15 @@

* `completions/min_terminated_length`: Minimum length among completions that ended with an EOS token.
* `completions/max_terminated_length`: Maximum length among completions that ended with an EOS token.

#### Rewards

* `rewards/{reward_func_name}/mean`: The mean reward obtained from a specific, named reward function (e.g., `rewards/my_custom_reward/mean`). This is logged for each reward function used.
* `rewards/{reward_func_name}/std`: The standard deviation of rewards from a specific, named reward function.
* `reward`: The overall mean of the (potentially weighted and, if `args.scale_rewards` is enabled, normalized) rewards, after group-wise normalization into advantages (sketched below).
* `reward_std`: The standard deviation of the (potentially weighted) rewards *before* group-wise normalization for advantages.
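For intuition, a sketch of the group-wise normalization that turns rewards into advantages (variable names and the small epsilon are illustrative, not TRL internals):

```python
import torch

# Hypothetical rewards: one row per prompt, one column per completion
rewards = torch.randn(4, 8)

group_mean = rewards.mean(dim=1, keepdim=True)
group_std = rewards.std(dim=1, keepdim=True)

# Group-wise normalized rewards, i.e. the advantages used for the policy update
advantages = (rewards - group_mean) / (group_std + 1e-4)
```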

#### Policy and Loss Metrics

* `kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy. This is logged only if `beta` (the KL coefficient in `GRPOConfig`) is non-zero.
* `entropy`: Average entropy of token predictions across generated completions.
* If Liger GRPOLoss is used (`use_liger_loss: True` in `GRPOConfig`):

@@ -91,6 +94,7 @@

* `clip_ratio/region_mean`: The mean fraction of instances where the probability ratio was clipped at either the lower or upper bound (see the sketch below).
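As a rough sketch of what this fraction measures (tensor contents, names, and the clipping range are illustrative):

```python
import torch

# Placeholder per-token log-probabilities under the current and old policies
logprobs = torch.randn(16)
old_logprobs = torch.randn(16)

ratio = torch.exp(logprobs - old_logprobs)
eps_low, eps_high = 0.2, 0.2  # illustrative clipping range
clipped = (ratio < 1 - eps_low) | (ratio > 1 + eps_high)
clip_ratio_region_mean = clipped.float().mean()  # fraction clipped at either bound
```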

### Crucial GRPO values

During GRPO training, monitor these values for insights into performance and stability:

1. `reward`: This is the primary objective. It reflects the (group-wise normalized) rewards the policy is achieving. It should generally increase during successful training.

The authors of this paper find that the combination of:

1. scaling rewards by the standard deviation computed over the entire batch and
2. aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and [DAPO](https://huggingface.co/papers/2503.14476).

TRL supports applying these findings when training a GRPO model:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    scale_rewards="batch",
    loss_type="bnpo",
    # Other parameters used
    beta=0.0,  # = init_kl_coef in the paper
    top_p=0.99,
    top_k=100,
    temperature=0.99,
    num_generations=8,  # = num_return_sequences in the paper
)
```

Note that when using gradient accumulation, the loss is aggregated over the total number of tokens in the batch, but not over the accumulated batches. For more details, see [GRPO Trainer - Loss types](grpo_trainer#loss_types).