
Commit fe44806

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch (#3935)
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
1 parent 251c048 commit fe44806

File tree: 6 files changed, +94 −26 lines


docs/source/grpo_trainer.md

Lines changed: 10 additions & 1 deletion
@@ -80,6 +80,12 @@ It was shown in the paper [Understanding R1-Zero-Like Training: A Critical Persp
 
 </Tip>
 
+<Tip>
+
+[Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting `scale_rewards="batch"` in [`GRPOConfig`].
+
+</Tip>
+
 ### Estimating the KL divergence
 
 KL divergence is estimated using the approximator introduced by [Schulman et al. (2020)](http://joschu.net/blog/kl-approx.html). The approximator is defined as follows:
@@ -138,6 +144,7 @@
 \mathcal{L}_{\text{DAPO}}(\theta) = - \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t},
 $$
 
+To use this formulation, set `loss_type="bnpo"` in [`GRPOConfig`]. Note that we do not reproduce the DAPO formulation exactly: when using gradient accumulation, the loss is computed over the total number of tokens in each batch, not over the accumulated batches. `loss_type="bnpo"` is equivalent to the original DAPO formulation only when `gradient_accumulation_steps=1`.
 
 Furthermore, it was demonstrated in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that the initial GRPO formulation introduces a response length bias. They show that while the DAPO formulation reduces this bias, it does not eliminate it completely. To fully remove this bias, they propose dividing by a constant instead of the sequence length, resulting in the following formulation:
 
@@ -162,7 +169,9 @@ While training and evaluating, we record the following reward metrics:
 - `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
 - `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
 - `reward`: The overall average reward after applying reward weights.
-- `reward_std`: The standard deviation of the overall reward within each batch after applying reward weights.
+- `reward_std`: The standard deviation of rewards after applying reward weights.
+  - If `scale_rewards` is `"group"` or `"none"`, this is the average of the per-group standard deviations.
+  - If `scale_rewards` is `"batch"`, this is the standard deviation computed over all rewards in the batch (ignoring groups).
 - `frac_reward_zero_std`: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
 - `entropy`: Average entropy of token predictions across generated completions. (If `mask_truncated_completions=True`, masked sequences tokens are excluded.)
 - `kl`: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if `beta` is nonzero.
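
To make the two `reward_std` conventions above concrete, here is a small sketch (not part of the commit; the toy `rewards` tensor and group size are made up) showing what gets logged under each `scale_rewards` setting:

```python
import torch

# Toy batch: 2 prompts x 4 completions each (num_generations = 4), flattened group-by-group.
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
num_generations = 4

group_std = rewards.view(-1, num_generations).std(dim=1)  # one std per group
batch_std = rewards.std()                                 # single std over the whole batch

# scale_rewards="group" or "none": `reward_std` is the average of the per-group stds.
print(group_std.mean().item())
# scale_rewards="batch": `reward_std` is the batch-level std (groups ignored).
print(batch_std.item())
```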

docs/source/logging.md

Lines changed: 7 additions & 3 deletions
@@ -63,7 +63,8 @@ Here's a brief explanation for the logged metrics provided in the data for the G
 
 * `num_tokens`: Total number of input tokens processed during training so far.
 
-**Completions:**
+#### Completions
+
 * `completions/mean_length`: Mean length of all generated completions (including those not ending with an EOS token).
 * `completions/min_length`: Minimum length among all generated completions.
 * `completions/max_length`: Maximum length among all generated completions.
@@ -72,13 +73,15 @@ Here's a brief explanation for the logged metrics provided in the data for the G
 * `completions/min_terminated_length`: Minimum length among completions that ended with an EOS token.
 * `completions/max_terminated_length`: Maximum length among completions that ended with an EOS token.
 
-**Rewards:**
+#### Rewards
+
 * `rewards/{reward_func_name}/mean`: The mean reward obtained from a specific, named reward function (e.g., `rewards/my_custom_reward/mean`). This is logged for each reward function used.
 * `rewards/{reward_func_name}/std`: The standard deviation of rewards from a specific, named reward function.
 * `reward`: The overall mean of the (potentially weighted and, if `args.scale_rewards` is true, normalized) rewards, after group-wise normalization (advantages).
 * `reward_std`: The standard deviation of the (potentially weighted) rewards *before* group-wise normalization for advantages.
 
-**Policy and Loss Metrics:**
+#### Policy and Loss Metrics
+
 * `kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy. This is logged only if `beta` (the KL coefficient in `GRPOConfig`) is non-zero.
 * `entropy`: Average entropy of token predictions across generated completions.
 * If Liger GRPOLoss is used (`use_liger_loss: True` in `GRPOConfig`):
@@ -91,6 +94,7 @@ Here's a brief explanation for the logged metrics provided in the data for the G
 * `clip_ratio/region_mean`: The mean fraction of instances where the probability ratio was clipped at either the lower or upper bound.
 
 ### Crucial GRPO values
+
 During GRPO training, monitor these values for insights into performance and stability:
 
 1. `reward`: This is the primary objective. It reflects the (group-wise normalized) rewards the policy is achieving. It should generally increase during successful training.

docs/source/paper_index.md

Lines changed: 35 additions & 0 deletions
@@ -60,3 +60,38 @@ trainer = SFTTrainer(
     callbacks=[BEMACallback()],
 )
 ```
+
+## Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)
+
+**📜 Paper**: https://huggingface.co/papers/2508.08221
+
+The authors of this paper find that the combination of:
+
+1. scaling rewards by the standard deviation computed over the entire batch and
+2. aggregating loss over the total number of tokens
+
+can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and [DAPO](https://huggingface.co/papers/2503.14476).
+
+TRL supports using these learnings to train a GRPO model by:
+
+```python
+from trl import GRPOConfig
+
+training_args = GRPOConfig(
+    ...
+    scale_rewards="batch",
+    loss_type="bnpo",
+    # Other parameters used
+    beta=0.0,  # = init_kl_coef in the paper
+    top_p=0.99,
+    top_k=100,
+    temperature=0.99,
+    num_generations=8,  # = num_return_sequences in the paper
+    num_iterations=1,  # = ppo_epochs in the paper
+    per_device_train_batch_size=4,
+    gradient_accumulation_steps=32,
+    steps_per_generation=8,  # (rollout_batch_size * num_return_sequences) / (per_device_train_batch_size * gradient_accumulation_steps)
+)
+```
+
+Note that when using gradient accumulation, the loss is aggregated over the total number of tokens in the batch, but not over the accumulated batches. For more details, see the [GRPO Trainer - Loss types](grpo_trainer#loss_types).
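
The `steps_per_generation` comment in the example relies on a rollout batch size that the diff does not spell out. As a sanity check only, the sketch below plugs a hypothetical `rollout_batch_size = 128` into the quoted formula together with the other values from the example:

```python
# Hypothetical rollout_batch_size; the remaining values come from the GRPOConfig example above.
rollout_batch_size = 128          # assumed, not stated in the commit
num_return_sequences = 8          # = num_generations in TRL
per_device_train_batch_size = 4
gradient_accumulation_steps = 32

steps_per_generation = (rollout_batch_size * num_return_sequences) // (
    per_device_train_batch_size * gradient_accumulation_steps
)
print(steps_per_generation)  # 8 with these assumed numbers
```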

tests/test_grpo_trainer.py

Lines changed: 3 additions & 2 deletions
@@ -1267,7 +1267,8 @@ def test_training_vllm_with_additional_generation_kwargs(self):
             new_param = trainer.model.get_parameter(n)
             self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")
 
-    def test_training_no_scale_rewards(self):
+    @parameterized.expand([(False,), ("group",), ("batch",), (True,), ("none",)])
+    def test_training_scale_rewards(self, scale_rewards):
         dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
 
         training_args = GRPOConfig(
@@ -1276,7 +1277,7 @@ def test_training_no_scale_rewards(self):
             per_device_train_batch_size=3,  # reduce the batch size to reduce memory usage
             num_generations=3,  # reduce the number of generations to reduce memory usage
             max_completion_length=8,  # reduce the completion length to reduce memory usage
-            scale_rewards=False,
+            scale_rewards=scale_rewards,
             report_to="none",
         )
         trainer = GRPOTrainer(

trl/trainer/grpo_config.py

Lines changed: 19 additions & 11 deletions
@@ -166,11 +166,16 @@ class GRPOConfig(TrainingArguments):
         reward_weights (`list[float]` or `None`, *optional*, defaults to `None`):
             Weights for each reward function. Must match the number of reward functions. If `None`, all rewards are
             weighted equally with weight `1.0`.
-        scale_rewards (`bool`, *optional*, defaults to `True`):
-            Whether to scale the rewards by dividing them by their standard deviation. If `True` (default), the rewards
-            are normalized by the standard deviation, ensuring they have unit variance. If `False`, no scaling is
-            applied. The [Dr. GRPO paper](https://huggingface.co/papers/2503.20783) recommends not scaling the rewards,
-            as scaling by the standard deviation introduces a question-level difficulty bias.
+        scale_rewards (`str` or `bool`, *optional*, defaults to `"group"`):
+            Specifies the scaling strategy for rewards. Supported values are:
+
+            - `True` or `"group"` (default): rewards are scaled by the standard deviation within each group, ensuring
+              unit variance within a group.
+            - `"batch"`: rewards are scaled by the standard deviation across the entire batch, as recommended in the
+              [PPO Lite paper](https://huggingface.co/papers/2508.08221).
+            - `False` or `"none"`: no scaling is applied. The [Dr. GRPO
+              paper](https://huggingface.co/papers/2503.20783) recommends not scaling rewards, as scaling by the
+              standard deviation introduces a question-level difficulty bias.
         loss_type (`str`, *optional*, defaults to `"bnpo"`):
             Specifies the loss formulation to use. Supported values are:
 
@@ -496,13 +501,16 @@
             "rewards are weighted equally with weight `1.0`."
         },
     )
-    scale_rewards: bool = field(
-        default=True,
+    scale_rewards: str = field(
+        default="group",
         metadata={
-            "help": "Whether to scale the rewards by dividing them by their standard deviation. If `True` (default), "
-            "the rewards are normalized by the standard deviation, ensuring they have unit variance. If `False`, no "
-            "scaling is applied. The Dr. GRPO paper recommends not scaling the rewards, as scaling by the standard "
-            "deviation introduces a question-level difficulty bias."
+            "help": "Specifies the scaling strategy for rewards. Supported values are: "
+            "`True` or `'group'` (default): rewards are scaled by the standard deviation within each group, ensuring "
+            "unit variance within a group. "
+            "`'batch'`: rewards are scaled by the standard deviation across the entire batch, as recommended in the "
+            "PPO Lite paper. "
+            "`False` or `'none'`: no scaling is applied. The Dr. GRPO paper recommends not scaling rewards, as "
+            "scaling by the standard deviation introduces a question-level difficulty bias."
         },
     )
     loss_type: str = field(
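
For illustration only (not code from the diff): the docstring above implies that boolean values remain accepted as aliases of the new string values. A minimal sketch of that mapping, mirroring the normalization applied in the trainer change below:

```python
# `True`/`False` are kept for backward compatibility and normalized to the new string values.
def normalize_scale_rewards(scale_rewards):
    return {True: "group", False: "none"}.get(scale_rewards, scale_rewards)

assert normalize_scale_rewards(True) == "group"     # old scale_rewards=True
assert normalize_scale_rewards(False) == "none"     # old scale_rewards=False
assert normalize_scale_rewards("batch") == "batch"  # new batch-level scaling passes through
```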

trl/trainer/grpo_trainer.py

Lines changed: 20 additions & 9 deletions
@@ -644,7 +644,7 @@ def __init__(
         self.vllm_tensor_parallel_size = args.vllm_tensor_parallel_size  # only applies to colocation mode
         self.use_liger_loss = args.use_liger_loss
         self.loss_type = args.loss_type
-        self.scale_rewards = args.scale_rewards
+        self.scale_rewards = {True: "group", False: "none"}.get(args.scale_rewards, args.scale_rewards)
         self.importance_sampling_level = args.importance_sampling_level
         self.mask_truncated_completions = args.mask_truncated_completions
         self.top_entropy_quantile = args.top_entropy_quantile
@@ -1677,15 +1677,26 @@ def _generate_and_score_completions(
 
         # Compute grouped-wise rewards
         mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
-        std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)
-        is_std_zero = torch.isclose(std_grouped_rewards, torch.zeros_like(std_grouped_rewards))
 
         # Normalize the rewards to compute the advantages
         mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
-        std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
         advantages = rewards - mean_grouped_rewards
-        if self.scale_rewards:
-            advantages = advantages / (std_grouped_rewards + 1e-4)
+
+        if self.scale_rewards in ["group", "none"]:
+            # If self.scale_rewards = "none", we'll still log group level std
+            std_rewards = rewards.view(-1, self.num_generations).std(dim=1)
+            std_rewards = std_rewards.repeat_interleave(self.num_generations, dim=0)
+        elif self.scale_rewards == "batch":
+            # Compute global std
+            std_rewards = rewards.std().expand_as(rewards)
+        else:
+            raise ValueError(
+                f"Invalid value for scale_rewards: {self.scale_rewards}. Must be one of 'batch', 'group', or 'none'."
+            )
+
+        is_std_zero = torch.isclose(std_rewards, torch.zeros_like(std_rewards))
+        if self.scale_rewards != "none":
+            advantages = advantages / (std_rewards + 1e-4)
 
         # Slice to keep only the local part of the data
         process_slice = slice(
@@ -1721,10 +1732,10 @@ def _generate_and_score_completions(
         for i, reward_func_name in enumerate(self.reward_func_names):
             mean_rewards = torch.nanmean(rewards_per_func[:, i]).item()
             self._metrics[mode][f"rewards/{reward_func_name}/mean"].append(mean_rewards)
-            std_rewards = nanstd(rewards_per_func[:, i]).item()
-            self._metrics[mode][f"rewards/{reward_func_name}/std"].append(std_rewards)
+            std_func_rewards = nanstd(rewards_per_func[:, i]).item()
+            self._metrics[mode][f"rewards/{reward_func_name}/std"].append(std_func_rewards)
         self._metrics[mode]["reward"].append(mean_grouped_rewards.mean().item())
-        self._metrics[mode]["reward_std"].append(std_grouped_rewards.mean().item())
+        self._metrics[mode]["reward_std"].append(std_rewards.mean().item())
         self._metrics[mode]["frac_reward_zero_std"].append(is_std_zero.float().mean().item())
 
         # Log prompt and completion texts
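
Pulled out of the trainer for readability, here is a self-contained sketch of the advantage computation the hunk above implements (assuming a flat 1-D `rewards` tensor ordered group-by-group); it is an illustration, not the trainer code itself:

```python
import torch

def compute_advantages(rewards: torch.Tensor, num_generations: int, scale_rewards: str = "group"):
    # Center each reward on its group mean.
    mean_grouped = rewards.view(-1, num_generations).mean(dim=1)
    mean_grouped = mean_grouped.repeat_interleave(num_generations, dim=0)
    advantages = rewards - mean_grouped

    if scale_rewards in ["group", "none"]:
        # Per-group std; with "none" it is only used for logging, not for scaling.
        std_rewards = rewards.view(-1, num_generations).std(dim=1)
        std_rewards = std_rewards.repeat_interleave(num_generations, dim=0)
    elif scale_rewards == "batch":
        # One std over the whole batch, broadcast to every sample.
        std_rewards = rewards.std().expand_as(rewards)
    else:
        raise ValueError(f"Invalid value for scale_rewards: {scale_rewards}")

    if scale_rewards != "none":
        advantages = advantages / (std_rewards + 1e-4)
    return advantages, std_rewards
```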
