🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch #3935
Conversation
Pull Request Overview
This PR enhances the GRPO (Group Relative Policy Optimization) trainer by allowing flexible reward scaling strategies. It changes the scale_rewards parameter from a boolean to a string that accepts three values: "none", "group", and "batch", enabling different levels of standard deviation computation for reward normalization. This supports implementing the PPOLite technique from the referenced paper.
- Updates the `scale_rewards` parameter to accept string values instead of a boolean
- Implements batch-level reward scaling using the global standard deviation across all processes (sketched below)
- Adds validation for the new parameter values and comprehensive test coverage
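For context, here is a minimal sketch of how the std used for scaling differs between the settings. This is not the trainer's actual code: it omits the cross-process gather used for the global batch std and assumes rewards arrive as a flat tensor with `num_generations` completions per prompt.

```python
import torch


def scale_advantages(rewards: torch.Tensor, num_generations: int, scale_rewards: str = "group") -> torch.Tensor:
    # rewards: flat tensor of shape (num_prompts * num_generations,)
    grouped = rewards.view(-1, num_generations)

    # GRPO always centers rewards by the per-group (per-prompt) mean.
    advantages = (grouped - grouped.mean(dim=1, keepdim=True)).view(-1)

    if scale_rewards == "group":
        # Per-group std: each prompt's completions are normalized independently.
        std = grouped.std(dim=1, keepdim=True).expand_as(grouped).reshape(-1)
        advantages = advantages / (std + 1e-4)
    elif scale_rewards == "batch":
        # One std over the whole batch (PPO Lite-style): a single, shared scale.
        advantages = advantages / (rewards.std() + 1e-4)
    elif scale_rewards != "none":
        raise ValueError(f"Unexpected scale_rewards value: {scale_rewards!r}")
    return advantages
```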
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| trl/trainer/grpo_trainer.py | Updates reward scaling logic to support group vs batch-level standard deviation computation |
| trl/trainer/grpo_config.py | Changes scale_rewards parameter type and adds validation for valid values |
| tests/test_grpo_trainer.py | Adds new test for batch-level reward scaling and updates existing test |
| docs/source/paper_index.md | Documents the PPOLite technique configuration example |
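The config-side validation listed in the table above could look roughly like the following; this is a hypothetical sketch rather than the exact code added to `grpo_config.py`.

```python
from dataclasses import dataclass


@dataclass
class GRPOConfigSketch:
    # "none": no scaling, "group": per-group std, "batch": std over the whole batch
    scale_rewards: str = "group"

    def __post_init__(self):
        allowed = ("none", "group", "batch")
        if self.scale_rewards not in allowed:
            raise ValueError(f"scale_rewards must be one of {allowed}, got {self.scale_rewards!r}")
```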
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
qgallouedec left a comment
very very nice!
@pramodith thank you for adding! In the paper_index example, shouldn't the `scale_rewards` be 'batch' instead of 'group'?
@myron yes you're right! My bad, created a PR to fix that.
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
What does this PR do?
Updates the `scale_rewards` parameter to accept three different valid values, `none`, `group`, and `batch`, to determine the level at which the std of the rewards is aggregated. With this feature, users can now train a model using the PPOLite technique outlined in https://arxiv.org/html/2508.08221v1 by setting `scale_rewards="batch"` and `loss_type="bnpo"` (for token-level loss aggregation).

Fixes #3920
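A minimal usage sketch under these assumptions (the model name, reward function, and dataset below are placeholders, not part of this PR):

```python
from trl import GRPOConfig, GRPOTrainer

# PPO Lite-style setup: batch-level reward scaling + token-level loss aggregation.
config = GRPOConfig(
    output_dir="grpo-ppolite",
    scale_rewards="batch",  # std computed over the whole batch (this PR)
    loss_type="bnpo",       # token-level loss aggregation
)

# trainer = GRPOTrainer(
#     model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
#     reward_funcs=my_reward_fn,           # placeholder reward function
#     args=config,
#     train_dataset=my_dataset,            # placeholder dataset
# )
# trainer.train()
```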
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.