🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch #3935
Conversation
Pull Request Overview
This PR enhances the GRPO (Group Relative Policy Optimization) trainer by allowing flexible reward scaling strategies. It changes the scale_rewards parameter from a boolean to a string that accepts three values: "none", "group", and "batch", enabling different levels of standard deviation computation for reward normalization. This supports implementing the PPOLite technique from the referenced paper.
- Updates the `scale_rewards` parameter to accept string values instead of a boolean
- Implements batch-level reward scaling using the global standard deviation across all processes (sketched below)
- Adds validation for the new parameter values and comprehensive test coverage
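For context, here is a minimal sketch of how the std used for scaling differs between the settings. This is not the trainer's actual code: it omits the cross-process gather used for the global batch std and assumes rewards arrive as a flat tensor with `num_generations` completions per prompt.

```python
import torch


def scale_advantages(rewards: torch.Tensor, num_generations: int, scale_rewards: str = "group") -> torch.Tensor:
    # rewards: flat tensor of shape (num_prompts * num_generations,)
    grouped = rewards.view(-1, num_generations)

    # GRPO always centers rewards by the per-group (per-prompt) mean.
    advantages = (grouped - grouped.mean(dim=1, keepdim=True)).view(-1)

    if scale_rewards == "group":
        # Per-group std: each prompt's completions are normalized independently.
        std = grouped.std(dim=1, keepdim=True).expand_as(grouped).reshape(-1)
        advantages = advantages / (std + 1e-4)
    elif scale_rewards == "batch":
        # One std over the whole batch (PPO Lite-style): a single, shared scale.
        advantages = advantages / (rewards.std() + 1e-4)
    elif scale_rewards != "none":
        raise ValueError(f"Unexpected scale_rewards value: {scale_rewards!r}")
    return advantages
```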
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| trl/trainer/grpo_trainer.py | Updates reward scaling logic to support group vs batch-level standard deviation computation |
| trl/trainer/grpo_config.py | Changes scale_rewards parameter type and adds validation for valid values |
| tests/test_grpo_trainer.py | Adds new test for batch-level reward scaling and updates existing test |
| docs/source/paper_index.md | Documents the PPOLite technique configuration example |
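The config-side validation listed in the table above could look roughly like the following; this is a hypothetical sketch rather than the exact code added to `grpo_config.py`.

```python
from dataclasses import dataclass


@dataclass
class GRPOConfigSketch:
    # "none": no scaling, "group": per-group std, "batch": std over the whole batch
    scale_rewards: str = "group"

    def __post_init__(self):
        allowed = ("none", "group", "batch")
        if self.scale_rewards not in allowed:
            raise ValueError(f"scale_rewards must be one of {allowed}, got {self.scale_rewards!r}")
```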
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
qgallouedec left a comment
very very nice!
@pramodith thank you for adding! In the paper_index example, shouldn't the `scale_rewards` be 'batch' instead of 'group'?
@myron yes you're right! My bad, created a PR to fix that.
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
What does this PR do?
Updates the `scale_rewards` parameter to accept three different valid values, `none`, `group`, and `batch`, to determine the level at which the std of the rewards is aggregated. With this feature, users can now train a model using the PPOLite technique outlined in https://arxiv.org/html/2508.08221v1 by setting `scale_rewards="batch"` and `loss_type="bnpo"` (for token-level loss aggregation).

Fixes #3920
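A minimal usage sketch under these assumptions (the model name, reward function, and dataset below are placeholders, not part of this PR):

```python
from trl import GRPOConfig, GRPOTrainer

# PPO Lite-style setup: batch-level reward scaling + token-level loss aggregation.
config = GRPOConfig(
    output_dir="grpo-ppolite",
    scale_rewards="batch",  # std computed over the whole batch (this PR)
    loss_type="bnpo",       # token-level loss aggregation
)

# trainer = GRPOTrainer(
#     model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
#     reward_funcs=my_reward_fn,           # placeholder reward function
#     args=config,
#     train_dataset=my_dataset,            # placeholder dataset
# )
# trainer.train()
```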
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.