
Conversation

@pramodith
Collaborator

What does this PR do?

Updates the scale_rewards parameter to accept three valid values, none, group, and batch, which determine the level at which the standard deviation of the rewards is aggregated.

With this feature, users can now train a model with the PPOLite technique outlined in https://arxiv.org/html/2508.08221v1 by setting scale_rewards='group' and loss_type='bnpo' (for token-level loss aggregation).
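As a rough illustration, a run along these lines might look like the sketch below (the model, reward function, and dataset are placeholders, not part of this PR; note that the exact scale_rewards value recommended for Lite PPO is revisited later in this thread):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Sketch of a Lite PPO-style run following the description above; the model,
# reward function, and dataset are illustrative placeholders, not part of this PR.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

config = GRPOConfig(
    output_dir="grpo-lite-ppo",
    scale_rewards="group",  # accepts "none", "group", or "batch" after this PR
    loss_type="bnpo",       # token-level loss aggregation
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```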

Fixes #3920

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@pramodith pramodith changed the title from Initial commit. to [GRPO]: Scale rewards by Std of Batch on Aug 21, 2025
@pramodith pramodith marked this pull request as ready for review August 21, 2025 13:43
@pramodith pramodith requested a review from Copilot August 21, 2025 14:08
Contributor

Copilot AI left a comment

Pull Request Overview

This PR enhances the GRPO (Group Relative Policy Optimization) trainer by allowing flexible reward scaling strategies. It changes the scale_rewards parameter from a boolean to a string that accepts three values: "none", "group", and "batch", enabling different levels of standard deviation computation for reward normalization. This supports implementing the PPOLite technique from the referenced paper.

  • Updates scale_rewards parameter to accept string values instead of boolean
  • Implements batch-level reward scaling using a global standard deviation across all processes (see the sketch after this list)
  • Adds validation for the new parameter values and comprehensive test coverage
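
A minimal sketch of how the group- and batch-level modes differ (illustrative only, not the trainer's actual implementation; in the distributed case, the batch statistic would be computed after gathering rewards across processes):

```python
import torch

# Illustrative sketch of the two scaling modes (not the trainer's exact code).
# rewards: one row per prompt, one column per completion in that prompt's group.
rewards = torch.tensor([[0.0, 1.0, 2.0, 3.0],
                        [10.0, 10.5, 11.0, 11.5]])

# Advantages are centered within each group in every mode.
advantages = rewards - rewards.mean(dim=1, keepdim=True)

# scale_rewards="group": divide by each group's own standard deviation.
group_scaled = advantages / (rewards.std(dim=1, keepdim=True) + 1e-4)

# scale_rewards="batch": divide by one standard deviation computed over the
# whole batch; with multiple processes, rewards would first be gathered.
batch_scaled = advantages / (rewards.std() + 1e-4)

# scale_rewards="none": keep the centered advantages unscaled.
print(group_scaled, batch_scaled, advantages, sep="\n")
```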

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File: Description
trl/trainer/grpo_trainer.py: Updates the reward scaling logic to support group- vs. batch-level standard deviation computation
trl/trainer/grpo_config.py: Changes the scale_rewards parameter type and adds validation for the accepted values
tests/test_grpo_trainer.py: Adds a new test for batch-level reward scaling and updates an existing test
docs/source/paper_index.md: Documents the PPOLite technique configuration example


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@qgallouedec qgallouedec left a comment


very very nice!

@qgallouedec qgallouedec changed the title from [GRPO]: Scale rewards by Std of Batch to 🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch on Aug 21, 2025
@qgallouedec qgallouedec merged commit fe44806 into huggingface:main Aug 21, 2025
10 checks passed
@myron

myron commented Aug 25, 2025

@pramodith thank you for adding!

In the paper_index example, shouldn't scale_rewards be 'batch' instead of 'group'?

https://github.com/huggingface/trl/blob/main/docs/source/paper_index.md#part-i-tricks-or-traps-a-deep-dive-into-rl-for-llm-reasoning-lite-ppo

@pramodith
Collaborator Author

@myron yes you're right! My bad, created a PR to fix that.
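
For reference, a corrected Lite PPO configuration would then presumably look like the following sketch (the output directory is a placeholder):

```python
from trl import GRPOConfig

# Corrected Lite PPO setup per the discussion above (a sketch, not verbatim docs):
# batch-level reward scaling with token-level (BNPO) loss aggregation.
config = GRPOConfig(
    output_dir="grpo-lite-ppo",  # placeholder output directory
    scale_rewards="batch",
    loss_type="bnpo",
)
```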

LuisVasquezBSC pushed a commit to langtech-bsc/trl that referenced this pull request Aug 28, 2025
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
SamY724 pushed a commit to SamY724/trl that referenced this pull request Sep 6, 2025
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
@qgallouedec qgallouedec mentioned this pull request Oct 30, 2025


Development

Successfully merging this pull request may close these issues.

Does the trl library support the Lite PPO algorithm?
