Fix GRPO with replay buffer by inserting images in the prompt #4391
Merged

albertvillanova merged 7 commits into huggingface:main from albertvillanova:fix-grpo-with-replay-buffer on Oct 31, 2025
Commits (7):

- 6ed80fb Fix trainer in GRPO with replay buffer test (albertvillanova)
- d032423 Insert images in the prompt for GRPOWithReplayBufferTrainer (albertvillanova)
- 4266550 Slice std_rewards (albertvillanova)
- 148a31f Merge branch 'main' into fix-grpo-with-replay-buffer (albertvillanova)
- d71921b remove commented code + fix num_images output (qgallouedec)
- bd2ca7f fix keyword argument (qgallouedec)
- 92fbd37 Merge branch 'main' into fix-grpo-with-replay-buffer (albertvillanova)
Conversations
This last line is not in GRPOTrainer: is it necessary? If so, shouldn't we implement it in GRPOTrainer as well?
@pramodith if you have some time to check.
Will take a look a bit later this evening!
We need the sliced `std_rewards` in this trainer because we decide whether a specific example should be added to the replay buffer, or sampled from the buffer, based on the `std_reward` of that specific rollout. Since each GPU sees a unique batch of data, we need to perform the buffer lookup and update based only on the slice residing on that GPU. `GRPOTrainer` doesn't need the std after the advantage scores are computed, so it can be discarded there.
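As an illustration of the per-process slicing described above, here is a minimal sketch; `per_device_batch` and the contiguous-chunk layout are assumptions for illustration, not TRL's exact attributes:

```python
import torch

def slice_for_process(std_rewards: torch.Tensor, process_index: int, per_device_batch: int) -> torch.Tensor:
    # Assumed layout: gathered rewards are ordered by process, so each GPU
    # owns a contiguous chunk. Keeping only that chunk makes replay-buffer
    # lookups and updates local to the batch this GPU generated.
    process_slice = slice(
        process_index * per_device_batch,
        (process_index + 1) * per_device_batch,
    )
    return std_rewards[process_slice]
```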
This shows how `std_rewards` is used for updating the replay buffer:

trl/experimental/grpo_with_replay_buffer/grpo_with_replay_buffer_trainer.py, lines 407 to 411 at 1c2322e
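The referenced lines are not reproduced in this page; as a hedged sketch of the idea only (the buffer interface and the nonzero-std criterion below are hypothetical, not TRL's actual API):

```python
import torch

def update_buffer(buffer: list, groups: list, std_rewards: torch.Tensor) -> None:
    # Hypothetical criterion: groups whose rewards have nonzero std carry a
    # learning signal and are stored for later replay; zero-std groups would
    # instead be candidates for replacement by a sample from the buffer.
    for group, std in zip(groups, std_rewards.tolist()):
        if std > 0:
            buffer.append(group)
```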
This entire removed block of code should remain in `grpo_with_replay_buffer_trainer.py`: we always need the group-level std to determine what goes into the replay buffer.
Thanks for the explanation about `std_rewards = std_rewards[process_slice]`, @pramodith. 🤗

Regarding the block of lines just before, I removed the condition `if std_rewards is None` because I think it is always False. Just a few lines above, we have this code: https://github.com/albertvillanova/trl/blob/d0324230761e7860646f4d15d7ff8beb433103ac/trl/experimental/grpo_with_replay_buffer/grpo_with_replay_buffer_trainer.py#L250-L260

Therefore, `std_rewards` can't be None if I understand correctly. It should always be a `torch.Tensor`.
Hmmm, yeah, you're right, but there's a bug here: the replay buffer requires the `std` to be computed over the group. I'll fix that in a subsequent PR; getting rid of that block in this PR is fine.
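For context, "computed over the group" here means the std of rewards within each group of completions generated for the same prompt, rather than over the whole local batch. A minimal sketch under that reading (shapes and `num_generations` are illustrative):

```python
import torch

num_generations = 4                          # completions per prompt (illustrative)
rewards = torch.randn(2 * num_generations)   # flat rewards: 2 prompts x 4 generations
# Std within each prompt's group, broadcast back to one value per rollout.
std_rewards = rewards.view(-1, num_generations).std(dim=1).repeat_interleave(num_generations)
print(std_rewards.shape)  # torch.Size([8])
```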
I re-added just that line: 4266550
😉