Fix GRPO with replay buffer by inserting images in the prompt #4391
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
if std_rewards is None:
    std_rewards = rewards.view(-1, self.num_generations).std(dim=1)
    std_rewards = std_rewards.repeat_interleave(self.num_generations, dim=0)
std_rewards = std_rewards[process_slice] if std_rewards is not None else None
This last line is not in GRPOTrainer: is it necessary? If so, shouldn't we implement it in GRPOTrainer as well?
std_rewards = std_rewards[process_slice]
@pramodith if you've some time to check
Will take a look a bit later this evening!
We need the sliced std_rewards in this trainer because we decide whether a specific example should be added to the replay buffer, or sampled from it, based on the std_reward of that specific rollout. Since each GPU sees a unique batch of data, we need to perform the buffer lookup and update based only on the slice residing on that GPU.
GRPOTrainer doesn't need the std after the advantage scores are computed, so it can be discarded in GRPOTrainer.
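For illustration, here's a minimal sketch of that slicing with toy numbers. The process_slice construction is an assumption mirroring how GRPOTrainer splits a batch across processes; names and values are illustrative only:

import torch

# Toy setup: 2 groups x 4 generations = 8 rollouts, split across 2 processes.
num_generations = 4
rewards = torch.tensor([0.1, 0.9, 0.4, 0.6, 0.5, 0.5, 0.5, 0.5])

# Group-level std, repeated so every rollout carries its group's std.
std_rewards = rewards.view(-1, num_generations).std(dim=1)
std_rewards = std_rewards.repeat_interleave(num_generations, dim=0)  # shape (8,)

# Hypothetical per-process slice: process 0 keeps rows 0..3, process 1 keeps rows 4..7.
process_index, per_process_size = 0, 4
process_slice = slice(process_index * per_process_size, (process_index + 1) * per_process_size)

# Only the local slice is used to decide whether to add to / sample from the replay buffer.
local_std = std_rewards[process_slice]  # std of the first group, repeated 4 times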
This shows how std_rewards is used for updating the replay buffer
trl/trl/experimental/grpo_with_replay_buffer/grpo_with_replay_buffer_trainer.py
Lines 407 to 411 in 1c2322e
if groups_with_variance.any():
    # Calculate replay buffer scores for groups with variance
    replay_buffer_scores = (group_advantages.abs() * group_std_rewards).sum(dim=-1)[groups_with_variance]
    # Add all groups to replay buffer at once (batch operation)
    self.replay_buffer.add(replay_buffer_scores.tolist(), buffered_outputs)
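To make the scoring concrete, here is a toy computation of those scores with made-up tensors; shapes and values are illustrative only, not taken from the trainer:

import torch

# 3 groups x 4 generations; the middle group has zero reward variance.
group_advantages = torch.tensor([[ 1.0, -1.0,  0.5, -0.5],
                                 [ 0.0,  0.0,  0.0,  0.0],
                                 [ 2.0, -2.0,  1.0, -1.0]])
group_std_rewards = torch.tensor([[0.5], [0.0], [1.0]]).expand(-1, 4)

# Groups with zero std carry no learning signal, so they are skipped.
groups_with_variance = group_std_rewards[:, 0] > 0

# Score per group: sum over the group of |advantage| * group std.
replay_buffer_scores = (group_advantages.abs() * group_std_rewards).sum(dim=-1)[groups_with_variance]
# -> tensor([1.5000, 6.0000]): higher-variance, higher-|advantage| groups score higher.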
This entire block of removed code should remain in grpo_with_replay_buffer_trainer.py; we always need the group-level std to determine what goes into the replay buffer.
Thanks for the explanation about std_rewards = std_rewards[process_slice], @pramodith. 🤗
Regarding the block of lines just before, I removed the condition if std_rewards is None because I think it is always False. Just a few lines above, we have this code: https://github.com/albertvillanova/trl/blob/d0324230761e7860646f4d15d7ff8beb433103ac/trl/experimental/grpo_with_replay_buffer/grpo_with_replay_buffer_trainer.py#L250-L260
if self.scale_rewards in ["group", "none"]:
    # If self.scale_rewards = "none", we'll still log group level std
    std_rewards = rewards.view(-1, self.num_generations).std(dim=1)
    std_rewards = std_rewards.repeat_interleave(self.num_generations, dim=0)
elif self.scale_rewards == "batch":
    # Compute global std
    std_rewards = rewards.std().expand_as(rewards)
else:
    raise ValueError(
        f"Invalid value for scale_rewards: {self.scale_rewards}. Must be one of 'batch', 'group', or 'none'."
    )

Therefore, std_rewards can't be None, if I understand correctly. It should always be a torch.Tensor.
Hmmm, yeah, you're right, but there's a bug here: the replay buffer requires the std to be computed over the group. I'll fix that in a subsequent PR; getting rid of that block in this PR is fine.
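One possible shape of that fix (a sketch only, not the actual follow-up PR): always compute a dedicated group-level std for the replay buffer, independent of the std used to scale the advantages. Function name and signature are hypothetical.

import torch

def compute_stds(rewards: torch.Tensor, num_generations: int, scale_rewards: str):
    # Illustrative sketch: the replay buffer always gets a per-group std,
    # even when advantages are scaled with a batch-level std.
    group_std = rewards.view(-1, num_generations).std(dim=1)
    group_std = group_std.repeat_interleave(num_generations, dim=0)

    # Std used for advantage scaling follows the configured mode.
    if scale_rewards in ("group", "none"):
        scaling_std = group_std
    elif scale_rewards == "batch":
        scaling_std = rewards.std().expand_as(rewards)
    else:
        raise ValueError(f"Invalid value for scale_rewards: {scale_rewards}.")
    return scaling_std, group_std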
I re-added just that line: 4266550
😉
The failing test will be fixed after the merge of:
qgallouedec left a comment:
ok lgtm!
Mmh wait, we've
That is fixed in my previous PR. See my comment above, @qgallouedec.
@qgallouedec, there were 2 different bugs:
oh my bad, thanks for fixing it
Fix GRPO with replay buffer by inserting images in the prompt. Additionally, fix the CI test test_training_with_replay_buffer.

Follow-up to:
_generate in GRPO/RLOO: Insert images in the prompt #4155

Currently, GRPO with Replay Buffer raises an error: https://github.com/huggingface/trl/actions/runs/18940392458/job/54077463859

Stacktrace: