🧺 [1/N] Refactor `_generate` in GRPO/RLOO: list of ints instead of tensors #4146

qgallouedec · 2025-09-26T02:49:46Z

This PR is part of a series of PRs that refactor the generation part of GRPO/RLOO to allow easier customization and, ultimately, tool calling.

🧺 [2/N] Refactor _generate in GRPO/RLOO: Use prompt_ids from generation #4152
🧺 [3/N] Refactor _generate in GRPO/RLOO: Rely on generator for prompt truncation #4153
🧺 [4/N] Refactor _generate in GRPO/RLOO: Move forward_kwargs outside generation method #4154
🧺 [5/N] Refactor _generate in GRPO/RLOO: Insert images in the prompt #4155

The idea of this PR is to make _generate return lists of ints instead of tensors. This will help a lot when implementing tool calling.

Several modifications:

truncate_with_protected_tokens: instead of operating on 2D tensors (ids and mask), it now operates on a single sequence of token ids:

before

>>> ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> mask = torch.tensor([[1, 1, 1], [0, 0, 1]])
>>> truncate_with_protected_tokens(ids, mask, target_length=2, protected_tokens=[])
tensor([[2, 3], [5, 6]]), tensor([[1, 1], [0, 1]])

after

>>> ids = [[1, 2, 3], [4, 5, 6]]
>>> [truncate_with_protected_tokens(seq, target_length=2, protected_tokens=[]) for seq in ids]
[[2, 3], [5, 6]]
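
For readers who want to see the per-sequence behavior spelled out, here is a minimal sketch of left truncation that preserves protected tokens. It is not the TRL implementation: the name `truncate_keep_protected` is hypothetical, and it assumes (consistent with the before/after output above) that tokens are dropped from the left and that protected tokens never count against the budget of regular tokens.

```python
def truncate_keep_protected(ids, target_length, protected_tokens):
    """Hypothetical sketch: keep at most `target_length` tokens from `ids`.

    Protected tokens are always kept; the remaining budget is filled with
    the right-most non-protected tokens. Original order is preserved.
    """
    protected = set(protected_tokens)
    num_protected = sum(tok in protected for tok in ids)
    if num_protected > target_length:
        raise ValueError(
            f"{num_protected} protected tokens do not fit in target_length={target_length}"
        )

    budget = target_length - num_protected  # slots left for regular tokens
    kept = []
    # Walk from the right so the most recent regular tokens win the budget.
    for tok in reversed(ids):
        if tok in protected:
            kept.append(tok)
        elif budget > 0:
            kept.append(tok)
            budget -= 1
    kept.reverse()
    return kept


batch = [[1, 2, 3], [4, 5, 6]]
truncated = [truncate_keep_protected(seq, 2, []) for seq in batch]
```

Because the function takes one sequence at a time, batches are handled with a plain list comprehension, and sequences of different lengths need no padding or mask.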

_generate now returns lists of ids instead of a tensor + mask
Conversion to tensors is handled in _generate_and_score_completions
The generation part is moved to a function _generate_single_turn, which is called by _generate

qgallouedec · 2025-09-26T16:09:17Z

trl/trainer/grpo_trainer.py

-            **kwargs,
-        )
-        prompt_inputs = super()._prepare_inputs(prompt_inputs)
+        prompt_inputs = self.processing_class(text=prompts_text, add_special_tokens=False, **kwargs)


the function must now return a list of ints, so we must remove padding
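
As an illustration of what removing padding means here, the sketch below turns a padded batch plus its attention mask into ragged lists of token ids. It uses plain nested lists so it runs without any dependency; `strip_padding` is a hypothetical helper name, and the trainer itself works on tensors produced by the processing class.

```python
def strip_padding(padded_ids, attention_mask):
    """Hypothetical sketch: drop every position whose mask value is 0.

    Works for left or right padding, since only masked-in tokens are kept.
    """
    return [
        [tok for tok, keep in zip(row, mask_row) if keep]
        for row, mask_row in zip(padded_ids, attention_mask)
    ]


# Left-padded batch with pad id 0 (values are illustrative only).
padded_ids = [[0, 0, 7, 8], [5, 6, 7, 8]]
attention_mask = [[0, 0, 1, 1], [1, 1, 1, 1]]
prompt_ids = strip_padding(padded_ids, attention_mask)
```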

qgallouedec · 2025-09-26T16:24:26Z

trl/trainer/grpo_trainer.py

-            prompt_mask,
-            completion_mask,


prompt and completion masks are later inferred from the sequence lengths
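
A minimal sketch of how a mask can be inferred from sequence lengths alone: pad the ragged lists to a rectangle and mark real tokens with 1. The helper name `pad_and_mask` and its parameters are assumptions made for this example; the trainer would build tensors from such rectangles rather than nested lists.

```python
def pad_and_mask(sequences, pad_id, padding_side="right"):
    """Hypothetical sketch: pad ragged id lists and derive the mask from lengths."""
    max_len = max((len(seq) for seq in sequences), default=0)
    ids, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "left":
            ids.append([pad_id] * n_pad + list(seq))
            mask.append([0] * n_pad + [1] * len(seq))
        else:
            ids.append(list(seq) + [pad_id] * n_pad)
            mask.append([1] * len(seq) + [0] * n_pad)
    return ids, mask


# Prompts are shown left-padded and completions right-padded, as an example layout.
prompt_ids, prompt_mask = pad_and_mask([[7, 8], [5, 6, 7]], pad_id=0, padding_side="left")
completion_ids, completion_mask = pad_and_mask([[1], [2, 3]], pad_id=0)
```

Since the mask is a pure function of the lengths, returning it alongside the ids would be redundant, which is why it can be dropped from the return values.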

albertvillanova

Thanks.

albertvillanova · 2025-09-29T09:28:34Z

trl/trainer/utils.py

+        sequences (`list[int]`):
+            Input sequence of token IDs.


The sequences name in the docstring is not aligned with the ids name in the signature.

Additionally, before it accepted batch_size sequences (within the tensor) and now it accepts a single sequence (list[int]). Isn't this breaking something? Some tests should be failing because of the new behavior?

> Some tests should be failing because of the new behavior?

Yes, the tests have been updated as well; see TruncateWithProtectedTokensTester.

> The sequences name in the docstring is not aligned with the ids name in the signature.

Thanks! Fixed in c570fb0

trl/trainer/utils.py

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

qgallouedec and others added 30 commits September 19, 2025 20:57

Refactor image handling: replace image_split_sizes with `image_grid…

552e899

…_thw` in GRPO and RLOO trainers; update `split_pixel_values_by_grid` to use `image_grid_thw`

simpler

449ef07

gfpo

c8933aa

multi-image grpo

229c554

log with wandb

3ca6ad5

no vlm reward models

dcf4b92

rloo

30ad7ca

gfpo

86cc30b

fix

088897b

test peft

d2adc63

fix gfpo

f4c82bf

rloo test

1257796

peft rloo

099a39b

oops

529add6

update test

fc6b11f

generate method

ae1f497

debug

f998432

skip failing test

fa73876

Merge branch 'main' into drop-image_split_sizes

52d8bd9

Merge branch 'drop-image_split_sizes' into multi-image-support

dfc0d38

test fixed!

fc52e68

Merge branch 'multi-image-support' into generate-method

4d12aeb

gfpo

4fc2b5b

rm vllm

b628744

fix doc

d3a769f

Merge branch 'main' into drop-image_split_sizes

e17ec42

Merge branch 'drop-image_split_sizes' into multi-image-support

efbb03a

Merge branch 'main' into multi-image-support

562c662

Merge branch 'main' into multi-image-support

485781c

update layers to ignore

05270f8

qgallouedec commented Sep 26, 2025

qgallouedec added 3 commits September 26, 2025 16:12

consistent naming

8766fa5

better

236b78b

simplify a bit + comment

9da4830

qgallouedec commented Sep 26, 2025

qgallouedec added 2 commits September 26, 2025 18:05

another one

b3bd0b0

remove pad token removal

8d34d54

qgallouedec changed the title ~~🧺 Refactor _generate in GRPO/RLOO~~ 🧺 [1/N] Refactor _generate in GRPO/RLOO Sep 26, 2025

qgallouedec changed the title ~~🧺 [1/N] Refactor _generate in GRPO/RLOO~~ 🧺 [1/N] Refactor _generate in GRPO/RLOO: list of ints instead of tensors Sep 26, 2025

qgallouedec and others added 4 commits September 26, 2025 23:46

rloo + doc

55a2480

gfpo

c5064d6

Merge branch 'main' into refactor_generate

effb41b

Merge branch 'main' into refactor_generate

e82bfb4

albertvillanova approved these changes Sep 29, 2025

qgallouedec and others added 7 commits September 30, 2025 11:46

Merge branch 'main' into refactor_generate

3a0ba92

Apply suggestion from @albertvillanova

c5fa2df

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

fix docstring

c570fb0

Apply suggestion from @albertvillanova

2f70440

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>

style

80b7403

Merge branch 'main' into refactor_generate

84f400c

Merge branch 'main' into refactor_generate

c72f54a

qgallouedec merged commit ea66a9e into main Sep 30, 2025
11 of 12 checks passed

qgallouedec deleted the refactor_generate branch September 30, 2025 22:22

4 participants
