Add Qwen3VLGRPOTrainer for Qwen3-VL GRPO training #4529
+150
−2
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR adds a
Qwen3VLGRPOTrainer,.Motivation
When I was training Qwen3-VL using
GRPOTrainer, I encountered the errorIndexError: metadata = video_metadata[index].While debugging, I found that the original
GRPOTrainerclass, when given video inputs, produced messages where video fields andfpsvalues appeared inside"text"-type chunks withfps=Noneandvideo=None, which caused the bug.To fix this, I created a new class that normalizes the messages before generation and prevents this error.
What this PR changes
Adds a new class:
Qwen3VLVideoGRPOTrainer(GRPOTrainer)ingrpo_trainer.py._generate_single_turnonly; all GRPO/DAPO logic (rewards, advantages, clipping, logging, etc.) remains unchanged.{"type": "video", "video": ..., "fps": ...}{"type": "text", "text": ...}self.processing_class.apply_chat_template(...)directly on the Qwen3-VL conversation.model.generate(..., generation_config=self.generation_config).prompt_idsandcompletion_idsusing the prompt length, and returns them in the format expected by the GRPO pipeline.transformers.generatepath and explicitly errors ifuse_vllm=Trueoruse_transformers_paged=True, to keep behavior simple and predictable for this first iteration.Exports
Qwen3VLGRPOTrainer:src/trl/trainers/__init__.py.src/trl/__init__.py.This design keeps the change isolated and backward compatible: existing users of
GRPOTrainerare unaffected, while Qwen3-VL users can opt into the specialized trainer.Usage
Example (simplified):