[V1][Model] Add V1 support for Qwen2-VL #11668
base: main
Conversation
Signed-off-by: imkero <kerorek@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
@@ -791,6 +791,7 @@ def _parse_video_data(

class Qwen2VLMultiModalProcessor(BaseMultiModalProcessor):
    _placeholder_map: Optional[dict[str, list[int]]] = None
I think we should initialize this in the init method to avoid confusing it with a static class variable.
Apart from this, the processor-related changes in the model file LGTM.
Hello @imkero! Much appreciated that you made this PR! The reason I haven't spent too much time on Qwen2-VL is that I want to see if there's a way to move M-RoPE inside the model file for Qwen2-VL, since it is so specific to this model. You would also need to change the implementation accordingly. Feel free to take changes from here into this PR.
if not self._placeholder_map:
    # NOTE: Only Qwen2VLProcessor in transformers 4.47.0 has
    # image_token and video_token registered
    encode_fn = hf_processor.tokenizer.encode
    self._placeholder_map = {
        "image": encode_fn(hf_processor.image_token),
        "video": encode_fn(hf_processor.video_token),
    }
placeholder = self._placeholder_map
Also, we can set this at initialization time.
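For illustration, a minimal sketch of what initialization-time setup could look like (the `_get_hf_processor()` accessor and the constructor signature are assumptions, not necessarily the actual base-class API):

```python
class Qwen2VLMultiModalProcessor(BaseMultiModalProcessor):

    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        # Build the placeholder map once per instance instead of lazily
        # filling a class-level attribute during processing.
        hf_processor = self._get_hf_processor()  # assumed accessor
        encode_fn = hf_processor.tokenizer.encode
        self._placeholder_map: dict[str, list[int]] = {
            "image": encode_fn(hf_processor.image_token),
            "video": encode_fn(hf_processor.video_token),
        }
```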
encoder_outputs.append((
    encoder_output[0][start_idx:end_idx],  # embedding tensor
    encoder_output[1],  # modality
My thought is we don't necessarily need to have the modality key here.
We can leverage the fact that any two `mm_position`s from any modalities cannot possibly overlap, and now that `merge_multimodal_embeddings` (vllm/vllm/model_executor/models/utils.py, lines 408 to 423 in 11d8a09)
def merge_multimodal_embeddings(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    multimodal_embeddings: NestedTensors,
    placeholder_token_id: Union[int, List[int]],
) -> torch.Tensor:
    """
    Merge ``multimodal_embeddings`` into ``inputs_embeds`` by overwriting the
    positions in ``inputs_embeds`` corresponding to placeholder tokens in
    ``input_ids``.

    ``placeholder_token_id`` can be a list of token ids (e.g, token ids
    of img_start, img_break, and img_end tokens) when needed: This means
    the order of these tokens in the ``input_ids`` MUST MATCH the order of
    their embeddings in ``multimodal_embeddings`` since we need to
    slice-merge instead of individually scattering.
can apply the embedding replacement based on a list of token ids (so we can simply have `[self.config.image_token_id, self.config.video_token_id]` here).
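A hedged sketch of how the model's embedding merge could then look without any modality tag (method body only; class context and imports omitted, and the config attribute names are assumptions):

```python
def get_input_embeddings(
    self,
    input_ids: torch.Tensor,
    multimodal_embeddings: Optional[NestedTensors] = None,
) -> torch.Tensor:
    inputs_embeds = self.language_model.get_input_embeddings(input_ids)
    if multimodal_embeddings is not None:
        # Image and video placeholder ranges never overlap, so a single
        # merge keyed on both token ids is enough -- no per-output modality
        # tag, provided the embeddings arrive in prompt order.
        inputs_embeds = merge_multimodal_embeddings(
            input_ids,
            inputs_embeds,
            multimodal_embeddings,
            [self.config.image_token_id, self.config.video_token_id],
        )
    return inputs_embeds
```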
Therefore, all we need to do should be just sorting the `mm_position`s and their corresponding `mm_inputs` in the following code (which also needs to be modified to support the video modality for Qwen2-VL in this PR); a rough sorting sketch follows the excerpt.
Lines 51 to 59 in 11d8a09
# Multi-modal input metadata.
mm_positions = self.inputs.multi_modal_placeholders
if mm_positions:
    # FIXME(woosuk): Support other modalities.
    self.mm_positions = mm_positions.get("image", [])
else:
    self.mm_positions = []
# Output of the mm input mapper (e.g., image tensors).
self.mm_inputs: List[MultiModalKwargs] = []
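A rough sketch of that sorting, assuming each entry of `multi_modal_placeholders` is a `PlaceholderRange`-style dict with an `offset` key, and that `mm_inputs` is reordered consistently when it is populated:

```python
# Multi-modal input metadata, flattened across modalities.
mm_positions = self.inputs.multi_modal_placeholders
if mm_positions:
    # Collect the placeholder ranges of every modality (image, video, ...)
    # and keep them in prompt order, so the i-th position lines up with
    # the i-th entry of self.mm_inputs.
    all_positions = [
        pos for positions in mm_positions.values() for pos in positions
    ]
    self.mm_positions = sorted(all_positions, key=lambda p: p["offset"])
else:
    self.mm_positions = []
```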
WDYT?
On second thought, let me actually work on this design for llava-onevision too.
What's changed:

- M-RoPE support with `torch.compile` (M-RoPE uses a 2d position tensor which differs from common RoPE, and they share the same impl in Qwen2 LM's `forward` fn; see the toy shape sketch below)
- `profile_run` launch for Qwen2-VL in `gpu_model_runner`
- Encoder outputs passed as `(embeddings: torch.Tensor, modality: str)` in `gpu_model_runner` for Qwen2-VL
- Cached `image_token` and `video_token` encoding in Qwen2-VL's preprocessing for better performance

This PR should make Qwen2-VL work in V1 with chunked prefill and prefix caching enabled.
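For context on the first item, a toy illustration (not vLLM code) of why M-RoPE interacts differently with `torch.compile`: standard RoPE takes a 1-D position tensor, while Qwen2-VL's M-RoPE stacks temporal/height/width indices into a 2-D one.

```python
import torch

num_tokens = 8

# Standard RoPE: one scalar position per token -> shape (num_tokens,)
rope_positions = torch.arange(num_tokens)

# M-RoPE (Qwen2-VL): three rows of positions per token -- temporal,
# height, and width. For pure-text tokens all three rows are equal.
mrope_positions = torch.stack([
    torch.arange(num_tokens),  # temporal
    torch.arange(num_tokens),  # height
    torch.arange(num_tokens),  # width
])  # shape (3, num_tokens)

assert rope_positions.shape == (num_tokens,)
assert mrope_positions.shape == (3, num_tokens)
```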