
Conversation

@chenxi-yang (Contributor) commented Aug 1, 2025

Summary:
Previously, _merge_multimodal_embeddings used inputs_embeds[is_multimodal] = flattened to merge the multimodal (MM) embeddings. This index_put operation calls nonzero() in PyTorch, which forces an additional D2H shape-check sync point.
This diff uses masked_scatter_ to bypass the D2H sync point. The latency of the merge drops from 35ms to 0.08ms.
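
A minimal sketch of the change (illustrative shapes and names, not the exact vLLM code):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden = 16

inputs_embeds = torch.zeros(8, hidden, device=device)   # text embeddings for 8 tokens
is_multimodal = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool, device=device)
flattened = torch.randn(4, hidden, device=device)        # embeddings for the 4 MM tokens

# Old path: boolean-mask index_put, which calls nonzero() under the hood and
# syncs device-to-host to learn how many positions are selected.
# inputs_embeds[is_multimodal] = flattened

# New path: masked_scatter_ consumes the mask directly, so no nonzero()/sync.
inputs_embeds.masked_scatter_(is_multimodal.unsqueeze(-1), flattened)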

Test Plan:
E2E Results:
Before:

QPS:                 0.91
Avg latency:         8.542s
Avg TTFT (client):   4186.02ms
P50 TTFT (client):   4589.01ms
P99 TTFT (client):   6897.92ms
Avg TTIT (client):   21.78ms
P50 TTIT (client):   19.99ms
P99 TTIT (client):   41.25ms
Avg TTFT (server):   5657.19ms
Avg TTIT (server):   130.71ms
Avg prefill len:     22284.20 tokens
P50 prefill len:     22284.00 tokens
P99 prefill len:     22291.00 tokens
Avg decode len:      200.00 tokens
P50 decode len:      200.00 tokens
P99 decode len:      200.00 tokens

After:

QPS:                 0.94
Avg latency:         8.456s
Avg TTFT (client):   4089.09ms
P50 TTFT (client):   4510.47ms
P99 TTFT (client):   6902.79ms
Avg TTIT (client):   21.83ms
P50 TTIT (client):   19.89ms
P99 TTIT (client):   41.02ms
Avg TTFT (server):   3808.06ms
Avg TTIT (server):   79.76ms
Avg prefill len:     22284.26 tokens
P50 prefill len:     22284.00 tokens
P99 prefill len:     22291.00 tokens
Avg decode len:      200.00 tokens
P50 decode len:      200.00 tokens
P99 decode len:      200.00 tokens

Rollback Plan:

Differential Revision: D79405697

@github-actions bot commented Aug 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D79405697

@mergify bot added the v1 label Aug 1, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant performance optimization by replacing index_put with masked_scatter_ for merging multimodal embeddings, which avoids a D2H sync point. The latency improvement is impressive.

I've identified two issues:

  1. In vllm/model_executor/models/utils.py, the exception handling for the new masked_scatter_ operation can obscure the original error traceback, making debugging more difficult. I've suggested a change to preserve the original exception.
  2. There's an unrelated change in vllm/v1/worker/gpu_model_runner.py that modifies how token IDs are copied from GPU to CPU. This should be moved to a separate pull request for clarity and proper review.

Once these points are addressed, the main change looks good to go.

Comment on lines 1708 to 1713
critical

This change appears to be unrelated to the PR's title and description, which are about optimizing multimodal embeddings merging. This change modifies the GPU-to-CPU data transfer for sampled_token_ids.

Mixing unrelated changes in a single PR makes the review process difficult and the commit history hard to follow. Please revert this change and submit it as a separate pull request with its own title, description, and justification.

Additionally, the new implementation using torch.cuda.Event and synchronize() appears to be a synchronous operation, similar to the original .tolist(). The performance benefits are not immediately obvious and would need to be explained and benchmarked in its own PR.

Suggested change:
- pinned = torch.empty_like(sampled_token_ids, device='cpu', pin_memory=True)
- transfer_event = torch.cuda.Event()
- pinned.copy_(sampled_token_ids, non_blocking=True)
- transfer_event.record()
- transfer_event.synchronize()
- valid_sampled_token_ids = pinned.tolist()
+ valid_sampled_token_ids = sampled_token_ids.tolist()

Comment on lines 433 to 435
high

Raising a new ValueError here obscures the original traceback from the RuntimeError. This can make it more difficult to debug issues that are not related to a shape mismatch but still cause a RuntimeError (e.g., type or device mismatches).

To preserve the full context of the error, it's better to re-raise the original exception or chain it.

Suggested change:
- else:
-     raise ValueError("Error during masked scatter operation:", e)
+ else:
+     # Re-raise the original exception to preserve the traceback
+     # for easier debugging of unexpected errors.
+     raise e
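
For reference, a minimal sketch of the chained-exception alternative (hypothetical helper, not the actual utils.py code):

import torch

def merge_with_masked_scatter(inputs_embeds: torch.Tensor,
                              is_multimodal: torch.Tensor,
                              flattened: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: if the failure looks like a size mismatch, raise a
    # clearer ValueError but chain it with "from e" so the original traceback
    # survives; otherwise re-raise the RuntimeError untouched.
    try:
        return inputs_embeds.masked_scatter_(is_multimodal.unsqueeze(-1), flattened)
    except RuntimeError as e:
        if flattened.shape[0] != int(is_multimodal.sum()):
            raise ValueError("Error during masked scatter operation") from e
        raise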

@ywang96 (Member) left a comment

Thanks for your contribution! I left some comments. Can you run an end-to-end benchmark and show the overall improvement? Thanks!

Comment on lines 422 to 424
Member:

Thanks for making this change!

I was also hoping PyTorch could address this (pytorch/pytorch#57515), but let's make this change on our side.

Contributor:

@ywang96 With the upgrade to PyTorch 2.9, this should be handled correctly for index_put with a boolean mask as long as the mask is on CPU, which at least used to be the case here. See pytorch/pytorch#156384 for more info.
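
A rough sketch of that pattern (illustrative only; the no-sync behavior with a CPU mask is assumed based on pytorch/pytorch#156384 and PyTorch >= 2.9):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
inputs_embeds = torch.zeros(8, 16, device=device)
flattened = torch.randn(4, 16, device=device)

# The boolean mask stays on the CPU, so the nonzero() triggered by the
# index_put below runs on the host and (per the linked issue) no
# device-to-host shape sync should be needed.
is_multimodal = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)  # CPU mask

inputs_embeds[is_multimodal] = flattened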

Comment on lines 1708 to 1713
Member:

This change is unrelated and I believe this is accidental - please remove

Contributor Author (@chenxi-yang):
Removed.

Contributor:
Can you also add this to llama4's private repo?


@DarkLight1337 added the ready label Aug 4, 2025
@DarkLight1337 enabled auto-merge (squash) August 4, 2025 02:29
@vllm-bot merged commit e5949e5 into vllm-project:main Aug 4, 2025 (50 of 53 checks passed)
@DarkLight1337 added the multi-modality label Aug 4, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>