
Conversation

@chenxi-yang (Contributor) commented Aug 1, 2025

Summary:
Previously, _merge_multimodal_embeddings used inputs_embeds[is_multimodal] = flattened to merge the multimodal (MM) embeddings. This index_put operation calls nonzero() in PyTorch, which forces an additional D2H shape-check sync point.
This diff uses masked_scatter_ to bypass the D2H sync point. The latency of the merge drops from 35ms to 0.08ms.
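
A minimal sketch of the change (illustrative shapes and names, not the exact vLLM code):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden = 16

inputs_embeds = torch.zeros(8, hidden, device=device)   # text embeddings for 8 tokens
is_multimodal = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool, device=device)
flattened = torch.randn(4, hidden, device=device)        # embeddings for the 4 MM tokens

# Old path: boolean-mask index_put, which calls nonzero() under the hood and
# syncs device-to-host to learn how many positions are selected.
# inputs_embeds[is_multimodal] = flattened

# New path: masked_scatter_ consumes the mask directly, so no nonzero()/sync.
inputs_embeds.masked_scatter_(is_multimodal.unsqueeze(-1), flattened)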

Test Plan:
E2E Results:
Before:

QPS:                 0.91
Avg latency:         8.542s
Avg TTFT (client):   4186.02ms
P50 TTFT (client):   4589.01ms
P99 TTFT (client):   6897.92ms
Avg TTIT (client):   21.78ms
P50 TTIT (client):   19.99ms
P99 TTIT (client):   41.25ms
Avg TTFT (server):   5657.19ms
Avg TTIT (server):   130.71ms
Avg prefill len:     22284.20 tokens
P50 prefill len:     22284.00 tokens
P99 prefill len:     22291.00 tokens
Avg decode len:      200.00 tokens
P50 decode len:      200.00 tokens
P99 decode len:      200.00 tokens

After:

QPS:                 0.94
Avg latency:         8.456s
Avg TTFT (client):   4089.09ms
P50 TTFT (client):   4510.47ms
P99 TTFT (client):   6902.79ms
Avg TTIT (client):   21.83ms
P50 TTIT (client):   19.89ms
P99 TTIT (client):   41.02ms
Avg TTFT (server):   3808.06ms
Avg TTIT (server):   79.76ms
Avg prefill len:     22284.26 tokens
P50 prefill len:     22284.00 tokens
P99 prefill len:     22291.00 tokens
Avg decode len:      200.00 tokens
P50 decode len:      200.00 tokens
P99 decode len:      200.00 tokens

Rollback Plan:

Differential Revision: D79405697

@github-actions bot commented Aug 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@facebook-github-bot
This pull request was exported from Phabricator. Differential Revision: D79405697

@mergify bot added the v1 label Aug 1, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant performance optimization by replacing index_put with masked_scatter_ for merging multimodal embeddings, which avoids a D2H sync point. The latency improvement is impressive.

I've identified two issues:

  1. In vllm/model_executor/models/utils.py, the exception handling for the new masked_scatter_ operation can obscure the original error traceback, making debugging more difficult. I've suggested a change to preserve the original exception.
  2. There's an unrelated change in vllm/v1/worker/gpu_model_runner.py that modifies how token IDs are copied from GPU to CPU. This should be moved to a separate pull request for clarity and proper review.

Once these points are addressed, the main change looks good to go.

Comment on lines 1708 to 1713
critical

This change appears to be unrelated to the PR's title and description, which are about optimizing multimodal embeddings merging. This change modifies the GPU-to-CPU data transfer for sampled_token_ids.

Mixing unrelated changes in a single PR makes the review process difficult and the commit history hard to follow. Please revert this change and submit it as a separate pull request with its own title, description, and justification.

Additionally, the new implementation using torch.cuda.Event and synchronize() appears to be a synchronous operation, similar to the original .tolist(). The performance benefits are not immediately obvious and would need to be explained and benchmarked in its own PR.

Suggested change:
- pinned = torch.empty_like(sampled_token_ids, device='cpu', pin_memory=True)
- transfer_event = torch.cuda.Event()
- pinned.copy_(sampled_token_ids, non_blocking=True)
- transfer_event.record()
- transfer_event.synchronize()
- valid_sampled_token_ids = pinned.tolist()
+ valid_sampled_token_ids = sampled_token_ids.tolist()

Comment on lines 433 to 435
high

Raising a new ValueError here obscures the original traceback from the RuntimeError. This can make it more difficult to debug issues that are not related to a shape mismatch but still cause a RuntimeError (e.g., type or device mismatches).

To preserve the full context of the error, it's better to re-raise the original exception or chain it.

Suggested change:
- else:
-     raise ValueError("Error during masked scatter operation:", e)
+ else:
+     # Re-raise the original exception to preserve the traceback
+     # for easier debugging of unexpected errors.
+     raise e
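
For reference, a minimal sketch of the chained-exception alternative (hypothetical helper, not the actual utils.py code):

import torch

def merge_with_masked_scatter(inputs_embeds: torch.Tensor,
                              is_multimodal: torch.Tensor,
                              flattened: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: if the failure looks like a size mismatch, raise a
    # clearer ValueError but chain it with "from e" so the original traceback
    # survives; otherwise re-raise the RuntimeError untouched.
    try:
        return inputs_embeds.masked_scatter_(is_multimodal.unsqueeze(-1), flattened)
    except RuntimeError as e:
        if flattened.shape[0] != int(is_multimodal.sum()):
            raise ValueError("Error during masked scatter operation") from e
        raise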

@ywang96 (Member) left a comment

Thanks for your contribution! I left some comments. Can you run an end-to-end benchmark and show the overall improvement? Thanks!

Comment on lines 422 to 424
Member:

Thanks for making this change!

I was also hoping PyTorch could address this (pytorch/pytorch#57515), but let's make this change on our side.

Contributor:

@ywang96 With the upgrade to PyTorch 2.9, this should be handled correctly for index_put with a boolean mask as long as the mask is on CPU, which at least used to be the case here. See pytorch/pytorch#156384 for more info.
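
A rough sketch of that pattern (illustrative only; the no-sync behavior with a CPU mask is assumed based on pytorch/pytorch#156384 and PyTorch >= 2.9):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
inputs_embeds = torch.zeros(8, 16, device=device)
flattened = torch.randn(4, 16, device=device)

# The boolean mask stays on the CPU, so the nonzero() triggered by the
# index_put below runs on the host and (per the linked issue) no
# device-to-host shape sync should be needed.
is_multimodal = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)  # CPU mask

inputs_embeds[is_multimodal] = flattened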

Comment on lines 1708 to 1713
Member:

This change is unrelated and I believe this is accidental - please remove

Contributor Author (@chenxi-yang):
Removed.

Contributor:
Can you also add this to llama4's private repo?


@DarkLight1337 added the ready label Aug 4, 2025
@DarkLight1337 enabled auto-merge (squash) August 4, 2025 02:29
@vllm-bot merged commit e5949e5 into vllm-project:main Aug 4, 2025 (50 of 53 checks passed)
@DarkLight1337 added the multi-modality label Aug 4, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Co-authored-by: Chenxi Yang <cxyang@meta.com>