Remove index_put from MM embeddings merging #22105
Conversation
This pull request was exported from Phabricator. Differential Revision: D79405697
Code Review
This pull request introduces a significant performance optimization by replacing index_put with masked_scatter_ for merging multimodal embeddings, which avoids a D2H sync point. The latency improvement is impressive.
I've identified two issues:

- In `vllm/model_executor/models/utils.py`, the exception handling for the new `masked_scatter_` operation can obscure the original error traceback, making debugging more difficult. I've suggested a change to preserve the original exception.
- There's an unrelated change in `vllm/v1/worker/gpu_model_runner.py` that modifies how token IDs are copied from GPU to CPU. This should be moved to a separate pull request for clarity and proper review.
Once these points are addressed, the main change looks good to go.
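For context, a minimal sketch of the before/after merging logic described above. Tensor names follow the PR description (`inputs_embeds`, `is_multimodal`); the shapes are toy values for illustration, not the actual vLLM code:

```python
import torch

# Toy setup: 6 token positions, hidden size 4, 3 multimodal positions.
inputs_embeds = torch.zeros(6, 4, device="cuda")
is_multimodal = torch.tensor([0, 1, 1, 0, 1, 0], dtype=torch.bool, device="cuda")
mm_embeds = torch.randn(3, 4, device="cuda")  # flattened MM embeddings

# Before: boolean index_put. PyTorch lowers this to nonzero(), whose output
# shape is data-dependent, so the host must wait for the device (D2H sync).
inputs_embeds[is_multimodal] = mm_embeds

# After: masked_scatter_ consumes the mask on-device, filling masked rows
# with source elements in order; no host-side shape check, no sync.
inputs_embeds.masked_scatter_(is_multimodal.unsqueeze(-1), mm_embeds)
```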
vllm/v1/worker/gpu_model_runner.py (outdated)
This change appears to be unrelated to the PR's title and description, which are about optimizing multimodal embeddings merging; it modifies the GPU-to-CPU data transfer for sampled_token_ids.
Mixing unrelated changes in a single PR makes the review process difficult and the commit history hard to follow. Please revert this change and submit it as a separate pull request with its own title, description, and justification.
Additionally, the new implementation using torch.cuda.Event and synchronize() appears to be a synchronous operation, similar to the original .tolist(). The performance benefits are not immediately obvious and would need to be explained and benchmarked in its own PR.
```diff
- valid_sampled_token_ids = sampled_token_ids.tolist()
+ pinned = torch.empty_like(sampled_token_ids, device='cpu', pin_memory=True)
+ transfer_event = torch.cuda.Event()
+ pinned.copy_(sampled_token_ids, non_blocking=True)
+ transfer_event.record()
+ transfer_event.synchronize()
+ valid_sampled_token_ids = pinned.tolist()
```
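Worth noting, in support of the reviewer's point: `transfer_event.synchronize()` blocks the host until the recorded copy completes, so this variant still synchronizes. It trades the implicit sync inside `.tolist()` for an explicit one, and any speedup would have to come from staging the copy through pinned memory rather than from asynchrony.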
vllm/model_executor/models/utils.py (outdated)
Raising a new ValueError here obscures the original traceback from the RuntimeError. This can make it more difficult to debug issues that are not related to a shape mismatch but still cause a RuntimeError (e.g., type or device mismatches).
To preserve the full context of the error, it's better to re-raise the original exception or chain it.
```diff
  else:
-     raise ValueError("Error during masked scatter operation:", e)
+     # Re-raise the original exception to preserve the traceback
+     # for easier debugging of unexpected errors.
+     raise e
```
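For reference, a minimal runnable sketch of the chaining option (toy tensors, not the actual vLLM code):

```python
import torch

inputs_embeds = torch.zeros(4, 8)
mask = torch.tensor([True, False, True, False]).unsqueeze(-1)
too_small = torch.randn(1, 8)  # mask selects 2 rows, so this source is too small

try:
    inputs_embeds.masked_scatter_(mask, too_small)
except RuntimeError as e:
    # `raise ... from e` keeps the original RuntimeError attached as
    # __cause__, so both tracebacks are printed; a bare `raise` (or
    # `raise e`) re-raises the original exception unchanged.
    raise ValueError("Error during masked scatter operation") from e
```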
ywang96 left a comment:
Thanks for your contribution! I left some comments. Can you run an end-to-end benchmark and show the overall improvement? Thanks!
vllm/model_executor/models/utils.py (outdated)
Thanks for making this change!
I was also hoping PyTorch would address pytorch/pytorch#57515, but let's make this change on our side.
@ywang96 With the upgrade to PyTorch 2.9, this should be handled correctly for index_put with a boolean mask, as long as the mask is on the CPU (which at least used to be the case here). See pytorch/pytorch#156384 for more info.
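As an aside, one way to check for implicit syncs like this is PyTorch's sync debug mode. A minimal sketch (the warning behavior is an expectation, not something verified in this PR):

```python
import torch

# Ask PyTorch to warn whenever an op forces a CUDA synchronization.
torch.cuda.set_sync_debug_mode("warn")

inputs_embeds = torch.zeros(6, 4, device="cuda")
mask_gpu = torch.tensor([0, 1, 1, 0, 1, 0], dtype=torch.bool, device="cuda")
src = torch.randn(3, 4, device="cuda")

# Boolean index_put with a GPU mask: expected to warn (nonzero() syncs).
inputs_embeds[mask_gpu] = src

# masked_scatter_: expected to stay silent (no host-side shape check).
inputs_embeds.masked_scatter_(mask_gpu.unsqueeze(-1), src)
```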
vllm/v1/worker/gpu_model_runner.py (outdated)
This change is unrelated and I believe it is accidental - please remove it.
Removed.
vllm/model_executor/models/utils.py (outdated)
Can you also add this to llama4's private repo?
Summary: Pull Request resolved: vllm-project#22105

Previously, `_merge_multimodal_embeddings` used `inputs_embeds[is_multimodal] = flattened` to merge the MM embeddings. This index_put operation calls `nonzero` in PyTorch, which forces an additional D2H shape-check sync point. This diff uses `masked_scatter_` to bypass the D2H sync point. The latency changes from 35ms to 0.08ms.

Test Plan: E2E results:

| Metric | Before | After |
| --- | --- | --- |
| QPS | 0.91 | 0.94 |
| Avg latency | 8.542s | 8.456s |
| Avg TTFT (client) | 4186.02ms | 4089.09ms |
| P50 TTFT (client) | 4589.01ms | 4510.47ms |
| P99 TTFT (client) | 6897.92ms | 6902.79ms |
| Avg TTIT (client) | 21.78ms | 21.83ms |
| P50 TTIT (client) | 19.99ms | 19.89ms |
| P99 TTIT (client) | 41.25ms | 41.02ms |
| Avg TTFT (server) | 5657.19ms | 3808.06ms |
| Avg TTIT (server) | 130.71ms | 79.76ms |
| Avg prefill len | 22284.20 tokens | 22284.26 tokens |
| P50 prefill len | 22284.00 tokens | 22284.00 tokens |
| P99 prefill len | 22291.00 tokens | 22291.00 tokens |
| Avg decode len | 200.00 tokens | 200.00 tokens |
| P50 decode len | 200.00 tokens | 200.00 tokens |
| P99 decode len | 200.00 tokens | 200.00 tokens |

Rollback Plan:

Differential Revision: D79405697
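Reading the table: the server-side metrics show the largest gains, with Avg TTFT (server) dropping from 5657.19ms to 3808.06ms (roughly 33% lower) and Avg TTIT (server) from 130.71ms to 79.76ms (roughly 39% lower), while client-side latency and QPS move only slightly.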