[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead #21075
Conversation
Code Review
This pull request introduces a performance optimization by pre-allocating the SSM output tensor, which avoids an unnecessary device-to-device copy. The approach is sound and the changes are well-contained. I've identified one critical issue related to tensor sharding that would cause an assertion failure when using tensor parallelism. Addressing this should make the implementation robust.
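For context, a minimal sketch of the sharding concern, assuming the common pattern of splitting the intermediate dimension across tensor-parallel ranks; the helper name and arguments below are hypothetical, not vLLM's actual API:

```python
import torch


def preallocate_ssm_out(num_tokens: int,
                        intermediate_size: int,
                        tp_world_size: int,
                        device: torch.device,
                        dtype: torch.dtype) -> torch.Tensor:
    # Hypothetical helper: each tensor-parallel rank computes only a shard of
    # the intermediate dimension, so the preallocated output must be sized
    # per rank; allocating the full width would not match the per-rank kernel
    # output shape and could trip a shape assertion when tp_world_size > 1.
    assert intermediate_size % tp_world_size == 0
    per_rank_size = intermediate_size // tp_world_size
    return torch.empty(num_tokens, per_rank_size, device=device, dtype=dtype)
```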
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from f9ab16e to 5f73b79
Force-pushed from 875c81f to 3873218
This looks like a reasonable optimization.
My main comment is that this leaves the interface to the mamba_ssm functions more complicated than it was before. Now they support both in-place updating and out-of-place allocation of the outputs, and we need to handle those two cases in a few different places.
Could we change it to always be in-place instead?
I think I kept the original logic as a fallback, but you're right, we can remove it. I will push a simplified version if it is safe to remove.
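For illustration, a rough sketch of the two interface styles under discussion; the function names and bodies below are made up, not the actual mamba_ssm kernel signatures:

```python
from typing import Optional

import torch


def scan_dual_mode(x: torch.Tensor,
                   out: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Dual-mode: callers may pass a preallocated buffer or rely on the
    # function to allocate one, so every call site has to handle both cases.
    if out is None:
        out = torch.empty_like(x)
    out.copy_(x)  # stand-in for the actual selective scan
    return out


def scan_in_place(x: torch.Tensor, out: torch.Tensor) -> None:
    # In-place only: the caller always owns the preallocated output buffer,
    # leaving a single code path to maintain in the kernel wrappers.
    out.copy_(x)  # stand-in for the actual selective scan
```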
@tlrmchlsmth
I tried to run both plamo2 and phi4flash on main (not the PR branch) and they both failed to run.
Force-pushed from 3873218 to b165a18
Fixed the models that call the affected kernels: plamo2, phi4flash.
Head branch was pushed to by a user without write access
Force-pushed from b165a18 to 19651f2
Force-pushed from d59b61d to 97c9a70
    state_batch_indices=None,
    pad_slot_id=PAD_SLOT_ID,
-   preallocated_ssm_out=None):
+   out=None):
I think out needs to be a required argument now, because it is not allocated within the function anymore.
Good point. Will address this in an upcoming PR.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
This PR uses a preallocated output tensor for the SSM output from both the decode and prefill paths, instead of allocating individual tensors and then concatenating them with torch.vstack. We observed that the original approach causes an unnecessary D2D copy.
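For illustration, a minimal before/after sketch of the idea, with made-up shapes and a trivial elementwise stand-in for the actual scan kernels:

```python
import torch

num_decode, num_prefill, hidden = 8, 120, 4096
hidden_states = torch.randn(num_decode + num_prefill, hidden)

# Before: the decode and prefill paths allocate separate outputs, which are
# then combined with torch.vstack -- an extra device-to-device copy.
decode_out = hidden_states[:num_decode] * 2.0    # stand-in for the decode scan
prefill_out = hidden_states[num_decode:] * 2.0   # stand-in for the prefill scan
combined = torch.vstack([decode_out, prefill_out])

# After: a single preallocated output; each path writes directly into its
# slice, so no concatenation (and no extra copy) is needed afterwards.
ssm_out = torch.empty_like(hidden_states)
torch.mul(hidden_states[:num_decode], 2.0, out=ssm_out[:num_decode])
torch.mul(hidden_states[num_decode:], 2.0, out=ssm_out[num_decode:])

assert torch.allclose(combined, ssm_out)
```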
Test Plan
Test Result
Experiments were done on a single H100-80GB.
benchmark_serving.py
Before (#1c3198b)
After
No performance degradation.
lm_eval
Before (#1c3198b)
After
(Optional) Documentation Update