Override attention metadata for fast prefill in some KV sharing setups #21590
Conversation
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request introduces an option to compute and propagate padded `logits_indices` to the model's forward pass, controlled by the `VLLM_COMPUTE_PADDED_LOGITS_INDICES` environment variable. The changes span environment variable handling, forward context management, and the GPU model runner logic. My review found a critical typo in the environment variable name that would prevent the feature from working, and a potential `IndexError` when handling an empty `logits_indices` tensor. Both issues include code suggestions for fixes.
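A minimal sketch of the empty-tensor guard the review is pointing at (the helper name and padding scheme here are hypothetical, not the PR's actual code):

```python
import torch

def pad_logits_indices(logits_indices: torch.Tensor,
                       padded_size: int) -> torch.Tensor:
    # Hypothetical helper: pad by repeating the last index up to the
    # captured batch size. On an empty tensor, logits_indices[-1] would
    # raise IndexError, hence the early return.
    if logits_indices.numel() == 0:
        return logits_indices
    pad = logits_indices[-1].expand(padded_size - logits_indices.numel())
    return torch.cat([logits_indices, pad])
```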
I'm thinking of adding the final flag `enable_kv_sharing_truncated_prefill` to this PR instead of introducing a temporary environment variable. We can add a warning to show that this feature is WIP.

I think a better granularity for splitting the PRs is to compute a real `gen_indices_padded` in this PR. Then you can test it with Gemma-3n, which has YOCO. Can you add a simple test based on this model and update it gradually as the smaller follow-up PRs land? For this PR, I think we can run the model as normal and return tensors along the lines of:

```python
# Gather the rows for the padded generation indices
second_hidden_states = hidden_states[gen_indices_padded]
second_residual = residual[gen_indices_padded]

# Allocate full-size outputs, then scatter the real rows back
hidden_states = torch.randn_like(hidden_states)
residual = torch.randn_like(residual)
hidden_states[gen_indices] = second_hidden_states[:num_gen_indices]
residual[gen_indices] = second_residual[:num_gen_indices]
```
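For reference, a self-contained toy version of that gather/scatter pattern (all shapes and index values below are made up for illustration):

```python
import torch

hidden_size = 8
num_tokens = 16
gen_indices = torch.tensor([3, 7, 12])   # real generation positions
num_gen_indices = gen_indices.numel()
# Padded to a hypothetical CUDA graph batch size of 4 by repeating the last index
gen_indices_padded = torch.tensor([3, 7, 12, 12])

hidden_states = torch.randn(num_tokens, hidden_size)
second_hidden_states = hidden_states[gen_indices_padded]  # (4, hidden_size)

# Only the first num_gen_indices rows are real; scatter them back
hidden_states[gen_indices] = second_hidden_states[:num_gen_indices]
```

Computing on `gen_indices_padded` keeps the batch shape aligned with a captured CUDA graph size, while the scatter through `gen_indices` restores only the real rows.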
…ndices Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
Thanks for your update. Do we need to pass the args via the LLM class? My other comments are small nits.
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
You're right, we don't need it. Updated to address comments.
LGTM! Thank you very much.
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
taking a look now (EDIT: seems the test is flaky; partial fix in #22038)
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: x22x22 <wadeking@qq.com>
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
vllm-project#21590) Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Essential Elements of an Effective PR Description Checklist
Update `supported_models.md` and `examples` for a new model.

Purpose
To support YOCO prefill skip for cross-decoder layers in #19719, we need to propagate `logits_indices` that has been padded to align with the CUDA graph captured batch sizes, as well as the size of the original `logits_indices`.

To do this, we identify which layers are eligible, store these layer names in `truncated_prefill_eligible_layers`, and dynamically build a subclass of attention metadata for these layers that inherits from the original attention metadata class but adds two new fields: `logits_indices_padded` and `num_logits_indices`.
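A rough sketch of that dynamic subclassing, assuming the attention metadata classes are dataclasses (the helper name and construction here are hypothetical, not the PR's actual code):

```python
from dataclasses import field, make_dataclass
from typing import Optional

import torch

def make_fast_prefill_metadata_cls(attn_metadata_cls: type) -> type:
    # Derive a subclass of the original attention metadata class that
    # carries the two extra fields; defaults keep the dataclass field
    # ordering valid regardless of the base class's own fields.
    return make_dataclass(
        attn_metadata_cls.__name__ + "FastPrefill",
        [("logits_indices_padded", Optional[torch.Tensor],
          field(default=None)),
         ("num_logits_indices", int, field(default=0))],
        bases=(attn_metadata_cls,),
    )
```

Test Plan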
unit test:
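(The exact command was not captured above; a plausible invocation, with a hypothetical test path:)

```bash
pytest -s -v tests/v1/test_kv_sharing_truncated_prefill.py
```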
run with new flag:
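(Also hypothetical; the flag spelling follows the name discussed in review and may not match the merged CLI:)

```bash
vllm serve google/gemma-3n-E2B-it --enable-kv-sharing-truncated-prefill
```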
Test Result
Unit tests pass
See new warning msg:
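(The message itself was not captured; a sketch of how such a WIP warning is typically emitted, assuming vLLM's `init_logger`/`warning_once` helpers, with made-up wording:)

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

# Hypothetical wording; the PR's actual warning text differs.
logger.warning_once(
    "enable_kv_sharing_truncated_prefill is experimental and may change "
    "or be removed in a future release.")
```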