[Perf] Disable chunked local attention by default with llama4 #21761
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Code Review
This pull request correctly addresses a performance regression by disabling chunked local attention with the hybrid KV cache manager by default, while providing an environment variable to re-enable it. The implementation is sound. My only suggestion is to update a comment to more accurately reflect that the change is a performance optimization, which will improve code clarity and maintainability.
```python
# Hybrid KV cache manager is not yet supported with chunked
# local attention.
```
This comment is slightly misleading as it suggests the feature is unsupported, whereas the PR description and warning log indicate it's a performance regression. To improve clarity for future maintenance, it would be better to state the performance-related reason for disabling it.
```diff
-# Hybrid KV cache manager is not yet supported with chunked
-# local attention.
+# Disable hybrid KV cache manager with chunked local attention
+# due to a performance regression.
```
I kind of agree with Gemini here, although you say this in your log
LGTM for the moment
```python
                self.scheduler_config.disable_hybrid_kv_cache_manager = True
            elif \
                not envs.VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE:
                logger.warning(
```
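The quoted diff only shows a fragment of the gating logic. A minimal self-contained sketch of the same pattern, with hypothetical names mirroring the diff (this is not the actual vLLM source), might look like:

```python
# Sketch of the gating pattern in the quoted diff: disable the hybrid
# KV cache manager when chunked local attention is in use, unless an
# opt-in environment variable re-enables it. All names are illustrative.
import logging
import os

logger = logging.getLogger("sketch")

# Opt-in escape hatch, read once at import time (assumed semantics).
VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE = (
    os.getenv("VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE", "0") == "1"
)


class SchedulerConfig:
    disable_hybrid_kv_cache_manager: bool = False


def configure(scheduler_config: SchedulerConfig,
              uses_chunked_local_attention: bool) -> None:
    if not uses_chunked_local_attention:
        return
    if not VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE:
        # Default path: avoid the performance regression by disabling
        # the hybrid KV cache manager for these models.
        scheduler_config.disable_hybrid_kv_cache_manager = True
        logger.warning(
            "Disabling hybrid KV cache manager with chunked local "
            "attention due to a performance regression; set "
            "VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE=1 "
            "to re-enable it.")
```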
nit: warning_once
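The reviewer's suggestion is to log the message only the first time it occurs. `warning_once` is not a stdlib `logging` method; a minimal sketch of a log-once helper, assuming that is the intended behavior, could be:

```python
# Sketch of a "warn once" helper: lru_cache deduplicates by message, so
# each distinct warning is emitted only on its first call. Not vLLM code.
import functools
import logging

logger = logging.getLogger("sketch")


@functools.lru_cache(maxsize=None)
def warning_once(msg: str) -> None:
    # Repeated calls with the same msg hit the cache and skip the log.
    logger.warning(msg)
```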
Merging to solve the regression since we have better solutions on the way.
…roject#21761) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson will we reduce metadata creation with refactoring?
That's the plan; we are working towards: https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1753727605258469?thread_ts=1753202489.248869&cid=C07R5Q1Q2BB but that will be a follow-up PR.
Purpose
Currently, using the hybrid KV cache manager with Llama 4's chunked local attention adds ~2 ms of latency, because we end up with 3 ChunkedLocalAttention KV cache spec groups. We end up with the following groups:
This results in the attention metadata and local virtual batches for the local layers being constructed 3 times, adding latency:
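The cost multiplication described above can be illustrated with a toy sketch (hypothetical names and grouping, not vLLM code): metadata is built once per KV cache spec group, so local-attention layers split across 3 groups pay the build cost 3 times.

```python
# Toy illustration of why duplicate KV cache spec groups are costly:
# one attention-metadata build happens per group, so splitting the
# ChunkedLocalAttention layers across 3 groups triples that work.
from collections import Counter


def build_metadata_calls(groups: list[list[str]]) -> Counter:
    """Count metadata constructions per attention type (one per group)."""
    calls = Counter()
    for group in groups:
        # Attribute the group's single metadata build to its attn type.
        attn_type = group[0].split(":")[0]
        calls[attn_type] += 1
    return calls


# Hypothetical grouping: one full-attention group plus three groups of
# ChunkedLocalAttention layers, as described in the PR.
hybrid_groups = [
    ["full:0", "full:4"],
    ["local:1"],
    ["local:2"],
    ["local:3"],
]
```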
Test Plan
see: #21707
Test Result
see: #21707