[cpu][feat] Enable SWA and hybrid local-global models for CPU backend #27650
Conversation
Code Review
This pull request enables Sliding Window Attention (SWA) and hybrid local-global attention models for the CPU backend. It also includes a bug fix for an in-place tensor modification and refactors attention test utilities into a common module. The changes are well-structured, and the addition of a new unit test for the CPU decode attention phase is a valuable improvement. The SWA implementation is block-granular, which is a reasonable approach given the current paged attention API. However, I've identified a critical bug in the refactored test utility code that needs to be addressed.
💡 Codex Review
Here are some automated review suggestions for this pull request.
@codex review
💡 Codex Review
Codex left a suggestion on `vllm/v1/attention/backends/cpu_attn.py`, lines 849 to 851 (as of commit 89b58e3).
Oh. Thanks for the efforts. Looks like we have some overlap. I'm also working on refactoring the CPU attention backend to clean up the code and enable features including GQA, SWA, softcap, ALiBi, attention sink, and chunked prefill together. The code is hosted in vLLM so all architectures can leverage it. See: https://github.com/bigPYJ1151/vllm/tree/new_attn
@bigPYJ1151 thank you for your comment and the heads-up. W.r.t. timelines, when do you expect your changes to make it into vLLM?
Closing in favor of #27954 |
Purpose
- Enables Sliding Window Attention (SWA) for the CPU backend.
- Enables models with hybrid local-global attention (i.e., those with multiple KV cache groups) for the CPU backend.
- Fixes a bug in `TorchSDPAMetadataBuilderV1` where `query_start_loc_cpu` was updated in-place despite being shared between multiple KV cache groups (see the sketch after this list).
- Adds a unit test for the decode phase of the CPU attention backend.
- Moves the reference implementations used to test paged attention into a common utils module and reuses them for the CPU decode attention test.
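For illustration only (this is not the PR's diff): a minimal sketch of the aliasing problem the in-place fix addresses. `query_start_loc_cpu` is the real field name; the `split_decode_offsets_*` helpers and the exact arithmetic are hypothetical simplifications.

```python
import torch

# query_start_loc_cpu is built once per step and then handed to every
# KV cache group's metadata builder.
shared_loc = torch.tensor([0, 3, 7, 12], dtype=torch.int32)

def split_decode_offsets_buggy(query_start_loc_cpu, decode_start):
    # BUG: slicing returns a view, so the in-place subtraction also rewrites
    # the shared tensor that the other KV cache groups will read next.
    decode_loc = query_start_loc_cpu[decode_start:]
    decode_loc -= decode_loc[0].item()
    return decode_loc

def split_decode_offsets_fixed(query_start_loc_cpu, decode_start):
    # Fix: copy before mutating, leaving the shared tensor untouched.
    decode_loc = query_start_loc_cpu[decode_start:].clone()
    decode_loc -= decode_loc[0].item()
    return decode_loc

split_decode_offsets_fixed(shared_loc, 1)
print(shared_loc)  # tensor([ 0,  3,  7, 12], dtype=torch.int32) -- unchanged
```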
SWA is enabled by truncating full blocks/pages that fall outside the window (similar to #23010). Hence, the implementation is block-granular and may attend to at most (block_size - 1) extra tokens; this caused no degradation in generation quality for the tests I ran (a rough sketch of the idea follows). This is the best we can do with the current paged-attention API/implementation, which does not support SWA natively. See #4768 for context and the attempt to enable native SWA support in paged_attn.
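Not the code from this PR, just a minimal sketch of the block-granular idea under assumed names (`block_table`, `block_size`, and `sliding_window` are illustrative parameters): whole pages that end before the window are dropped, so at most `block_size - 1` out-of-window tokens remain visible.

```python
def truncate_block_table_for_swa(block_table: list[int], seq_len: int,
                                 sliding_window: int, block_size: int):
    """Drop KV-cache blocks that lie entirely outside the sliding window."""
    if sliding_window <= 0 or seq_len <= sliding_window:
        return block_table, seq_len
    # Oldest token index that must stay visible for the current position.
    first_visible_token = seq_len - sliding_window
    # Only blocks that end at or before that token can be dropped whole.
    num_dropped_blocks = first_visible_token // block_size
    kept_blocks = block_table[num_dropped_blocks:]
    kept_len = seq_len - num_dropped_blocks * block_size
    # kept_len - sliding_window == first_visible_token % block_size,
    # i.e. at most (block_size - 1) extra tokens are still attended to.
    return kept_blocks, kept_len

# Example: a 60-token sequence, window 32, 16-token pages.
print(truncate_block_table_for_swa([10, 11, 12, 13], 60, 32, 16))
# -> ([11, 12, 13], 44): the 32-token window plus 12 older tokens from the
#    partially covered page.
```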
Test Plan
Ran `google/gemma-2-2b-it` (alternates between full attention and SWA with window = 4096). The tests include prompts (e.g. asking the model to explain code) with lengths > and < the window size (e.g. 4740, 6618, 6047, 25, 26 tokens) and generations of up to 1024 tokens. A sketch of this kind of run is shown below.
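The exact harness is not part of the description; this is a minimal sketch of such a run using vLLM's offline `LLM` API, with placeholder prompts rather than the ones used in the PR:

```python
from vllm import LLM, SamplingParams

# Placeholder prompts: one well beyond the 4096-token window, one well inside it.
long_prompt = "Explain the following code in detail:\n" + "x = x + 1\n" * 3000
short_prompt = "Explain what sliding window attention is in two sentences."

llm = LLM(model="google/gemma-2-2b-it")  # alternates full-attention and SWA layers
params = SamplingParams(max_tokens=1024, temperature=0.0)
for out in llm.generate([long_prompt, short_prompt], params):
    print(out.outputs[0].text)
```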
Test Result
Ran `google/gemma-2-2b-it` with vLLM and made sure that all generations are meaningful and close enough to what one gets with `transformers`.
Essential Elements of an Effective PR Description Checklist
- (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.