[CPU] Refactor CPU attention backend #27954
Conversation
Code Review
This pull request introduces a significant refactoring of the CPU attention backend, replacing the previous implementation with a new unified kernel. This new kernel adds support for features like sliding window, alibi, softcap, and sink, and includes optimizations for AMX BF16 instructions. The changes are extensive, touching build configurations, C++ kernels, Python backend logic, and tests. While the refactoring is a great improvement, I've identified a critical regression that breaks support for non-causal attention, which is necessary for encoder-decoder models. My review includes suggestions to address this issue.
Amazing work! Thanks for raising this :)

The changes this PR introduces are massive.
    self.sinks = sinks
    if self.sinks is not None:
        assert self.sinks.shape[0] == num_heads, (
When use_sdpa_prefill is true, we use vanilla SDPA, which does not support sinks.
Can we dispatch to cpu_attention_with_kv_cache when we have sinks even if use_sdpa_prefill is true?
If that's not possible for whatever reason, we should raise an error, and I can address it for the non-Intel CPU path in a follow-up PR.
It seems we can't do this, as we are unable to get the sink config in the builder. Just added an assertion for now.
Yeah, let's just fail for now; I'll address this in a follow-up PR.
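For reference, a minimal sketch of the "fail for now" behaviour discussed above, using a simplified constructor; `sinks` and the shape check follow the quoted snippet, while having `use_sdpa_prefill` available at this point is an assumption:

```python
# Sketch only: a simplified constructor illustrating the "fail for now" check.
# Having `use_sdpa_prefill` visible here is an assumption, not the actual code.
import torch


class CPUAttentionImplSketch:
    def __init__(
        self,
        num_heads: int,
        sinks: torch.Tensor | None = None,
        use_sdpa_prefill: bool = False,
    ) -> None:
        self.sinks = sinks
        if self.sinks is not None:
            assert self.sinks.shape[0] == num_heads, (
                "sinks must provide one value per attention head"
            )
            # Vanilla SDPA has no notion of attention sinks, so refuse the
            # combination instead of silently computing wrong outputs.
            assert not use_sdpa_prefill, (
                "Attention sinks are not supported with the SDPA prefill path"
            )
```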
    @@ -0,0 +1,31 @@
    #ifndef SCRATCHPAD_MANAGER_H
Can we leave the oneDNN changes out? This PR is already too big, and I don't think these changes are relevant to the new attention backend.
Yes, it is a bit unrelated. But I think it is acceptable, as it's just a few lines of code 😂
Haha okay!!
        s_aux=s_aux,
    )

    atol, rtol = 1.5e-2, 1e-2
The absolute tolerance looks too high. Can we use:
from tests.kernels.allclose_default import get_default_atol, get_default_rtol
atol = get_default_atol(output)
rtol = get_default_rtol(output)
similar to what we do in https://github.com/vllm-project/vllm/blob/main/tests/kernels/attention/test_attention.py
This test file is based on https://github.com/vllm-project/vllm/blob/main/tests/kernels/attention/test_flash_attn.py
I think the absolute tolerance looks stricter in test_attention.py because the input is initialized with uniform_(-scale, scale), which is smaller compared with the inputs initialized with randn in test_flash_attn.py.
But I prefer to use randn, as I found that with small inputs the value differences in the sink test cases sometimes don't show up.
Acknowledged, I wasn't aware that's the tolerance used for testing flash attention.
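To make the tolerance point concrete, here is a hypothetical comparison of the two initialization styles mentioned above (the shape and scale values are made up for illustration, not taken from the tests):

```python
# Hypothetical comparison of the two input initializations discussed above;
# the shape and scale are made up for illustration.
import torch

head_size = 128
scale = head_size**-0.5

uniform_inputs = torch.empty(4096).uniform_(-scale, scale)  # test_attention.py style
randn_inputs = torch.randn(4096)                            # test_flash_attn.py style

# randn inputs (and hence the attention outputs) are much larger in magnitude,
# so the same absolute tolerance is effectively tighter relative to the values
# being compared than it is for small uniform inputs.
print(uniform_inputs.abs().mean().item(), randn_inputs.abs().mean().item())
```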
    ):
        skip = True

    # only tests features with bf16 to save time
Then why not just do QTYPES = [torch.bfloat16]?
This means we only test sink, alibi, and softcap with bf16, as the logits processing uses fp32. For the other cases, all dtypes should be tested.
Oh, I missed the second line of the condition. I agree with you.
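For clarity, a hedged sketch of the skip logic being discussed; the variable names (`use_sinks`, `use_alibi`, `softcap`) are assumptions, not the actual test code:

```python
# Hedged sketch of the dtype/feature skip condition discussed above; the
# variable names are assumptions and this is not the actual test code.
import torch

QTYPES = [torch.float32, torch.bfloat16, torch.float16]


def should_skip(
    dtype: torch.dtype,
    use_sinks: bool,
    use_alibi: bool,
    softcap: float | None,
) -> bool:
    skip = False
    if (
        dtype is not torch.bfloat16
        # Only test sink / alibi / softcap with bf16 to save time; the logits
        # processing runs in fp32, so other dtypes add little extra coverage.
        and (use_sinks or use_alibi or softcap is not None)
    ):
        skip = True
    return skip
```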
| "Qwen/Qwen3-8B", # qwen (text-only) | ||
| ), | ||
| pytest.param("stabilityai/stablelm-3b-4e1t"), # stablelm | ||
| pytest.param("bigcode/starcoder2-3b"), # starcoder2 |
Can we enable a test for google/gemma-2-2b-it and mark it as cpu_model?
This would be a great end-to-end smoke test for SWA and hybrid local-global attention models (with 2 KV cache groups).
Good idea, added a gemma-2 case.
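A sketch of what the added case could look like; the surrounding parameter list and the exact `cpu_model` marker spelling are assumptions based on the discussion above:

```python
# Sketch of the suggested gemma-2 end-to-end case; the surrounding list and
# the exact marker name are assumptions based on the discussion above.
import pytest

MODELS = [
    pytest.param("stabilityai/stablelm-3b-4e1t"),  # stablelm
    pytest.param("bigcode/starcoder2-3b"),  # starcoder2
    pytest.param(
        # SWA / hybrid local-global attention (2 KV cache groups).
        "google/gemma-2-2b-it",
        marks=[pytest.mark.cpu_model],
    ),
]
```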
vllm/platforms/cpu.py
| " intel_extension_for_pytorch" | ||
| ) | ||
| if cache_config.block_size % 32 != 0: | ||
| block_size = cache_config.block_size |
I'm in favor of just erroring out in this case, saying that block_size needs to be divisible by 32 (instead of setting block_size to a value that the user didn't choose).
Yes... but my concern is that a lot of test cases use 16 by default, and I don't want to add more if-else branches in different files, so I just round it here.
Oh, that's a good point. I agree with you.
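A minimal sketch of the rounding behaviour agreed on above; the `cache_config` object and the logger usage are assumptions, not the actual vLLM code:

```python
# Minimal sketch of rounding block_size up to a multiple of 32, as discussed;
# `cache_config` and the logger are assumptions, not the actual vLLM code.
import logging

logger = logging.getLogger(__name__)


def adjust_block_size(cache_config) -> None:
    if cache_config.block_size % 32 != 0:
        old_block_size = cache_config.block_size
        # Round up to the next multiple of 32 required by the CPU kernel
        # instead of erroring out, so tests defaulting to 16 keep working.
        cache_config.block_size = (old_block_size + 31) // 32 * 32
        logger.warning(
            "Rounding block_size from %d to %d for the CPU attention backend.",
            old_block_size,
            cache_config.block_size,
        )
```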
I ran some end-to-end tests on Arm Neoverse-V2. I eyeballed the end-to-end generations with this new attention backend and can confirm that all generations are meaningful and close enough to what one gets with huggingface.

I can also confirm that there are no perf regressions on Arm after running this benchmark:
csrc/cpu/cpu_attn_impl.hpp
    int32_t reduction_split_num;
    int32_t thread_num;
    int32_t
        effective_thread_num;  // non-zero item num in cu_workitem_num_per_thread
Could you please add a comment explaining this?
What's the significance of the cu_ prefix here?
Oh, it is a mistake; it should be workitem_num_per_thread. The cu means cumulative, and cu_workitem_num_per_thread is an array containing the prefix sums of workitem_num_per_thread.
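To illustrate the naming, here is a conceptual sketch in Python rather than the actual C++ struct; the example values are made up:

```python
# Conceptual sketch of the naming explained above, with made-up values:
# "cu_" stands for cumulative, i.e. the prefix sums of workitem_num_per_thread.
from itertools import accumulate

workitem_num_per_thread = [4, 0, 3, 2]  # work items assigned to each thread
cu_workitem_num_per_thread = list(accumulate(workitem_num_per_thread))  # [4, 4, 7, 9]

# effective_thread_num counts threads that actually received work, i.e. the
# non-zero entries of workitem_num_per_thread.
effective_thread_num = sum(1 for n in workitem_num_per_thread if n > 0)  # 3
```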
I got one test failure with a one-element mismatch while running the tests.

This is the configuration that fails:

And this is the log:

@bigPYJ1151 I'm happy to take a deeper look at this, unless you have hints on what might be the issue?
PR vllm-project#27954 added cpu_attention_with_kv_cache, which supports chunked prefill, prefix caching, SWA, alibi, softcap, and sinks. However, it's currently disabled for prefill on Arm CPUs because it's slower than torch.sdpa for relatively long prefills. Hence chunked prefill, prefix caching, sinks, etc. remained unsupported on Arm.

This PR accelerates cpu_attention_with_kv_cache on Arm CPUs by introducing NEON-accelerated GEMMs (enabled with ISA::NEON) for QK and PV. With the new GEMMs, performance of cpu_attention_with_kv_cache is similar to torch.sdpa for long prefills, which allows us to enable cpu_attention_with_kv_cache for the prefill path on Arm and thus enable chunked prefill, prefix caching, sinks, alibi, softcap, etc.

Performance:

Uplift with ISA::NEON vs ISA::VEC: for batch size = 64, query tokens = kv tokens = 512, q heads = 32, kv heads = 8, head size = 128, block size = 128, using ISA::NEON for cpu_attention_with_kv_cache accelerates prefill attention by 2x compared to the current state with ISA::VEC.

For the throughput benchmark below on Arm Neoverse-V2, using cpu_attention_with_kv_cache for prefills and decodes, ISA::NEON yields ~13% higher throughput than ISA::VEC and similar throughput to using torch.sdpa for prefill.

```
export VLLM_CPU_OMP_THREADS_BIND=0-63
export LD_PRELOAD="/usr/lib/aarch64-linux-gnu/libtcmalloc_minimal.so.4:/usr/lib/aarch64-linux-gnu/libgomp.so.1"
export VLLM_TARGET_DEVICE=cpu
export VLLM_CPU_KVCACHE_SPACE=64
vllm bench throughput \
  --num-prompts 128 \
  --seed 0 \
  --dataset-name sharegpt \
  --input-len 1024 \
  --output-len 128 \
  --max-model-len 2048 \
  --max-num-batched-tokens 8192 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --load-format dummy
```

Future PRs will accelerate attention further by introducing faster/vectorized exp implementations and leveraging bfmmla/bfdot for QK and PV on Arm CPUs with bf16.

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
@bigPYJ1151 Is it possible to add the new build flag to the documentation? https://docs.vllm.ai/en/latest/getting_started/installation/cpu/#build-image-from-source
Purpose
This PR refactors the CPU attention backend. It includes:
- Renaming `TorchSDPABackend` to `CPUAttentionBackend` to avoid misunderstandings. For now, the `TORCH_SDPA` tag is only used for ViT attention.

cc @fadara01 @Akashcodes732
this: [results elided]
main: [results elided]
Test Plan
unit tests
Test Result