[Perf][Attention] Replace torch.zeros with torch.empty to reduce overhead #28182
base: main
Conversation
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Code Review
This pull request replaces torch.zeros with torch.empty for allocating attention output buffers to reduce overhead, which is a good performance optimization. You've correctly identified and handled the case where the buffer might not be written to in profiling runs by explicitly filling it with zeros, thus preserving the original behavior. However, this change highlights a pre-existing critical bug in qwen3_next.py where a buffer is allocated with an unpadded size but may be written to with a padded size when using CUDAGraphs, leading to a potential out-of-bounds write. I've provided a detailed comment and a code suggestion to fix this issue. The changes in vllm/model_executor/layers/kda.py appear correct and do not have this issue.
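For context, the allocation pattern under discussion looks roughly like the sketch below; the function name, arguments, and `is_profile_run` flag are illustrative placeholders, not the actual vLLM code.

```python
import torch

def allocate_attn_output(num_tokens: int, hidden_size: int,
                         dtype: torch.dtype, device: torch.device,
                         is_profile_run: bool) -> torch.Tensor:
    # torch.empty only reserves memory; unlike torch.zeros it does not launch
    # a device-side fill kernel, which is the overhead this PR removes.
    out = torch.empty(num_tokens, hidden_size, dtype=dtype, device=device)
    if is_profile_run:
        # In a profiling run the attention kernel may never write this buffer,
        # so zero it explicitly to preserve the old torch.zeros semantics.
        out.zero_()
    return out
```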
I don't know why we have a performance regression on kimi linear. Do you have any idea? @BoyuanFeng
I think
lol, did not see triton_poi_fused_4
Yes, for kimi linear, the performance of this PR is consistently lower than main.
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
I'll leave the kimi linear optimization for the next PR since it has a performance regression.
As I remember we need
@codex review
Codex Review: Didn't find any major issues. Swish!
Do you have some related code? The gsm8k result shows that there is no accuracy problem. |
Could you try and maybe different values of
Yes, I think you are right. The results are quite low now.
@vadiklyutiy Maybe we should add a comment to explain why we should not use
I ran tests on the main branch and observed inconsistent results across different commands.
I think all 3 options generate different prompts, so some difference in scores is expected.

Purpose
As #26680 and #19784 mentioned, using `torch.zeros` to allocate the attention output buffer introduces unnecessary kernel overhead. Take qwen3-next as an example:


main
this pr
A triton kernel was eliminated
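The eliminated kernel can be reproduced with a small stand-alone profiler run; the following is an illustrative sketch, not the trace or benchmark used in this PR.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Compare the kernels launched by torch.zeros and torch.empty on the GPU:
# torch.zeros launches an extra fill kernel per call, torch.empty does not.
with profile(activities=[ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        _ = torch.zeros(4096, 4096, device="cuda")
        _ = torch.empty(4096, 4096, device="cuda")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```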
Accuracy test
Perf test
qwen3-next
TL;DR: throughput 6340.16 -> 6756.52
this pr
main
kimi linear has a performance regression
kimi linear
this pr
main