use combo kernel to fuse qk-norm and qk-rope #26682
Conversation
Code Review
This pull request introduces a new configuration option, use_horizontal_fusion, to enable horizontal fusion for qk-norm and qk-rope operations in PyTorch Inductor. This is achieved by setting combo_kernels and benchmark_combo_kernel in the Inductor configuration when the feature is enabled and the PyTorch version is 2.9.0.dev or newer. My main feedback is regarding the default value of the new flag. For stability, it would be safer to disable this experimental feature by default and allow users to opt-in.
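To make the mechanism concrete, here is a minimal sketch of the gating the review describes. The flag name and the two Inductor keys come from the PR; the version check and surrounding structure are illustrative assumptions, not the PR's actual code:

```python
import torch
from packaging.version import Version

use_horizontal_fusion = True  # the PR's new flag; its default is what's debated here

inductor_compile_config: dict = {}
if use_horizontal_fusion and Version(torch.__version__) >= Version("2.9.0.dev0"):
    # Combo kernels are the Inductor feature that implements horizontal fusion.
    inductor_compile_config["combo_kernels"] = True
    inductor_compile_config["benchmark_combo_kernel"] = True
```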
vllm/config/compilation.py

    since we know all keys are in a range [0, max_capture_size],
    we can optimize it to list[int] for better lookup performance."""

    use_horizontal_fusion = True
The use_horizontal_fusion flag is enabled by default. This will automatically enable the combo_kernels feature in PyTorch Inductor for users on versions 2.9.0.dev or newer. Since this relies on a feature in a development version of PyTorch, it may be unstable. It would be safer to set this to False by default to prevent potential issues for users on bleeding-edge PyTorch versions. Users can then explicitly opt-in to enable this experimental feature.
Suggested change:
-    use_horizontal_fusion = True
+    use_horizontal_fusion = False
We want to enable by default since it benefits models in general.
Main thing I think around here is testing: do we have a plan around that, or are we yolo-ing this? @BoyuanFeng @ProExpertProg
    # use horizontal fusion, which is useful for fusing qk-norm and
    # qk-rope when query and key have different shapes.
    self.inductor_compile_config["combo_kernels"] = True
    self.inductor_compile_config["benchmark_combo_kernel"] = True
I would appreciate a doc pointer to how this works so I can understand it for future work. Currently this is very opaque.
We have some docs here. I will follow up with more PyTorch docs.
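In the meantime, for anyone reading along: the two settings above are ordinary Inductor config fields, so they can be experimented with outside vLLM. A minimal sketch (the field names are real Inductor options; everything else here is illustrative):

```python
import torch._inductor.config as inductor_config

# Flip the same two knobs globally before compiling anything:
inductor_config.combo_kernels = True           # allow horizontal (combo) fusion
inductor_config.benchmark_combo_kernel = True  # benchmark the combo kernel and
                                               # keep it only if it is faster

# Running a compiled model with TORCH_LOGS="output_code" then dumps the
# generated Triton kernels, including any that Inductor combined.
```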
@zou3519 tested for
Yolo here seems acceptable; we can add tests in the future.
This PR enables horizontal fusion from Inductor. This is helpful for fusing q-norm & k-norm into one kernel, and q-rope & k-rope into one kernel. These were not fused before, since q and k have different shapes, which prevented those fusion optimizations from happening.
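To illustrate the pattern, a hedged, standalone sketch (not the PR's code): the same RMS normalization is applied to q and k, which have different head counts under GQA, so the two kernels could not previously be fused. The shapes below are illustrative:

```python
import torch

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    # RMS-normalize q and k independently. Under GQA, q and k have different
    # numbers of heads, so vertical fusion cannot merge these two kernels.
    q = q * torch.rsqrt(q.pow(2).mean(-1, keepdim=True) + eps)
    k = k * torch.rsqrt(k.pow(2).mean(-1, keepdim=True) + eps)
    return q, k

# With combo kernels enabled, Inductor may emit one horizontally fused kernel
# covering both normalizations instead of two separate ones.
compiled = torch.compile(
    qk_norm, options={"combo_kernels": True, "benchmark_combo_kernel": True}
)

# Illustrative GQA-style shapes (requires a CUDA device):
q = torch.randn(1, 16, 128, device="cuda")  # 16 query heads
k = torch.randn(1, 8, 128, device="cuda")   # 8 KV heads
q_out, k_out = compiled(q, k)
```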
The following trace comes from Qwen3-0.6B.
Before:
After (together w/ #26680):
[kernel trace screenshot]
After qkv_proj, we reduce from 5 kernels to 2: one for qk-norm and one for qk-rope.
Performance
Applying this PR + #26680
Qwen/Qwen3-0.6B
Before:
[benchmark screenshot]
After:
[benchmark screenshot]