[Perf] Optimize reshape_and_cache_flash CUDA Kernel
#22036
Conversation
Code Review
This pull request optimizes the reshape_and_cache_flash CUDA kernel by using vectorization, which results in significant performance improvements. The changes look good, but there is a critical correctness issue. The new implementation assumes a contiguous memory layout for the (num_heads, head_size) dimensions in the KV cache, which is only true for the NHD layout. This breaks support for the HND layout, which is also a supported configuration. I've provided a detailed comment with a suggested fix to address this.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
wow, nice work
mgoin left a comment
LGTM, vectorize_with_alignment should deal with uneven shapes and existing CI should cover this. I'll make sure to unblock a full run just in case
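For readers not familiar with the utility, the general pattern being referred to is a scalar head, vectorized body, scalar tail copy. The sketch below is a hedged illustration of that pattern, not vLLM's actual vectorize_with_alignment implementation (it also assumes src and dst share the same alignment offset):

```cpp
#include <cstdint>

// Illustrative alignment-aware copy: peel off a scalar prologue until the
// destination is WIDTH-aligned, copy the aligned middle VEC elements at a
// time, then finish the remainder element by element, so n does not need
// to be a multiple of VEC.
template <int VEC, typename T>
__device__ void vectorized_copy(const T* src, T* dst, int n,
                                int tid, int stride) {
  constexpr int WIDTH = VEC * sizeof(T);

  // Scalar head: advance until dst is WIDTH-aligned (or the data runs out).
  int head = 0;
  while (head < n &&
         (reinterpret_cast<uintptr_t>(dst + head) % WIDTH) != 0) {
    ++head;
  }
  for (int i = tid; i < head; i += stride) dst[i] = src[i];

  // Vector body: VEC elements per load/store (assumes src + head is also
  // WIDTH-aligned; a real implementation would fall back to scalar copies
  // when it is not).
  struct alignas(WIDTH) Vec { T v[VEC]; };
  const int num_vec = (n - head) / VEC;
  const Vec* vsrc = reinterpret_cast<const Vec*>(src + head);
  Vec* vdst = reinterpret_cast<Vec*>(dst + head);
  for (int i = tid; i < num_vec; i += stride) vdst[i] = vsrc[i];

  // Scalar tail: whatever is left after the last full vector.
  for (int i = head + num_vec * VEC + tid; i < n; i += stride) dst[i] = src[i];
}
```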
Purpose
Use the vectorization utils in reshape_and_cache_flash to get a performance improvement (a simplified sketch of the idea is shown below).
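The sketch below shows the shape of the idea under simplifying assumptions: the kernel name and signature are illustrative rather than the PR's actual code, the NHD cache layout is assumed, and num_heads * head_size is taken to be a multiple of VEC and suitably aligned (the real change relies on vectorize_with_alignment to handle misaligned and uneven cases):

```cpp
#include <cstdint>

// One CUDA block per token: copy that token's key and value rows from the
// [num_tokens, num_heads, head_size] inputs into the paged KV cache using
// wide vector loads/stores instead of one scalar store per element.
template <typename scalar_t, int VEC>
__global__ void reshape_and_cache_flash_sketch(
    const scalar_t* __restrict__ key,          // [num_tokens, num_heads, head_size]
    const scalar_t* __restrict__ value,        // [num_tokens, num_heads, head_size]
    scalar_t* __restrict__ key_cache,          // NHD paged layout assumed
    scalar_t* __restrict__ value_cache,
    const int64_t* __restrict__ slot_mapping,  // [num_tokens]
    int n /* num_heads * head_size, assumed multiple of VEC */) {
  const int64_t token_idx = blockIdx.x;
  const int64_t slot = slot_mapping[token_idx];
  if (slot < 0) return;  // padded token: nothing to cache

  // For NHD ([num_blocks, block_size, num_heads, head_size]) the per-token
  // destination plane is contiguous and starts at slot * n.
  const int64_t dst_off = slot * static_cast<int64_t>(n);

  struct alignas(VEC * sizeof(scalar_t)) Vec { scalar_t v[VEC]; };
  const Vec* ksrc =
      reinterpret_cast<const Vec*>(key + token_idx * static_cast<int64_t>(n));
  const Vec* vsrc =
      reinterpret_cast<const Vec*>(value + token_idx * static_cast<int64_t>(n));
  Vec* kdst = reinterpret_cast<Vec*>(key_cache + dst_off);
  Vec* vdst = reinterpret_cast<Vec*>(value_cache + dst_off);

  // Cooperative copy: each thread handles a strided subset of the vectors.
  for (int i = threadIdx.x; i < n / VEC; i += blockDim.x) {
    kdst[i] = ksrc[i];
    vdst[i] = vsrc[i];
  }
}
```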
Test
Acc
Performance
python benchmark_reshape_and_cache_flash.py