
Conversation

@yewentao256 (Member) commented Jul 31, 2025

Purpose

Use the vectorization utilities in the reshape_and_cache_flash CUDA kernel to improve its performance.
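For context, the kernel's job can be modeled in a few lines of NumPy. This is an illustrative sketch, not vLLM's CUDA implementation: reshape_and_cache_flash scatters each new token's key/value vectors into a paged KV cache at the flat slot given by slot_mapping.

```python
import numpy as np

# Small illustrative sizes (hypothetical, not vLLM defaults).
num_tokens, num_heads, head_size = 4, 2, 8
num_blocks, block_size = 3, 2  # cache holds num_blocks * block_size slots

key = np.arange(num_tokens * num_heads * head_size, dtype=np.float32)
key = key.reshape(num_tokens, num_heads, head_size)

# NHD cache layout: [num_blocks, block_size, num_heads, head_size]
key_cache = np.zeros((num_blocks, block_size, num_heads, head_size), np.float32)

slot_mapping = np.array([5, 0, 3, 2])  # one flat cache slot per token
for token, slot in enumerate(slot_mapping):
    block_idx, block_off = divmod(int(slot), block_size)
    key_cache[block_idx, block_off] = key[token]  # copy one token's K slab

# Token 0 was assigned slot 5, i.e. block 2, offset 1.
assert np.array_equal(key_cache[2, 1], key[0])
```

The real kernel performs this per-token slab copy on the GPU; the optimization in this PR is to issue that copy with vectorized (multi-element) loads and stores instead of scalar ones.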

Test

Acc

lm_eval   --model vllm   --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True"   --trust_remote_code   --tasks gsm8k   --num_fewshot 5   --batch_size auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8173|±  |0.0106|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|
# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8173|±  |0.0106|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|
pytest test_cache.py -x
==================== test session starts ====================
platform linux -- Python 3.12.3, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/wentao/vllm-source
configfile: pyproject.toml
plugins: asyncio-1.0.0, anyio-4.9.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1102 items                                        

test_cache.py ....................................... [  3%]
...................................s...s...s...s...s. [  8%]
..s...s...s...s...s...s...s...s...s...s...s...s...s.. [ 13%]
..................................................... [ 17%]
....................s...s...s...s...s...s...s...s...s [ 22%]
...s...s...s...s...s...s...s...s...s................. [ 27%]
..................................................... [ 32%]
..................................................... [ 37%]
..................................................... [ 42%]
..................................................... [ 46%]
..................................................... [ 51%]
..................................................... [ 56%]
..................................................... [ 61%]
..................................................... [ 66%]
..................................................... [ 70%]
...........s.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.s [ 75%]
ssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss. [ 80%]
ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss [ 85%]
.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.ssss [ 90%]
s.ss.ss.sssss.s...................................... [ 94%]
..................................................... [ 99%]
sss                                                   [100%]

======= 901 passed, 201 skipped in 349.21s (0:05:49) ========

Performance

python benchmark_reshape_and_cache_flash.py

| num_tokens | layout | Old Run (µs) | New Run (µs) | Change (%) |
|-----------:|--------|-------------:|-------------:|-----------:|
|          2 | HND    |       10.326 |        8.323 | -19.4% 🚀 |
|          4 | HND    |       10.440 |        8.355 | -20.0% 🚀 |
|          8 | HND    |       10.356 |        8.344 | -19.4% 🚀 |
|         16 | HND    |       10.330 |        8.372 | -19.0% 🚀 |
|         32 | HND    |       10.345 |        8.348 | -19.3% 🚀 |
|         64 | HND    |       10.454 |        8.354 | -20.1% 🚀 |
|        128 | HND    |       10.397 |        8.370 | -19.5% 🚀 |
|        256 | HND    |       14.431 |       10.375 | -28.1% 🚀 |
|        512 | HND    |       24.809 |       20.137 | -18.8% 🚀 |
|       1024 | HND    |       51.389 |       45.196 | -12.1% 🚀 |
|       2048 | HND    |       96.466 |       77.908 | -19.2% 🚀 |
|       4096 | HND    |      175.695 |      147.068 | -16.3% 🚀 |
|       8192 | HND    |      336.814 |      279.106 | -17.1% 🚀 |
|      16384 | HND    |      668.001 |      547.169 | -18.1% 🚀 |
|      32768 | HND    |     1320.570 |     1082.070 | -18.1% 🚀 |
|      65536 | HND    |     2605.930 |     2149.950 | -17.5% 🚀 |
|          2 | NHD    |       10.371 |        6.649 | -35.9% 🚀 |
|          4 | NHD    |       10.337 |        6.407 | -38.0% 🚀 |
|          8 | NHD    |       10.346 |        6.338 | -38.7% 🚀 |
|         16 | NHD    |       10.352 |        6.394 | -38.2% 🚀 |
|         32 | NHD    |       10.350 |        7.416 | -28.3% 🚀 |
|         64 | NHD    |       10.341 |        7.305 | -29.4% 🚀 |
|        128 | NHD    |       10.349 |        7.614 | -26.4% 🚀 |
|        256 | NHD    |       14.401 |       10.363 | -28.0% 🚀 |
|        512 | NHD    |       25.955 |       15.084 | -41.9% 🚀 |
|       1024 | NHD    |       49.264 |       30.690 | -37.7% 🚀 |
|       2048 | NHD    |       93.674 |       53.726 | -42.6% 🚀 |
|       4096 | NHD    |      172.364 |      101.030 | -41.4% 🚀 |
|       8192 | NHD    |      333.329 |      195.911 | -41.2% 🚀 |
|      16384 | NHD    |      665.351 |      385.012 | -42.1% 🚀 |
|      32768 | NHD    |     1308.720 |      762.607 | -41.7% 🚀 |
|      65536 | NHD    |     2587.800 |     1519.310 | -41.3% 🚀 |

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@mergify mergify bot added the performance Performance-related issues label Jul 31, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request optimizes the reshape_and_cache_flash CUDA kernel by using vectorization, which results in significant performance improvements. The changes look good, but there is a critical correctness issue. The new implementation assumes a contiguous memory layout for the (num_heads, head_size) dimensions in the KV cache, which is only true for the NHD layout. This breaks support for the HND layout, which is also a supported configuration. I've provided a detailed comment with a suggested fix to address this.
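The layout concern above can be demonstrated with a small NumPy sketch (hypothetical sizes; not the kernel code): for one token's (num_heads, head_size) slab, the NHD cache layout is a single dense range, while in the HND layout the heads are block_size rows apart, so a single flat vectorized copy is only safe for NHD.

```python
import numpy as np

num_blocks, block_size, num_heads, head_size = 2, 4, 8, 64

# NHD: [num_blocks, block_size, num_heads, head_size]
nhd = np.zeros((num_blocks, block_size, num_heads, head_size), np.float16)
# HND: [num_blocks, num_heads, block_size, head_size]
hnd = np.zeros((num_blocks, num_heads, block_size, head_size), np.float16)

token_nhd = nhd[0, 1]        # one token's K slab, shape (num_heads, head_size)
token_hnd = hnd[0, :, 1, :]  # same logical slab, but a strided view

assert token_nhd.flags["C_CONTIGUOUS"]      # NHD: one dense memory range
assert not token_hnd.flags["C_CONTIGUOUS"]  # HND: heads separated by stride
```

A vectorized kernel therefore has to either restrict the flat-copy path to NHD or compute per-head offsets from the cache strides, which is the gist of the suggested fix.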

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@robertgshaw2-redhat (Collaborator)

wow, nice work

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 1, 2025
@mgoin (Member) left a comment

LGTM, vectorize_with_alignment should deal with uneven shapes and existing CI should cover this. I'll make sure to unblock a full run just in case
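How an alignment-aware helper copes with uneven shapes can be sketched in plain Python (an assumed model of the idea, not vLLM's vectorize_with_alignment itself): copy scalar elements until the pointer is aligned, copy the bulk in fixed-width "vector" chunks, then finish the remainder scalar-by-scalar.

```python
def copy_with_alignment(src, dst, start, vec_size=8):
    """Copy src into dst; `start` is the element offset of dst[0] in the
    underlying buffer, used to decide where the aligned region begins."""
    n = len(src)
    head = min((-start) % vec_size, n)  # scalar elements before alignment
    for i in range(head):               # scalar prologue
        dst[i] = src[i]
    body_end = head + ((n - head) // vec_size) * vec_size
    for i in range(head, body_end, vec_size):  # aligned "vector" body
        dst[i:i + vec_size] = src[i:i + vec_size]
    for i in range(body_end, n):        # scalar epilogue for the remainder
        dst[i] = src[i]

# Uneven length and misaligned start are both handled.
src = list(range(13))
dst = [0] * 13
copy_with_alignment(src, dst, start=3, vec_size=4)
assert dst == src
```

The scalar prologue/epilogue around a wide-load body is why no head_size or element count needs to be a multiple of the vector width, which is the property the CI coverage exercises.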

@mgoin mgoin merged commit eefbf4a into vllm-project:main Aug 1, 2025
106 of 108 checks passed
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 deleted the wye-optimize-reshape-and-cache-flash branch September 25, 2025 19:34