
Conversation

@yewentao256 (Member) commented Jul 31, 2025

Purpose

Use the vectorization utilities in the reshape_and_cache_flash CUDA kernel to improve its performance.
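For context, the kernel's job can be modeled in a few lines of NumPy. This is an illustrative sketch, not vLLM's CUDA implementation: reshape_and_cache_flash scatters each new token's key/value vectors into a paged KV cache at the flat slot given by slot_mapping.

```python
import numpy as np

# Small illustrative sizes (hypothetical, not vLLM defaults).
num_tokens, num_heads, head_size = 4, 2, 8
num_blocks, block_size = 3, 2  # cache holds num_blocks * block_size slots

key = np.arange(num_tokens * num_heads * head_size, dtype=np.float32)
key = key.reshape(num_tokens, num_heads, head_size)

# NHD cache layout: [num_blocks, block_size, num_heads, head_size]
key_cache = np.zeros((num_blocks, block_size, num_heads, head_size), np.float32)

slot_mapping = np.array([5, 0, 3, 2])  # one flat cache slot per token
for token, slot in enumerate(slot_mapping):
    block_idx, block_off = divmod(int(slot), block_size)
    key_cache[block_idx, block_off] = key[token]  # copy one token's K slab

# Token 0 was assigned slot 5, i.e. block 2, offset 1.
assert np.array_equal(key_cache[2, 1], key[0])
```

The real kernel performs this per-token slab copy on the GPU; the optimization in this PR is to issue that copy with vectorized (multi-element) loads and stores instead of scalar ones.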

Test

Acc

lm_eval   --model vllm   --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True"   --trust_remote_code   --tasks gsm8k   --num_fewshot 5   --batch_size auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8173|±  |0.0106|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|
# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8173|±  |0.0106|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|
pytest test_cache.py -x
==================== test session starts ====================
platform linux -- Python 3.12.3, pytest-8.4.0, pluggy-1.6.0
rootdir: /home/wentao/vllm-source
configfile: pyproject.toml
plugins: asyncio-1.0.0, anyio-4.9.0
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1102 items                                        

test_cache.py ....................................... [  3%]
...................................s...s...s...s...s. [  8%]
..s...s...s...s...s...s...s...s...s...s...s...s...s.. [ 13%]
..................................................... [ 17%]
....................s...s...s...s...s...s...s...s...s [ 22%]
...s...s...s...s...s...s...s...s...s................. [ 27%]
..................................................... [ 32%]
..................................................... [ 37%]
..................................................... [ 42%]
..................................................... [ 46%]
..................................................... [ 51%]
..................................................... [ 56%]
..................................................... [ 61%]
..................................................... [ 66%]
..................................................... [ 70%]
...........s.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.s [ 75%]
ssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss. [ 80%]
ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss [ 85%]
.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.sssss.ss.ss.ssss [ 90%]
s.ss.ss.sssss.s...................................... [ 94%]
..................................................... [ 99%]
sss                                                   [100%]

======= 901 passed, 201 skipped in 349.21s (0:05:49) ========

Performance

python benchmark_reshape_and_cache_flash.py

| num_tokens | layout | Old Run (µs) | New Run (µs) | Change (%) |
|-----------:|--------|-------------:|-------------:|-----------:|
|          2 | HND    |       10.326 |        8.323 | -19.4% 🚀 |
|          4 | HND    |       10.440 |        8.355 | -20.0% 🚀 |
|          8 | HND    |       10.356 |        8.344 | -19.4% 🚀 |
|         16 | HND    |       10.330 |        8.372 | -19.0% 🚀 |
|         32 | HND    |       10.345 |        8.348 | -19.3% 🚀 |
|         64 | HND    |       10.454 |        8.354 | -20.1% 🚀 |
|        128 | HND    |       10.397 |        8.370 | -19.5% 🚀 |
|        256 | HND    |       14.431 |       10.375 | -28.1% 🚀 |
|        512 | HND    |       24.809 |       20.137 | -18.8% 🚀 |
|       1024 | HND    |       51.389 |       45.196 | -12.1% 🚀 |
|       2048 | HND    |       96.466 |       77.908 | -19.2% 🚀 |
|       4096 | HND    |      175.695 |      147.068 | -16.3% 🚀 |
|       8192 | HND    |      336.814 |      279.106 | -17.1% 🚀 |
|      16384 | HND    |      668.001 |      547.169 | -18.1% 🚀 |
|      32768 | HND    |     1320.570 |     1082.070 | -18.1% 🚀 |
|      65536 | HND    |     2605.930 |     2149.950 | -17.5% 🚀 |
|          2 | NHD    |       10.371 |        6.649 | -35.9% 🚀 |
|          4 | NHD    |       10.337 |        6.407 | -38.0% 🚀 |
|          8 | NHD    |       10.346 |        6.338 | -38.7% 🚀 |
|         16 | NHD    |       10.352 |        6.394 | -38.2% 🚀 |
|         32 | NHD    |       10.350 |        7.416 | -28.3% 🚀 |
|         64 | NHD    |       10.341 |        7.305 | -29.4% 🚀 |
|        128 | NHD    |       10.349 |        7.614 | -26.4% 🚀 |
|        256 | NHD    |       14.401 |       10.363 | -28.0% 🚀 |
|        512 | NHD    |       25.955 |       15.084 | -41.9% 🚀 |
|       1024 | NHD    |       49.264 |       30.690 | -37.7% 🚀 |
|       2048 | NHD    |       93.674 |       53.726 | -42.6% 🚀 |
|       4096 | NHD    |      172.364 |      101.030 | -41.4% 🚀 |
|       8192 | NHD    |      333.329 |      195.911 | -41.2% 🚀 |
|      16384 | NHD    |      665.351 |      385.012 | -42.1% 🚀 |
|      32768 | NHD    |     1308.720 |      762.607 | -41.7% 🚀 |
|      65536 | NHD    |     2587.800 |     1519.310 | -41.3% 🚀 |

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@mergify mergify bot added the performance Performance-related issues label Jul 31, 2025
@gemini-code-assist (bot) left a comment

Code Review

This pull request optimizes the reshape_and_cache_flash CUDA kernel by using vectorization, which results in significant performance improvements. The changes look good, but there is a critical correctness issue. The new implementation assumes a contiguous memory layout for the (num_heads, head_size) dimensions in the KV cache, which is only true for the NHD layout. This breaks support for the HND layout, which is also a supported configuration. I've provided a detailed comment with a suggested fix to address this.
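The layout concern above can be demonstrated with a small NumPy sketch (hypothetical sizes; not the kernel code): for one token's (num_heads, head_size) slab, the NHD cache layout is a single dense range, while in the HND layout the heads are block_size rows apart, so a single flat vectorized copy is only safe for NHD.

```python
import numpy as np

num_blocks, block_size, num_heads, head_size = 2, 4, 8, 64

# NHD: [num_blocks, block_size, num_heads, head_size]
nhd = np.zeros((num_blocks, block_size, num_heads, head_size), np.float16)
# HND: [num_blocks, num_heads, block_size, head_size]
hnd = np.zeros((num_blocks, num_heads, block_size, head_size), np.float16)

token_nhd = nhd[0, 1]        # one token's K slab, shape (num_heads, head_size)
token_hnd = hnd[0, :, 1, :]  # same logical slab, but a strided view

assert token_nhd.flags["C_CONTIGUOUS"]      # NHD: one dense memory range
assert not token_hnd.flags["C_CONTIGUOUS"]  # HND: heads separated by stride
```

A vectorized kernel therefore has to either restrict the flat-copy path to NHD or compute per-head offsets from the cache strides, which is the gist of the suggested fix.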

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@robertgshaw2-redhat (Collaborator)

wow, nice work

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 1, 2025
@mgoin (Member) left a comment

LGTM, vectorize_with_alignment should deal with uneven shapes and existing CI should cover this. I'll make sure to unblock a full run just in case
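How an alignment-aware helper copes with uneven shapes can be sketched in plain Python (an assumed model of the idea, not vLLM's vectorize_with_alignment itself): copy scalar elements until the pointer is aligned, copy the bulk in fixed-width "vector" chunks, then finish the remainder scalar-by-scalar.

```python
def copy_with_alignment(src, dst, start, vec_size=8):
    """Copy src into dst; `start` is the element offset of dst[0] in the
    underlying buffer, used to decide where the aligned region begins."""
    n = len(src)
    head = min((-start) % vec_size, n)  # scalar elements before alignment
    for i in range(head):               # scalar prologue
        dst[i] = src[i]
    body_end = head + ((n - head) // vec_size) * vec_size
    for i in range(head, body_end, vec_size):  # aligned "vector" body
        dst[i:i + vec_size] = src[i:i + vec_size]
    for i in range(body_end, n):        # scalar epilogue for the remainder
        dst[i] = src[i]

# Uneven length and misaligned start are both handled.
src = list(range(13))
dst = [0] * 13
copy_with_alignment(src, dst, start=3, vec_size=4)
assert dst == src
```

The scalar prologue/epilogue around a wide-load body is why no head_size or element count needs to be a multiple of the vector width, which is the property the CI coverage exercises.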

@mgoin mgoin merged commit eefbf4a into vllm-project:main Aug 1, 2025
106 of 108 checks passed
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Noam Gat <noamgat@gmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…2036)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 deleted the wye-optimize-reshape-and-cache-flash branch September 25, 2025 19:34