Allocate kv_cache with stride order #16605

wenscarl · 2025-04-14T16:37:14Z

Allow the KV cache manager to allocate the cache with a stride order supplied by the attention backend. This mainly affects the FlashInfer backend. Ref. #8200
@tlrmchlsmth @LucasWilkinson
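As a rough illustration of what "stride order" means here, the sketch below computes the element strides of a logical cache shape when memory is laid out in a permuted dimension order. The helper name `strides_for_order`, the example shape, and the dimension names in the comments are assumptions made for illustration; this is not the code added by the PR.

```python
def strides_for_order(shape, stride_order):
    """Return per-dimension element strides for a logical `shape` whose
    memory is laid out with dimensions in `stride_order` (outermost first).

    Hypothetical helper for illustration only.
    """
    assert sorted(stride_order) == list(range(len(shape)))
    strides = [0] * len(shape)
    step = 1
    # Walk from the innermost physical dimension outwards.
    for dim in reversed(stride_order):
        strides[dim] = step
        step *= shape[dim]
    return tuple(strides)


# Assumed logical shape: (2, num_blocks, block_size, num_kv_heads, head_size)
logical_shape = (2, 4, 16, 8, 64)

# Identity order: tokens within a block are outer, heads are inner (NHD-like).
nhd = strides_for_order(logical_shape, (0, 1, 2, 3, 4))

# Swapped order: heads are outer, tokens within a block are inner (HND-like).
hnd = strides_for_order(logical_shape, (0, 1, 3, 2, 4))

print(nhd)
print(hnd)
```

The logical shape seen by callers is the same in both cases; only the strides differ, which is why a consumer that assumes a contiguous layout has to be made stride-aware.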

github-actions · 2025-04-14T16:37:23Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

LucasWilkinson

Thanks for doing this! This is looking much better. I left a few nits; please also fix the pre-commit failures.

vllm/utils.py

vllm/worker/cache_engine.py

csrc/cache_kernels.cu

vllm/worker/cache_engine.py

LucasWilkinson

LGTM, thanks for the contribution!

mergify · 2025-04-23T14:42:26Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wenscarl.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

DarkLight1337 · 2025-04-24T14:51:12Z

Can you merge from main to fix docker build?

Signed-off-by: shuw <shuw@nvidia.com>

wenscarl · 2025-04-24T15:39:10Z

> Can you merge from main to fix docker build?

Done.

DarkLight1337 · 2025-04-25T03:13:14Z

Please check whether the failing kernels test is related to this PR

wenscarl · 2025-04-25T14:49:16Z

> Please check whether the failing kernels test is related to this PR

Only the test_cache.py failure is related to this PR; I fixed it by reducing the tensor size to avoid OOM. The spec-decoding test failures should be unrelated.

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

wenscarl requested review from WoosukKwon, alexm-redhat, comaniac, njhill, tlrmchlsmth, youkaichao and zhuohan123 as code owners April 14, 2025 16:37

LucasWilkinson reviewed Apr 15, 2025

vllm/utils.py (outdated, resolved)

vllm/worker/cache_engine.py (outdated, resolved)

tlrmchlsmth reviewed Apr 15, 2025

csrc/cache_kernels.cu (outdated, resolved)

csrc/cache_kernels.cu (outdated, resolved)

wenscarl force-pushed the cache_alloc_w_stride branch 3 times, most recently from 7cd508d to 8a6c7fe Compare April 16, 2025 14:27

wenscarl requested review from LucasWilkinson and tlrmchlsmth April 16, 2025 14:29

LucasWilkinson reviewed Apr 16, 2025

vllm/worker/cache_engine.py (outdated, resolved)

wenscarl requested a review from LucasWilkinson April 16, 2025 14:58

LucasWilkinson approved these changes Apr 16, 2025

wenscarl force-pushed the cache_alloc_w_stride branch from c13e5d6 to 343378f Compare April 22, 2025 15:02

simon-mo added the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 22, 2025

simon-mo enabled auto-merge (squash) April 22, 2025 16:42

mergify bot added the needs-rebase label Apr 23, 2025

auto-merge was automatically disabled April 23, 2025 14:58
Head branch was pushed to by a user without write access

wenscarl force-pushed the cache_alloc_w_stride branch from 18d6e6e to aba3c45 Compare April 23, 2025 15:06

mergify bot removed the needs-rebase label Apr 23, 2025

Reshape_and_cache_flash kernel to be kv-cache layout aware.

33d5969

Signed-off-by: shuw <shuw@nvidia.com>
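To make "layout aware" concrete, here is a minimal sketch of how a cache write could locate its destination from explicit strides instead of assuming one fixed contiguous layout. The function name `slot_offset`, its parameter names, and the example sizes are illustrative assumptions, not the kernel's actual interface.

```python
def slot_offset(slot, head, block_size, block_stride, page_stride, head_stride):
    """Flat element offset of (slot, head, 0) in a paged cache.

    The strides are passed in by the caller rather than derived from a
    hard-coded layout, so the same indexing works for either layout.
    Hypothetical helper for illustration only.
    """
    block_idx, block_off = divmod(slot, block_size)
    return (block_idx * block_stride
            + block_off * page_stride
            + head * head_stride)


# Assumed sizes for the example.
block_size, num_heads, head_size = 16, 8, 64
block_stride = block_size * num_heads * head_size  # elements per block

# Tokens outer, heads inner: consecutive tokens are num_heads * head_size apart.
nhd_off = slot_offset(35, 3, block_size, block_stride,
                      page_stride=num_heads * head_size,
                      head_stride=head_size)

# Heads outer, tokens inner: consecutive tokens are head_size apart.
hnd_off = slot_offset(35, 3, block_size, block_stride,
                      page_stride=head_size,
                      head_stride=block_size * head_size)

print(nhd_off, hnd_off)
```

The same (slot, head) pair lands at a different flat offset in each layout, so a write path that hard-codes one set of strides would corrupt a cache allocated with the other.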

wenscarl force-pushed the cache_alloc_w_stride branch from aba3c45 to 33d5969 Compare April 24, 2025 15:38

Reduce test size

a48be76

Signed-off-by: shuw <shuw@nvidia.com>

wenscarl force-pushed the cache_alloc_w_stride branch from 7dbc7ef to a48be76 Compare April 25, 2025 14:46

LucasWilkinson enabled auto-merge (squash) April 25, 2025 18:05

vllm-bot merged commit 9e96f56 into vllm-project:main Apr 26, 2025
71 of 73 checks passed

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025

Allocate kv_cache with stride order (vllm-project#16605)

9babeeb

Signed-off-by: shuw <shuw@nvidia.com>

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

Allocate kv_cache with stride order (vllm-project#16605)

cd76bd1

Signed-off-by: shuw <shuw@nvidia.com>

adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025

Allocate kv_cache with stride order (vllm-project#16605)

a85d450

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Allocate kv_cache with stride order (vllm-project#16605)

3fc39f0

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

ckhordiasma mentioned this pull request May 14, 2025

nm vllm ent 0.8.5 sync red-hat-data-services/vllm#139

Merged

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

Allocate kv_cache with stride order (vllm-project#16605)

726e0c3

Signed-off-by: shuw <shuw@nvidia.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

NickLucche mentioned this pull request May 27, 2025

[V1] Allocate kv_cache with stride order for V1 #18775

Merged

gronsti-amd mentioned this pull request May 28, 2025

[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector #18834

Open

NickLucche mentioned this pull request Jun 6, 2025

[V1][Kernel] Flashinfer HND KV cache layout #19280

Merged

67lc mentioned this pull request Jun 16, 2025

[Bug]:MooncakeStoreConnector with IndexError('Dimension out of range (expected to be in range of [-2, 1], but got 2)') ,because the new cache_kernels.cu changed. #19683

Closed

6 participants
