[Kernel] cuda kernels for upcoming decode context parallel feature #23791
Conversation
Code Review
This pull request introduces new CUDA kernels, cp_fused_concat_and_cache_mla and cp_gather_cache, to support an upcoming context parallel feature. The changes include the kernel implementations, their PyTorch bindings, and corresponding Python wrappers. Overall, the kernel implementations appear correct and follow existing patterns in the codebase. My main feedback is on the testing coverage. The new cp_fused_concat_and_cache_mla kernel is missing a unit test, and the test for cp_gather_cache is incomplete as it doesn't cover the key functionality it introduces. Adding comprehensive tests is crucial for ensuring the correctness and maintainability of these new kernels.
```python
def test_cp_gather_cache_mla(kv_lora_rank, qk_rope_head_dim, block_size,
                             num_blocks, max_seq_len, batch_size, dtype,
                             kv_cache_dtype, device):
```
The test for cp_gather_cache is incomplete. The main purpose of this new kernel is to support arbitrary seq_starts, but the test only covers the case where seq_starts is None. Please add test cases that use non-zero seq_starts to validate this new functionality.
Additionally, the test only uses batch_size=8. The kernel has different logic for num_splits based on batch_size. It would be beneficial to test with a wider range of batch sizes to cover all branches, for example [8, 70, 130].
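Something along these lines could cover both points. The test name, tensor shapes, the plain-PyTorch reference loop, and especially the seq_starts keyword are assumptions about the new wrapper's interface, not the PR's actual code:

```python
# Sketch of the suggested coverage; the seq_starts kwarg is assumed from the PR
# description -- adjust names and shapes to the actual Python wrapper.
import pytest
import torch
from vllm import _custom_ops as ops


@pytest.mark.parametrize("batch_size", [8, 70, 130])  # exercise all num_splits branches
def test_cp_gather_cache_with_seq_starts(batch_size):
    kv_lora_rank, qk_rope_head_dim, block_size = 512, 64, 16
    entry_size = kv_lora_rank + qk_rope_head_dim
    num_blocks, max_seq_len, device = 128, 256, "cuda"

    src_cache = torch.randn(num_blocks, block_size, entry_size, device=device)
    block_table = torch.randint(0, num_blocks,
                                (batch_size, max_seq_len // block_size),
                                dtype=torch.int32, device=device)
    # Leave room for a non-zero start offset inside each block table row.
    seq_lens = torch.randint(1, max_seq_len - 2 * block_size,
                             (batch_size,), device=device)
    cu_seq_lens = torch.zeros(batch_size + 1, dtype=torch.int32, device=device)
    cu_seq_lens[1:] = seq_lens.cumsum(0)
    seq_starts = torch.randint(0, 2 * block_size, (batch_size,),
                               dtype=torch.int32, device=device)
    dst = torch.zeros(int(cu_seq_lens[-1]), entry_size,
                      dtype=src_cache.dtype, device=device)

    # Plain-PyTorch reference: gather each sequence's tokens, starting at
    # seq_starts[i] instead of token 0.
    expected = torch.empty_like(dst)
    for i in range(batch_size):
        start, length = int(seq_starts[i]), int(seq_lens[i])
        rows = []
        for tok in range(start, start + length):
            blk = int(block_table[i, tok // block_size])
            rows.append(src_cache[blk, tok % block_size])
        expected[int(cu_seq_lens[i]):int(cu_seq_lens[i + 1])] = torch.stack(rows)

    ops.cp_gather_cache(src_cache, dst, block_table, cu_seq_lens,
                        batch_size, seq_starts=seq_starts)
    torch.testing.assert_close(dst, expected)
```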
```python
ops.cp_gather_cache(src_cache, dst, block_table, cu_seq_lens, batch_size)
torch.testing.assert_close(dst, expected)
```
LGTM since this only adds two new kernels. cc @WoosukKwon @LucasWilkinson if you have more comments.
kernel tests passed, failed tests are unrelated. merging.
Not really. The AMD build failure is not unrelated.
I am encountering an issue that is very likely caused by this PR when building on AMD MI300X. If I switch back to commit c07a733 (2 commits ahead), the error disappears.
Pre-PR for #1367
Suggestion from @youkaichao: to speed up review and merging (especially CI testing), we can split the kernel-side changes into a separate PR and get that merged first.