[WIP][Attention] Sharded kv-cache for MLA #22789
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> wip Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request introduces support for a sharded KV cache in Multi-head Latent Attention (MLA) to enable tensor parallelism over the cache. The changes are extensive, spanning CUDA kernels, Python-level attention logic, and configuration. The core idea is to shard the KV-cache blocks across tensor-parallel ranks and use all-gather/all-to-all communication patterns to compute the full attention output.
While the overall approach seems sound, my review identified several critical issues in the implementation, particularly within the new sharding logic in vllm/v1/attention/backends/mla/common.py. These include incorrect operator precedence, flawed mathematical formulas for calculating sequence lengths, Python syntax errors, and incorrect tensor reshaping that would lead to runtime errors or incorrect behavior. Given the WIP nature of this PR and the author's note about AI-generated code, these findings are not unexpected. Addressing these issues is crucial for the feature's correctness and functionality.
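For orientation, here is a minimal sketch of the block-ownership convention the diff appears to follow: the `decode_block_table // self.tp_size` conversion further down implies round-robin striping of logical block ids across ranks. The modulo-based owner test is an assumption, not something shown in the hunks.

```python
def block_owner(block_id: int, tp_size: int) -> int:
    # Assumed convention: logical KV-cache block b lives on TP rank b % tp_size.
    return block_id % tp_size

def local_block_index(block_id: int, tp_size: int) -> int:
    # Index of that block inside the owning rank's (tp_size-times smaller) local
    # cache, matching the `decode_block_table // self.tp_size` conversion below.
    return block_id // tp_size
```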
# Per-request used blocks and mask across existing table width
blocks_per_req = (context_lens +
                  (B - 1)) // B  # [num_prefills]
max_blocks_per_req = (+B - 1) // B
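As a quick sanity check of the ceil-division pattern used here (illustrative values; `B` is the KV-cache block size):

```python
import torch

B = 16
context_lens = torch.tensor([1, 16, 17, 48])
blocks_per_req = (context_lens + (B - 1)) // B
# tensor([1, 1, 2, 3]) -- i.e. ceil(context_len / B) blocks in use per request
```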
last_page_len = torch.where(
    context_lens_cpu > 0 & partial_last & last_owned,
    context_lens_cpu % B + 1, B)
There are two issues in this block:

- The condition `context_lens_cpu > 0 & partial_last & last_owned` is likely incorrect due to operator precedence: the bitwise AND `&` binds more tightly than `>`, so the expression parses as `context_lens_cpu > (0 & partial_last & last_owned)`. Wrap the comparison in parentheses: `(context_lens_cpu > 0) & ...`.
- The formula `context_lens_cpu % B + 1` for `last_page_len` is incorrect. For a context length that is a multiple of `B` it returns 1 instead of `B`, and in all other cases it returns one more than the actual length. A correct formula for the length of the last page is `(context_lens_cpu - 1) % B + 1`.
Suggested change:

last_page_len = torch.where(
    context_lens_cpu > 0 & partial_last & last_owned,
    context_lens_cpu % B + 1, B)

becomes

last_page_len = torch.where(
    (context_lens_cpu > 0) & partial_last & last_owned,
    (context_lens_cpu - 1) % B + 1, B)
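A small self-contained check of both points (values are made up; `B`, `context_lens_cpu`, `partial_last`, and `last_owned` are the names from the diff):

```python
import torch

B = 16
context_lens_cpu = torch.tensor([0, 1, 16, 17, 32])
partial_last = torch.tensor([False, True, False, True, False])
last_owned = torch.ones(5, dtype=torch.bool)

# Without parentheses, `context_lens_cpu > 0 & partial_last & last_owned`
# parses as `context_lens_cpu > (0 & partial_last & last_owned)`, whose
# right-hand side is all zeros -- the two flags are silently ignored.
cond = (context_lens_cpu > 0) & partial_last & last_owned

last_page_len = torch.where(cond, (context_lens_cpu - 1) % B + 1, B)
# tensor([16, 1, 16, 1, 16]); with `context_lens_cpu % B + 1` the 17-token
# request would wrongly report 2 tokens in its last page instead of 1.
```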
used_blocks_mask = torch.arange(
    max_blocks_per_req, device=decode_block_table.device
) < blocks_per_req.unsqueeze(1)
decode_block_table = decode_block_table[~used_blocks_mask] = -1
This line is a chained assignment rather than a syntax error: `decode_block_table = decode_block_table[~used_blocks_mask] = -1` first rebinds `decode_block_table` to the integer -1 and then fails with a `TypeError` when it tries to index that integer. You probably intended to modify the tensor in place.
Suggested change:

decode_block_table = decode_block_table[~used_blocks_mask] = -1

becomes

decode_block_table[~used_blocks_mask] = -1
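A minimal reproduction of the failure mode and the intended in-place update (tensor values are made up):

```python
import torch

decode_block_table = torch.arange(8).reshape(2, 4)
used_blocks_mask = torch.tensor([[True, True, False, False],
                                 [True, False, False, False]])

# Chained assignment assigns its targets left to right:
#   decode_block_table = -1                       # the name now holds an int
#   decode_block_table[~used_blocks_mask] = -1    # TypeError: 'int' object
#                                                 # does not support item assignment
# decode_block_table = decode_block_table[~used_blocks_mask] = -1

# Intended: blank out the unused slots in place, keeping the 2D tensor.
decode_block_table[~used_blocks_mask] = -1
# tensor([[ 0,  1, -1, -1],
#         [ 4, -1, -1, -1]])
```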
decode_block_table = decode_block_table[owned]
# Convert to local physical indices
decode_block_table = decode_block_table // self.tp_size
Indexing with a boolean mask owned will flatten the decode_block_table tensor. This is likely unintended, as the block table is expected to be a 2D tensor for subsequent operations. You might want to use torch.where to preserve the tensor's shape, similar to the logic in the prefill path (line 658).
Suggested change:

decode_block_table = decode_block_table[owned]
# Convert to local physical indices
decode_block_table = decode_block_table // self.tp_size

becomes

decode_block_table = torch.where(
    owned, decode_block_table // self.tp_size, -1)
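A quick illustration of why the shape matters (made-up block table and ownership mask; `tp_size = 2`):

```python
import torch

tp_size = 2
decode_block_table = torch.tensor([[0, 1, 2, 3],
                                   [4, 5, 6, 7]])
owned = torch.tensor([[True, False, True, False],
                      [True, False, True, False]])  # made-up ownership mask

flattened = decode_block_table[owned]
print(flattened.shape)   # torch.Size([4]) -- per-request rows are lost

kept = torch.where(owned, decode_block_table // tp_size, -1)
print(kept)
# tensor([[ 0, -1,  1, -1],
#         [ 2, -1,  3, -1]])  -- still [num_decodes, max_blocks]
```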
parts = context_output_local.view(B, self.num_heads, D) \
    .view(B, -1, heads_owned, D).movedim(1, 0)
self.num_heads refers to the number of local attention heads, but context_output_local at this point contains the output for all heads (global) because it was computed with q_all. You should use self.num_global_heads here to correctly reshape the tensor.
Suggested change:

parts = context_output_local.view(B, self.num_heads, D) \
    .view(B, -1, heads_owned, D).movedim(1, 0)

becomes

parts = context_output_local.view(B, self.num_global_heads, D) \
    .view(B, -1, heads_owned, D).movedim(1, 0)
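A shape-only sketch of what the reshape does (`num_global_heads` and `heads_owned` follow the review comment; all values are made up):

```python
import torch

B, D = 4, 8                 # tokens in this batch chunk, head dim
tp_size = 2
num_global_heads = 16       # total attention heads across all TP ranks
heads_owned = num_global_heads // tp_size

# The output was computed with q_all, so it covers *all* heads; the first view
# must therefore use the global head count or the element count won't match.
context_output_local = torch.randn(B * num_global_heads * D)

parts = (context_output_local.view(B, num_global_heads, D)
         .view(B, -1, heads_owned, D)   # split heads into per-rank chunks
         .movedim(1, 0))                # -> [tp_size, B, heads_owned, D]
assert parts.shape == (tp_size, B, heads_owned, D)
```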
This pull request has merge conflicts that must be resolved before it can be merged.
superseded by: #23734 |
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Purpose
NOTE: this is partially AI-generated, so the code is not yet working and is pretty ugly.
See number 3 in: https://docs.google.com/document/d/1L4MmOA3JnVlahjZq5CsQhB0d3G5SayuSR5Y24rKdGpU/edit?usp=sharing
Test Plan
Test Result
(Optional) Documentation Update