Separate MLAAttention class from Attention #25103

therealnaveenkamal · 2025-09-17T22:06:09Z

Purpose

This PR implements the first step of #24620 by separating Multi-Head Latent Attention into its own dedicated AttentionLayerBase subclass.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

github-actions · 2025-09-17T22:06:20Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of the fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request refactors the Multi-Head Latent Attention (MLA) logic out of the generic Attention class and into a new, dedicated MLAAttention class. This is a good step towards better code organization and separation of concerns. The changes in vllm/attention/layer.py and vllm/model_executor/layers/mla.py correctly remove the old MLA logic and adopt the new class. However, the new MLAAttention class in vllm/model_executor/layers/mla_attention.py has critical implementation issues. It fails to properly instantiate and call the attention backend, and it lacks the necessary integration with the KV cache and attention metadata management. These issues will prevent the MLA feature from functioning. I've left detailed comments on how to address these critical problems.

vllm/model_executor/layers/mla_attention.py

ProExpertProg

A few minor notes

vllm/model_executor/layers/mla.py

vllm/model_executor/layers/mla_attention.py

vllm/model_executor/layers/mla.py

therealnaveenkamal · 2025-09-20T00:04:08Z

@ProExpertProg I'm working on the unified_mla_attention ops. How would you like them structured? Any input would be helpful.

ProExpertProg · 2025-09-20T00:08:02Z

Yeah, to start they can just mimic the unified_attention and unified_attention_with_output ops. Also, please keep the existing MLAAttentionWrapper as is and make the new MLAAttention layer the same in scope as Attention (no rope, no o_proj, etc.).
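A minimal sketch of the pattern suggested here, assuming the existing unified attention ops work by looking up a layer by name in a per-forward context and delegating to it. Every name below (`FORWARD_CONTEXT`, `ToyMLALayer`, the argument names `q`, `kv_c`, `k_pe`) is a hypothetical stand-in rather than vLLM's actual code, and plain lists stand in for tensors.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a per-forward-pass registry of layers, keyed by name.
FORWARD_CONTEXT: dict[str, "ToyMLALayer"] = {}


@dataclass
class ToyMLALayer:
    """Toy layer: 'attention' here just scales the query (illustrative only)."""
    name: str
    scale: float
    calls: list = field(default_factory=list)

    def forward_impl(self, q, kv_c, k_pe, output=None):
        self.calls.append("forward_impl")
        result = [self.scale * x for x in q]
        if output is None:
            return result
        output[:] = result  # write in place when a buffer is supplied
        return output


def unified_mla_attention(q, kv_c, k_pe, layer_name):
    """Functional variant: look up the layer by name and return a new output."""
    layer = FORWARD_CONTEXT[layer_name]
    return layer.forward_impl(q, kv_c, k_pe)


def unified_mla_attention_with_output(q, kv_c, k_pe, output, layer_name):
    """In-place variant: fill a caller-provided buffer and return nothing."""
    layer = FORWARD_CONTEXT[layer_name]
    layer.forward_impl(q, kv_c, k_pe, output=output)


layer = ToyMLALayer(name="model.layers.0.attn", scale=0.5)
FORWARD_CONTEXT[layer.name] = layer

out = unified_mla_attention([2.0, 4.0], [], [], layer.name)
buf = [0.0, 0.0]
unified_mla_attention_with_output([2.0, 4.0], [], [], buf, layer.name)
```

Passing only a layer name keeps the op signature free of Python objects, which is presumably why the MLA ops can mirror the existing ones so closely.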

therealnaveenkamal · 2025-09-20T02:14:13Z

Hi @ProExpertProg, thanks for the feedback.

I've added the unified_mla_attention and unified_mla_attention_with_output ops, which mimic the existing unified attention ops.

The MLAAttention layer has been created in mla.py. It is scoped similarly to the base Attention layer and does not handle projections or rotary embeddings.

The MultiHeadLatentAttentionWrapper uses the new MLAAttention layer to handle the core attention logic.

Let me know what you think. Thanks
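The split described above can be illustrated with a toy example: a wrapper owns the pre- and post-processing, while the core layer only computes attention. This is a sketch under stated assumptions, not the PR's code. `ToyCoreAttention` and `ToyWrapper` are hypothetical names, and simple scaling stands in for the projections and rotary embeddings.

```python
import math


class ToyCoreAttention:
    """Core layer: attention only, no projections or rotary embeddings."""

    def forward(self, q, keys, values):
        # Single-query softmax attention over a list of key/value vectors.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        return [
            sum((w / total) * v[d] for w, v in zip(weights, values))
            for d in range(dim)
        ]


class ToyWrapper:
    """Wrapper: handles input/output transforms and delegates to the core."""

    def __init__(self, core, in_scale, out_scale):
        self.core = core
        self.in_scale = in_scale    # stand-in for the q projection and rope
        self.out_scale = out_scale  # stand-in for o_proj

    def forward(self, hidden, keys, values):
        q = [self.in_scale * h for h in hidden]
        attn = self.core.forward(q, keys, values)
        return [self.out_scale * a for a in attn]


wrapper = ToyWrapper(ToyCoreAttention(), in_scale=2.0, out_scale=10.0)
result = wrapper.forward(
    hidden=[1.0, 0.0],
    keys=[[1.0, 0.0], [1.0, 0.0]],       # identical keys give uniform weights
    values=[[1.0, 3.0], [3.0, 5.0]],
)
```

With identical keys the attention weights are uniform, so the core returns the mean of the values and the wrapper scales it.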

vllm/attention/layer.py

vllm/model_executor/layers/mla.py

mergify · 2025-09-23T22:52:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @therealnaveenkamal.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

therealnaveenkamal · 2025-09-24T04:18:01Z

@ProExpertProg I've resolved all the comments. Please let me know if I need to make any further changes.

ProExpertProg · 2025-09-24T04:26:09Z

Can you fix pre-commit, please?

ProExpertProg

Just one remaining nit

ProExpertProg · 2025-10-08T19:21:03Z

vllm/model_executor/model_loader/utils.py

+    # Initialize post-load attention weights for both Attention and MLA.
+    # NOTE: Happens after other modules so we can easily decompress weights.
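One way to read the NOTE above, as a sketch: weight post-processing runs in two passes, with attention layers handled last so that they see weights other modules have already finalized (for example, decompressed). The class and function names below are hypothetical and only illustrate the ordering; they are not the code in vllm/model_executor/model_loader/utils.py.

```python
class ToyLinear:
    """Stand-in for a module whose weights are stored compressed."""

    def __init__(self, packed):
        self.packed = packed
        self.weight = None

    def process_weights_after_loading(self):
        # "Decompress": expand (value, repeat) pairs into a flat list.
        self.weight = [v for v, n in self.packed for _ in range(n)]


class ToyAttention:
    """Stand-in for an attention layer that derives state from another module."""

    def __init__(self, proj):
        self.proj = proj
        self.absorbed = None

    def process_weights_after_loading(self):
        # Requires proj.weight to be decompressed already.
        self.absorbed = sum(self.proj.weight)


def process_weights_after_loading(modules):
    # Pass 1: every non-attention module.
    for m in modules:
        if not isinstance(m, ToyAttention):
            m.process_weights_after_loading()
    # Pass 2: attention layers, once everything else is ready.
    for m in modules:
        if isinstance(m, ToyAttention):
            m.process_weights_after_loading()


proj = ToyLinear(packed=[(1.0, 2), (0.5, 4)])
attn = ToyAttention(proj)
# The attention layer is listed first on purpose: ordering comes from the passes.
process_weights_after_loading([attn, proj])
```

A single pass in list order would fail here, because the attention layer would try to sum weights that have not been expanded yet.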


mergify · 2025-10-08T19:22:01Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @therealnaveenkamal.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

…to loader

* 'loader' of https://github.com/dsxsteven/vllm_splitPR: (778 commits)
  [torchao] Add support for ModuleFqnToConfig using regex (vllm-project#26001)
  Add: Support for multiple hidden layers in Eagle3 (vllm-project#26164)
  Enable `RMSNorm` substitution for Transformers backend (vllm-project#26353)
  [Model] Gemma3: Fix GGUF loading and quantization (vllm-project#26189)
  Bump Flashinfer to v0.4.0 (vllm-project#26326)
  Update Dockerfile and install runai-model-streamer[gcs] package (vllm-project#26464)
  [Core] Relax the LoRA max rank (vllm-project#26461)
  [CI/Build] Fix model nightly tests (vllm-project#26466)
  [Hybrid]: Decouple Kernel Block Size from KV Page Size (vllm-project#24486)
  [Core][KVConnector] Propagate all tokens on resumed preemptions (vllm-project#24926)
  [MM][Doc] Add documentation for configurable mm profiling (vllm-project#26200)
  [Hardware][AMD] Enable FlexAttention backend on ROCm (vllm-project#26439)
  [Bugfix] Incorrect another MM data format in vllm bench throughput (vllm-project#26462)
  [Bugfix] Catch and log invalid token ids in detokenizer #2 (vllm-project#26445)
  [Minor] Change warning->warning_once in preprocess (vllm-project#26455)
  [Bugfix] Set the minimum python version for gpt-oss (vllm-project#26392)
  [Misc] Redact ray runtime env before logging (vllm-project#26302)
  Separate MLAAttention class from Attention (vllm-project#25103)
  [Attention] Register FLASHMLA_SPARSE (vllm-project#26441)
  [Kernels] Modular kernel refactor (vllm-project#24812)
  ...

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>

### What this PR does / why we need it?
This is step 1 of refactoring the code to adapt to vllm main. This PR is aligned with vllm-project/vllm@17c540a.

1. Refactor deepseek to the latest code architecture as of vllm-project/vllm@17c540a
2. A batch of fixes due to vllm changes
   - Fix `AscendScheduler` `__post_init__`, caused by vllm-project/vllm#25075
   - Fix `AscendScheduler` init got an unexpected arg `block_size`, caused by vllm-project/vllm#26296
   - Fix `KVCacheManager` `get_num_common_prefix_blocks` arg, caused by vllm-project/vllm#23485
   - Fix `MLAAttention` import, caused by vllm-project/vllm#25103
   - Fix `SharedFusedMoE` import, caused by vllm-project/vllm#26145
   - Fix `LazyLoader` import, caused by vllm-project/vllm#27022
   - Fix `vllm.utils.swap_dict_values` import, caused by vllm-project/vllm#26990
   - Fix `Backend` enum import, caused by vllm-project/vllm#25893
   - Fix `CompilationLevel` renaming to `CompilationMode` issue introduced by vllm-project/vllm#26355
   - Fix fused_moe ops, caused by vllm-project/vllm#24097
   - Fix bert model because of `inputs_embeds`, caused by vllm-project/vllm#25922
   - Fix MRope because of `get_input_positions_tensor` to `get_mrope_input_positions`, caused by vllm-project/vllm#24172
   - Fix `splitting_ops` changes introduced by vllm-project/vllm#25845
   - Fix multi-modality changes introduced by vllm-project/vllm#16229
   - Fix lora bias dropping issue introduced by vllm-project/vllm#25807
   - Fix structured output break introduced by vllm-project/vllm#26737

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Co-authored-by: Icey <1790571317@qq.com>

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

therealnaveenkamal requested a review from LucasWilkinson as a code owner September 17, 2025 22:06

therealnaveenkamal changed the title ~~Separate MLAAttention class from, Attention (needs Review)~~ Separate MLAAttention class from Attention (needs Review) Sep 17, 2025

gemini-code-assist bot reviewed Sep 17, 2025

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

ProExpertProg reviewed Sep 18, 2025

vllm/model_executor/layers/mla.py (resolved)

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

MatthewBonanni reviewed Sep 18, 2025

vllm/model_executor/layers/mla_attention.py (outdated, resolved)

LucasWilkinson reviewed Sep 18, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

mergify bot added the deepseek label (Related to DeepSeek models) Sep 19, 2025

therealnaveenkamal requested review from WoosukKwon, hmellor, houseroad, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth, yewentao256 and youkaichao as code owners September 20, 2025 02:03

ProExpertProg reviewed Sep 22, 2025

vllm/attention/layer.py (outdated, resolved)

ProExpertProg reviewed Sep 22, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

ProExpertProg reviewed Sep 22, 2025

vllm/model_executor/layers/mla.py (outdated, resolved)

mergify bot added the needs-rebase label Sep 23, 2025

therealnaveenkamal force-pushed the mla_attn branch from a19c360 to cfe6f46 Compare September 24, 2025 04:15

mergify bot removed the needs-rebase label Sep 24, 2025

therealnaveenkamal changed the title ~~Separate MLAAttention class from Attention (needs Review)~~ Separate MLAAttention class from Attention Sep 24, 2025

Merge branch 'main' into mla_attn

dca6734

ProExpertProg added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 8, 2025

ProExpertProg approved these changes Oct 8, 2025

mergify bot added the needs-rebase label Oct 8, 2025

Merge branch 'main' into mla_attn

2ddf547

Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

mergify bot removed the needs-rebase label Oct 8, 2025

Remove unnecessary blank line in layer.py

b52ac89

Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

ProExpertProg enabled auto-merge (squash) October 8, 2025 19:23

simon-mo disabled auto-merge October 9, 2025 00:11

simon-mo merged commit e614ab7 into vllm-project:main Oct 9, 2025
57 of 59 checks passed

This was referenced Oct 10, 2025

[Refactor][MLA]: Independently pass q_nope & q_rope #26567

Open

[Refactor][MLA]: Independently passing q_nope & q_rope #26568

Open

This was referenced Oct 14, 2025

[Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues #26779

Open

[Bug]: DeepSeek V3.2 running with MTP will cause DeepseekV32IndexerMetadata parsing error #26711

Open

This was referenced Oct 16, 2025

[CI] Upgrade vllm to newest commit vllm-project/vllm-ascend#3423

Closed

[CI] Upgrade vllm to 0.11.1 vllm-project/vllm-ascend#3499

Closed

roikoren755 mentioned this pull request Oct 21, 2025

[Bug]: Cache malformation in hybrid models with SSM cache dtype float32 and block allocation wrap around #27264

Open

1 task

MengqingCao mentioned this pull request Oct 22, 2025

[1/N][Refactor] Refactor code to adapt with vllm main vllm-project/vllm-ascend#3612

Merged

MatthewBonanni mentioned this pull request Oct 24, 2025

[Attention] Add missing kv cache scale setup #27490

Merged

5 tasks


7 participants
