[Misc] Refactor get_kv_cache_spec into AttentionLayerBase
#26587
Conversation
    **kwargs,
)

def get_kv_cache_spec(self, vllm_config: VllmConfig) -> Optional[KVCacheSpec]:
Note that we also have EncoderOnlyAttentionSpec. We skip it in get_kv_cache_spec because these layers don't need KV cache, but add it back in may_add_encoder_only_layers_to_kv_cache_config since we still need to build attention metadata for these layers. Can you try to make a better abstraction? (Leaving it as it is now is also fine if there is no better idea.)
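For readers outside the thread, here is a minimal sketch of the two-step pattern described above. The class bodies and the helper implementation are illustrative stand-ins, not the actual vLLM code; only the names follow the comment.

```python
from typing import Optional


class KVCacheSpec:  # stand-in for vLLM's KVCacheSpec
    pass


class EncoderOnlyAttentionSpec(KVCacheSpec):  # stand-in spec for encoder-only layers
    pass


class EncoderOnlyAttention:
    def get_kv_cache_spec(self, vllm_config) -> Optional[KVCacheSpec]:
        # Encoder-only layers need no KV cache, so they report no spec here...
        return None


def may_add_encoder_only_layers_to_kv_cache_config(layers, kv_cache_specs):
    # ...but they are added back later, because attention metadata still has
    # to be built for them.
    for name, layer in layers.items():
        if isinstance(layer, EncoderOnlyAttention):
            kv_cache_specs[name] = EncoderOnlyAttentionSpec()
    return kv_cache_specs
```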
)

ds_indexer_layers = get_layers_from_vllm_config(
    self.vllm_config, DeepseekV32IndexerCache
do you need to implement get_kv_cache_spec for DeepseekV32IndexerCache?
it already had one, just changed its signature
@heheda12345 bridging discussion here
So I think the fact that we need different specs for the worker/scheduler side is a bit of a nuisance here, as I wouldn't want a very generic interface such as the Attention one to be aware of that. Same thing for having interface methods only called from the worker. Taking a step back, can we avoid having different worker<>scheduler specs for encoder-only in the first place?
kv_cache_dtype = kv_cache_dtype_str_to_dtype(
    self.kv_cache_dtype, vllm_config.model_config.dtype
)
return MLAAttentionSpec(
I don't think this is the right spec; the MLA spec should be used for:
Line 567 in 96ad65b:
class MLAAttention(nn.Module, AttentionLayerBase):
I think this layer (MultiHeadAttention) is used by multimodal models, but tbh I'm not exactly sure where it is used.
@LucasWilkinson not sure what's wrong with github preview, this change is actually to the
class MLAAttention(nn.Module, AttentionLayerBase):
MHA is a simple nn.Module, it doesn't even implement the attn interface
like if you go to line 815 you can see it belongs to the mla class
oh sorry ya you are correct, my bad! that's really weird that the GitHub preview didn't show that 🤔
from vllm.model_executor.layers.attention_layer_base import AttentionLayerBase
from vllm.model_executor.model_loader import get_model
from vllm.model_executor.models import supports_multimodal
from vllm.model_executor.models.deepseek_v2 import DeepseekV32IndexerCache
I noticed that here the class is imported directly from vLLM's models, so it is not registered by a plugin. Would it be possible to fetch it from the Model Registry?
nice catch, I think it's outside the scope of the PR but we have to change that too
@heheda12345 @LucasWilkinson gentle ping on this one
Can you please address: https://github.com/vllm-project/vllm/pull/26587/files#r2421239326 Otherwise LGTM
compilation_config.static_forward_context[prefix] = self

-    def get_kv_cache_spec(self) -> KVCacheSpec:
+    def get_kv_cache_spec(self, vllm_config: VllmConfig) -> Optional[KVCacheSpec]:
@heheda12345 deepseek change
@LucasWilkinson uh, somehow I didn't send the response to your comment last week... basically I am just saying MHA wasn't edited; I am not sure why GitHub shows it like that, but only MLAAttention got changed.

This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
I've just found that this PR breaks the behaviour of passing attn_type. If you instantiate Attention with an encoder-only attn_type, the layer no longer gets the encoder-only handling. Is this expected? If yes:
Nice catch! I think we can always use EncoderOnlyAttention and start the deprecation of attn_type. As the Attention class is used by too many people, I prefer to add a deprecation warning first and remove it in a future release.
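A rough sketch of what such a deprecation warning could look like; the constructor signature and the default value here are assumptions for illustration, not the real vLLM Attention API:

```python
import warnings


class Attention:
    def __init__(self, *args, attn_type: str = "decoder", **kwargs):
        if attn_type != "decoder":
            # Hypothetical deprecation path: steer callers towards the
            # dedicated subclasses (e.g. EncoderOnlyAttention) instead of
            # passing attn_type to the generic Attention layer.
            warnings.warn(
                "Passing attn_type to Attention is deprecated; use the "
                "dedicated attention subclass instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        self.attn_type = attn_type
```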
This PR modifies the AttentionLayerBase interface to add a new get_kv_cache_spec method. This allows different attention layers to define their own KV cache spec, making the spec entirely transparent to the Model Runner.
As a consequence, the runner can now limit itself to collecting the specs, without having to handle different attention types and/or model-specific hacks such as the one for the DSv32 Indexer.
It also makes the code much simpler, as all ENCODER, ENCODER_ONLY and ENCODER_DECODER type management is moved to a method-dispatch system.
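For illustration, a simplified sketch of the shape of the new interface, based on the signatures visible in the diffs above. The collect_kv_cache_specs helper is a hypothetical example of how the runner side can consume it, not the exact vLLM code:

```python
from abc import ABC, abstractmethod
from typing import Optional


class AttentionLayerBase(ABC):
    @abstractmethod
    def get_kv_cache_spec(self, vllm_config: "VllmConfig") -> Optional["KVCacheSpec"]:
        """Each layer describes its own KV cache needs; None means no KV cache."""
        ...


def collect_kv_cache_specs(layers: dict, vllm_config: "VllmConfig") -> dict:
    # The runner just gathers whatever each layer reports, with no
    # special-casing of encoder/decoder/MLA/indexer layer types.
    return {
        name: spec
        for name, layer in layers.items()
        if (spec := layer.get_kv_cache_spec(vllm_config)) is not None
    }
```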
cc @heheda12345 who clearly defined the task
PS: this used to be a TODO in code from @LucasWilkinson: https://github.com/vllm-project/vllm/blob/releases/v0.11.0/vllm/v1/worker/gpu_model_runner.py#L4065