[Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues #26779
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI, which starts running only a small and essential subset of CI tests to quickly catch errors. You ask your reviewers to trigger select CI tests on top of `fastcheck` CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: add the `ready` label to the PR, or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly fixes an IndexError in dummy_run that occurs when a drafter does not support CUDA graphs. The change prevents a crash by ensuring self.use_cuda_graph is checked before accessing CUDA graph configurations. I've added a suggestion to further improve the robustness of this fix by also checking if self.cudagraph_batch_sizes is non-empty, which handles an additional edge case and improves code readability by reducing duplication.
vllm/v1/spec_decode/eagle.py (Outdated)

```diff
     ) -> None:
-        if use_cudagraphs and num_tokens <= self.cudagraph_batch_sizes[-1]:
+        # Determine if CUDA graphs should be used for this run.
+        cudagraphs_enabled = (
```
I think it would be easier to just set `self.use_cuda_graph = self.use_cuda_graph and bool(self.cudagraph_batch_sizes)` in `__init__`, since I think they are both unchanging over the lifetime of the object.
What is the scenario where use_cuda_graph is True and cudagraph_batch_sizes is empty? I wonder if this might be a symptom of a deeper issue
Thanks for the suggestion! I applied it and finalized the flag in `__init__`; all runtime gating now checks only this flag.
Potential scenarios include: after configuration initialization, `cudagraph_mode` is overridden to include PIECEWISE mode while the capture-size list remains empty, for example when the model's `enforce_eager` mechanism blocks size generation, when users explicitly configure an empty list, or when all sizes are filtered out.

It seems this situation should not occur at present. The added check mainly ensures no out-of-bounds access can arise, while making the drafter's behavior more consistent and safe. @benchislett
@xiaohajiayou could you check if #26821 solves the issue, or if this additional change is also necessary?
I'd lean toward keeping it. #26821 already handles the case where the drafter forces eager mode and therefore ends up with empty `cudagraph_capture_sizes`. What's left are the cases where `self.use_cuda_graph` still flips to True while `self.cudagraph_batch_sizes` ends up empty, for example:

- someone starts vLLM with `--compilation-config '{"level": 3, "cudagraph_capture_sizes": []}'`;
- the default capture sizes get filtered away by `max_num_batched_tokens`, sequence parallelism, or similar constraints.

In both scenarios, the guard `self.use_cuda_graph &= bool(self.cudagraph_batch_sizes)` automatically disables the drafter's CUDA graphs when the list is empty. The drafter simply falls back to eager mode, which costs the graph speedup but keeps the model running instead of crashing on an index error. That trade-off is much friendlier than a hard failure.
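For reference, a minimal, self-contained sketch of that guard (the class here is a toy stand-in for the drafter; only the flag logic mirrors the PR):

```python
class DrafterSketch:
    """Toy stand-in for the EAGLE drafter, illustrating the guard above."""

    def __init__(self, use_cuda_graph: bool, cudagraph_batch_sizes: list[int]):
        self.cudagraph_batch_sizes = cudagraph_batch_sizes
        # Disable CUDA graphs whenever the capture-size list is empty, so
        # cudagraph_batch_sizes[-1] can never be evaluated on an empty list.
        self.use_cuda_graph = use_cuda_graph and bool(cudagraph_batch_sizes)


# With an empty capture list the drafter silently falls back to eager mode.
assert DrafterSketch(True, []).use_cuda_graph is False
assert DrafterSketch(True, [1, 2, 4]).use_cuda_graph is True
```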
The issue referenced in #26711 is now fixed, with no test issues. Mind reviewing whether we can merge this? @benchislett @luccafong
```diff
         )
         draft_indexer_layer_names = indexer_layers.keys() - target_indexer_layer_names
-        self.attn_layer_names = list(draft_attn_layer_names)
+        self.attn_layer_names = list(draft_attn_layer_names - draft_indexer_layer_names)
```
could you help me understand how attn_layer_names is used, and why draft_indexer_layer_names must be excluded?
Sure! Here’s how the pieces fit together:
- When we build the drafter layers in `vllm/v1/spec_decode/eagle.py` (lines 934 to 940 at `ab3e800`):

  ```python
  draft_attn_layer_names = (
      get_layers_from_vllm_config(self.vllm_config, AttentionLayerBase).keys()
      - target_attn_layer_names
  )
  indexer_layers = get_layers_from_vllm_config(
      self.vllm_config, DeepseekV32IndexerCache
  )
  ```

  we grab every module that inherits `AttentionLayerBase`. DeepSeek's Lightning Indexer (`DeepseekV32IndexerCache`) does that too, so its layer names got lumped into `draft_attn_layer_names`.

- Later, `_get_attention_metadata_builder` looks at the very first name in `self.attn_layer_names`, finds the backend for that layer, and caches its metadata builder (which might be the indexer metadata builder) for the whole set (`vllm/v1/spec_decode/eagle.py`, lines 1078 to 1088 at `ab3e800`):

  ```python
  def _get_attention_metadata_builder(self) -> AttentionMetadataBuilder:
      """Find and return the attention metadata builders for EAGLE layers.

      Returns:
          The metadata builders for EAGLE layers.

      Raises:
          AssertionError: If no metadata builders are found for EAGLE layers.
      """
      builder = None
      chosen_layer = self.attn_layer_names[0]
  ```

- After that we loop over `self.attn_layer_names` and hand that builder's output to every entry (`vllm/v1/spec_decode/eagle.py`, lines 256 to 262 at `ab3e800`):

  ```python
  per_layer_attn_metadata = {}
  for layer_name in self.attn_layer_names:
      per_layer_attn_metadata[layer_name] = attn_metadata
  for layer_name in self.indexer_layer_names:
      assert draft_indexer_metadata is not None
      per_layer_attn_metadata[layer_name] = draft_indexer_metadata
  ```

- The snag is that the Lightning Indexer expects `DeepseekV32IndexerMetadata`, produced by `DeepseekV32IndexerMetadataBuilder`:

  ```python
  class DeepseekV32IndexerMetadataBuilder(AttentionMetadataBuilder):
  ```

  while the standard attention backends expect completely different metadata. If an indexer sneaks into `self.attn_layer_names`, `_get_attention_metadata_builder` can lock onto the indexer backend, and the loop then feeds indexer metadata to the true attention layers while the indexer never reaches its dedicated path.

- So we subtract `draft_indexer_layer_names` when we finalize `self.attn_layer_names`, like:

  ```python
  self.attn_layer_names = list(draft_attn_layer_names - draft_indexer_layer_names)
  ```

  That guarantees the first entry really is an attention layer: `_get_attention_metadata_builder` picks the correct attention backend, the drafter attention layers share that metadata as intended, and the indexer layers stay in `self.indexer_layer_names`, where they go through the metadata builder their backend expects.

Hope that clarifies why `draft_indexer_layer_names` has to be excluded.
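As a toy illustration of that set subtraction (the layer names below are made up for the example; the real ones come from `get_layers_from_vllm_config`):

```python
# Hypothetical layer names, purely to illustrate the filtering above.
draft_attn_layer_names = {
    "model.layers.61.self_attn.attn",     # genuine attention layer (made up)
    "model.layers.61.self_attn.indexer",  # Lightning Indexer layer (made up)
}
draft_indexer_layer_names = {"model.layers.61.self_attn.indexer"}

# Subtracting the indexer names guarantees every entry, including entry [0]
# that _get_attention_metadata_builder inspects, is a real attention layer.
attn_layer_names = list(draft_attn_layer_names - draft_indexer_layer_names)
assert attn_layer_names == ["model.layers.61.self_attn.attn"]
```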
LGTM, thanks!
All CI checks have passed, and the issues in Issue #26711 are resolved. Can we merge this PR and close the corresponding issue?
Purpose

Fix CUDA Graph Capture Crash in EAGLE (see Issue #26711)

- Resolve `IndexError: list index out of range` when accessing `self.cudagraph_batch_sizes[-1]` during CUDA graph capture.
- `gpu_model_runner` passes `use_cudagraphs=True`, but MTP heads of draft models (e.g., DeepSeek MTP) lack native CUDA graph support: for such models `self.use_cuda_graph` is set to `False` and `self.cudagraph_batch_sizes` is initialized as an empty list. `dummy_run` checked only its `use_cudagraphs` argument, not `self.use_cuda_graph`, leading to out-of-range access of the empty `cudagraph_batch_sizes` list.
- In `vllm/v1/spec_decode/eagle.py`: in `dummy_run`, gate padding and `cudagraph_runtime_mode` with `(use_cudagraphs and self.use_cuda_graph)` to prevent empty-list indexing and align with runner behavior (see the sketch below).
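A minimal, self-contained sketch of that gating (the real `dummy_run` takes many more arguments, and the padding step here is a toy stand-in rather than vLLM's actual helper):

```python
def dummy_run_sketch(
    use_cudagraphs: bool,           # what gpu_model_runner passes in
    drafter_use_cuda_graph: bool,   # whether this drafter supports CUDA graphs
    cudagraph_batch_sizes: list[int],
    num_tokens: int,
) -> tuple[int, str]:
    """Toy version of the gating added to dummy_run."""
    # Both the caller and the drafter must agree before any graph logic runs;
    # this is what prevents indexing an empty cudagraph_batch_sizes list.
    cudagraphs_enabled = use_cudagraphs and drafter_use_cuda_graph
    if cudagraphs_enabled and num_tokens <= cudagraph_batch_sizes[-1]:
        # Pad up to the nearest captured batch size (stand-in for the real
        # padding logic) and run in a CUDA graph runtime mode.
        num_tokens = min(s for s in cudagraph_batch_sizes if s >= num_tokens)
        return num_tokens, "PIECEWISE"
    # Drafters without CUDA graph support fall back to eager execution.
    return num_tokens, "NONE"


# MTP-style drafter: caller requests graphs, drafter can't use them, no crash.
assert dummy_run_sketch(True, False, [], 5) == (5, "NONE")
assert dummy_run_sketch(True, True, [1, 2, 4, 8], 5) == (8, "PIECEWISE")
```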
Fix DeepSeek V3.2 MTP Metadata & Sparse MLA Layer Selection

- Resolve incorrect layer classification and metadata mapping for DeepSeek V3.2 Sparse MLA.
- In `vllm/v1/spec_decode/eagle.py`: in `load_model`, filter indexer layers out of the attention layer list using `draft_attn_layer_names - draft_indexer_layer_names`, ensuring accurate metadata selection for DeepSeek MTP heads.

Test Plan
For the scenarios described in Issue #26711 (DeepSeek V3.2 MTP metadata anomalies and CUDA graph capture crashes in EAGLE mode), run tests with and without the fix applied to verify whether the issues recur.
Test Result
Verification for Issue #26711 has been completed. After the fix: DeepSeek V3.2 MTP metadata mapping works normally, and there are no crashes in CUDA graph capture under EAGLE mode. The original issues have been resolved.