@LucasWilkinson (Collaborator) commented Nov 7, 2025

Temporary fix for #28207

For now, just make sure that when spec-decode is enabled, the cudagraph shapes are evenly divisible by 1 + num_speculative_tokens; see #28207 for more details.
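
As a rough illustration of the rounding involved (not the exact code in this PR; the round_up helper and the example capture sizes below are assumptions), rounding each capture size up to a multiple of 1 + num_speculative_tokens looks like this:

def round_up(x: int, multiple: int) -> int:
    # Smallest multiple of `multiple` that is >= x.
    return ((x + multiple - 1) // multiple) * multiple

num_speculative_tokens = 2
multiple_of = 1 + num_speculative_tokens       # 3 tokens per request per decode step
capture_sizes = [1, 2, 4, 8, 16, 32]           # assumed example capture sizes
adjusted = sorted({round_up(s, multiple_of) for s in capture_sizes})
print(adjusted)                                # [3, 6, 9, 18, 33]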

Test 1

Tested with

MODEL=deepseek-ai/DeepSeek-R1
VLLM_ATTENTION_BACKEND="FLASHINFER_MLA" \
VLLM_USE_V1=1 \
vllm serve $MODEL \
--tensor-parallel-size 8 \
--disable-log-requests \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--trust-remote-code \
--block-size 64 \
--kv-cache-dtype fp8 \
--speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 2}'

MODEL=deepseek-ai/DeepSeek-R1
lm_eval \
    --model local-completions \
    --tasks gsm8k \
    --model_args model=${MODEL},base_url=http://127.0.0.1:8000/v1/completions \
    --batch_size 100

Hits an IMA (illegal memory access) on main but not on this branch.

Test 2

vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "pass_config": {"enable_sequence_parallelism": true}}' \
  --speculative-config='{"method": "deepseek_mtp", "num_speculative_tokens": 2}'

Correctly fails with

ValueError: Can't determine cudagraph shapes that are both a multiple of 3 
(num_speculative_tokens + 1) required by spec-decode and 8 (tensor_parallel_size) 
required by sequence parallelism please adjust num_speculative_tokens or disable 
sequence parallelism
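
For context on the arithmetic behind this error: a capture size that satisfied both requirements at once would have to be a common multiple of 3 and 8, i.e. a multiple of 24, which is presumably why the PR raises here rather than trying to reconcile the two. A quick check of that arithmetic (illustrative only, not code from this PR):

import math

spec_multiple = 2 + 1   # num_speculative_tokens + 1
sp_multiple = 8         # tensor_parallel_size required by sequence parallelism
print(math.lcm(spec_multiple, sp_multiple))   # 24: smallest size satisfying both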

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a temporary fix for an issue with multi-token prediction and full CUDA graphs by adjusting CUDA graph capture sizes. The core logic change is in a new method, adjust_cudagraph_sizes_to_be_multipe_of, which unfortunately contains a critical bug that can lead to runtime errors and incorrect behavior. I've provided a detailed review comment with a suggested fix for this issue.

Comment on lines 212 to 222
def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
    new_sizes = sorted(
        [
            round_up(size, multiple_of)
            for size in self.compilation_config.cudagraph_capture_sizes
        ]
    )
    if new_sizes[-1] > self.compilation_config.max_cudagraph_capture_size:
        new_sizes = new_sizes[:-1]
    self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
    self.compilation_config.cudagraph_capture_sizes = new_sizes

critical

The current implementation of adjust_cudagraph_sizes_to_be_multipe_of has several critical issues that can lead to incorrect behavior or runtime errors:

  1. Potential IndexError: If all cudagraph_capture_sizes, when rounded up, exceed max_cudagraph_capture_size, the new_sizes list can become empty after the if condition, leading to an IndexError on new_sizes[-1]. For example, if cudagraph_capture_sizes is [16], max_cudagraph_capture_size is 16, and multiple_of is 20, new_sizes becomes [20]. The if condition is met, and new_sizes is modified to [], causing a crash on the next line.

  2. Incorrect Filtering: The logic if new_sizes[-1] > ...: new_sizes = new_sizes[:-1] only checks and removes the largest element. If multiple rounded-up sizes exceed max_cudagraph_capture_size, the smaller ones will incorrectly remain in the list.

  3. Incorrect max_cudagraph_capture_size update: The max_cudagraph_capture_size can be updated to a value larger than its original value, which seems to contradict its purpose as a hard limit derived from scheduler and token configurations.

I suggest a more robust implementation that correctly filters the sizes and handles edge cases gracefully.

Additionally, there is a typo in the method name (multipe_of should be multiple_of). I've kept it in the suggestion to match the current code, but it should be corrected here and at the call site.

    def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
        max_size = self.compilation_config.max_cudagraph_capture_size
        # Use a set to handle duplicates from rounding up
        rounded_sizes = {
            round_up(size, multiple_of)
            for size in self.compilation_config.cudagraph_capture_sizes
        }
        new_sizes = sorted([s for s in rounded_sizes if s <= max_size])

        if not new_sizes:
            # All rounded-up sizes exceeded the max size.
            # Disable cudagraphs by setting sizes to empty.
            self.compilation_config.max_cudagraph_capture_size = 0
            self.compilation_config.cudagraph_capture_sizes = []
            return

        self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
        self.compilation_config.cudagraph_capture_sizes = new_sizes
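
As a quick sanity check of the suggestion against the IndexError scenario from point 1 (capture sizes [16], max size 16, multiple_of 20), the filtered list comes out empty and cudagraphs are disabled instead of crashing; a small walk-through, reusing an assumed round_up helper:

def round_up(x: int, multiple: int) -> int:
    return ((x + multiple - 1) // multiple) * multiple

capture_sizes, max_size, multiple_of = [16], 16, 20            # edge case from point 1
rounded = {round_up(s, multiple_of) for s in capture_sizes}    # {20}
new_sizes = sorted(s for s in rounded if s <= max_size)        # []
# The suggested code then takes the `if not new_sizes:` branch:
# max_cudagraph_capture_size -> 0, cudagraph_capture_sizes -> []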

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the computation of bs_to_padded_graph_size and introduces logic to adjust CUDA graph capture sizes. While the intent is to fix an issue with speculative decoding, the changes introduce two critical bugs. First, the refactoring of bs_to_padded_graph_size computation breaks the model initialization order, as it's now computed after profile_run which depends on it. Second, the new method to adjust capture sizes contains a typo and is vulnerable to an IndexError if it results in an empty list of sizes. I have provided detailed comments and suggestions to fix these critical issues.
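
For readers unfamiliar with that table: bs_to_padded_graph_size maps a batch size to the captured graph size it is padded up to, roughly along the lines of the sketch below (assumed semantics for illustration, not the exact vLLM implementation):

def build_bs_to_padded_graph_size(capture_sizes: list[int]) -> list[int]:
    # For every batch size up to the largest capture size, record the smallest
    # captured size that can hold it.
    sizes = sorted(capture_sizes)
    return [next(s for s in sizes if s >= bs) for bs in range(sizes[-1] + 1)]

# With capture sizes already rounded to multiples of 3 (num_speculative_tokens = 2):
print(build_bs_to_padded_graph_size([3, 6, 9]))   # [3, 3, 3, 3, 6, 6, 6, 9, 9, 9]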

@gemini-code-assist bot (Contributor):

Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@LucasWilkinson LucasWilkinson changed the title from "[WIP] Tmp fix for IMA with MTP = 2 and full-cg" to "[BugFix] Temporary fix for IMA with MTP = 2 and full-cg" on Nov 12, 2025
@LucasWilkinson LucasWilkinson marked this pull request as ready for review November 12, 2025 04:41
Temp fix for vllm-project#28207

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/tmp-full-cg-mtp-2-fix branch from 5c36137 to 529078e on November 12, 2025 at 04:41

@LucasWilkinson LucasWilkinson added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label Nov 13, 2025
LucasWilkinson and others added 2 commits November 14, 2025 00:23
LucasWilkinson and others added 4 commits November 14, 2025 16:45
@LucasWilkinson LucasWilkinson added this to the v0.11.1 milestone Nov 15, 2025
@ProExpertProg (Collaborator) left a comment

This might not be compatible with sequence parallelism, but that's for high-throughput cases anyway; it just might be worth adding a warning.

@LucasWilkinson (Collaborator, Author) commented:

This might not be compatible with sequence parallelism, but that's for high-throughput cases anyway; it just might be worth adding a warning.

Done 👍

@mgoin mgoin merged commit 64e39d6 into vllm-project:main Nov 17, 2025
49 checks passed