[Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs #26231
Conversation
… chunked prefill occurs Signed-off-by: seven-mile <i@7li.moe>
Code Review
This pull request addresses a critical bug in speculative decoding where a faulty optimization for padded speculation during chunked prefill could lead to incorrect valid_mask generation. The author correctly identifies that the max_gen_len == 1 condition is not exclusive to the decode phase and can also occur during chunked prefills, causing sentinel values to be treated as valid. The proposed fix, which removes the problematic fast path and reverts to the more general and robust masking logic, is correct and effectively resolves the issue. This change prevents potential out-of-bounds errors and ensures the stability of speculative decoding under these conditions. The reasoning is sound, and the fix is well-justified.
LGTM, thanks for the fix!
An alternative way to fix this might be to change the index_fill_ to apply to the mask in the max_gen_len == 1 case. The fix implemented here is also fine and likely just as performant.
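As a rough sketch of that alternative (hypothetical tensor names taken from this discussion, not the actual eagle.py code), the fast path would be kept but would explicitly invalidate the discarded rows:

```python
import torch

def fast_path_valid_mask(sampled_token_ids: torch.Tensor,
                         discard_request_indices: torch.Tensor) -> torch.Tensor:
    # Only for the max_gen_len == 1 case: start from an all-ones mask ...
    valid_mask = torch.ones_like(sampled_token_ids, dtype=torch.bool)
    # ... but clear the rows of discarded requests (e.g. chunked prefills),
    # so they are never treated as valid samples.
    valid_mask.index_fill_(0, discard_request_indices, False)
    return valid_mask
```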
Please fix the conflicts with main

This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…n chunked prefill occurs (vllm-project#26231) Signed-off-by: seven-mile <i@7li.moe> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
…n chunked prefill occurs (vllm-project#26231) Signed-off-by: seven-mile <i@7li.moe> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
I encountered an issue similar to #26198, but within EAGLE3, as they share the same code path.
The root cause is an "optimization" for calculating valid_mask for padded speculation, introduced by #24539:

vllm/vllm/v1/spec_decode/eagle.py, lines 504 to 512 in 5c057e0

max_gen_len == 1 usually means that speculation did not happen during the decode phase, so it seems reasonable to have a fast path that returns a valid_mask full of 1. But note that any prefill input batch also meets this condition. When a prefill request is chunked because it exceeds the token budget of 8192, it contributes to discard_request_indices (it is certainly not supposed to be sampled before all of its prompt tokens are consumed). But the fast path still yields a valid_mask full of 1, which is incorrect and leads to sentinel values leaking and, further on, out-of-bounds accesses.
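To make the failure mode concrete, here is a simplified sketch of the two paths (assumed names, shapes, and sentinel value, not the exact code at the permalink above): the removed fast path returned an all-ones mask, while the general path this PR falls back to derives the mask from a sentinel written into the discarded rows.

```python
import torch

PLACEHOLDER_TOKEN_ID = -1  # assumed sentinel for padded/discarded positions

def compute_valid_mask(sampled_token_ids: torch.Tensor,
                       discard_request_indices: torch.Tensor,
                       max_gen_len: int) -> torch.Tensor:
    # Removed fast path: wrongly assumed max_gen_len == 1 implies a pure
    # decode step, so a chunked prefill in the batch was marked valid too.
    # if max_gen_len == 1:
    #     return torch.ones_like(sampled_token_ids, dtype=torch.bool)

    # General path (kept by this PR): write the sentinel into the rows of
    # discarded requests, then treat every non-sentinel position as valid.
    sampled = sampled_token_ids.clone()
    sampled.index_fill_(0, discard_request_indices, PLACEHOLDER_TOKEN_ID)
    return sampled != PLACEHOLDER_TOKEN_ID
```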
Test Plan

This change removes a problematic optimization and reverts to the default behavior. I believe no complex tests are necessary.
I tested the fix on 2xH100 using the following commands:
vllm serve \
  Qwen/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 7000 \
  --seed 42 \
  -dp 2 \
  --enable-expert-parallel \
  --enforce-eager \
  --max-model-len 4096 \
  --gpu_memory_utilization 0.8 \
  --speculative-config '{"model":"Tengyunw/qwen3_30b_moe_eagle3","num_speculative_tokens":4}'
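A request like the following (illustrative only, not part of the original test plan) can then be sent against the server's OpenAI-compatible endpoint; a prompt long enough to exceed the prefill token budget is what exercises the chunked-prefill path.

```python
from openai import OpenAI

# Points at the server started above; vLLM's OpenAI-compatible API does not
# check the key, so any placeholder works.
client = OpenAI(base_url="http://localhost:7000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    prompt="(a prompt long enough to be split into prefill chunks) ...",
    max_tokens=64,
)
print(completion.choices[0].text)
```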
Test Result

It works without any crash.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.