[Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs #26231
Conversation
… chunked prefill occurs Signed-off-by: seven-mile <i@7li.moe>
Code Review
This pull request addresses a critical bug in speculative decoding where a faulty optimization for padded speculation during chunked prefill could lead to incorrect valid_mask generation. The author correctly identifies that the max_gen_len == 1 condition is not exclusive to the decode phase and can also occur during chunked prefills, causing sentinel values to be treated as valid. The proposed fix, which removes the problematic fast path and reverts to the more general and robust masking logic, is correct and effectively resolves the issue. This change prevents potential out-of-bounds errors and ensures the stability of speculative decoding under these conditions. The reasoning is sound, and the fix is well-justified.
LGTM, thanks for the fix!
An alternative way to fix this might be to change the index_fill_ to apply to the mask in the max_gen_len == 1 case. The fix implemented here is also fine and likely just as performant.
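As a rough sketch of that alternative (hypothetical tensor names taken from this discussion, not the actual eagle.py code), the fast path would be kept but would explicitly invalidate the discarded rows:

```python
import torch

def fast_path_valid_mask(sampled_token_ids: torch.Tensor,
                         discard_request_indices: torch.Tensor) -> torch.Tensor:
    # Only for the max_gen_len == 1 case: start from an all-ones mask ...
    valid_mask = torch.ones_like(sampled_token_ids, dtype=torch.bool)
    # ... but clear the rows of discarded requests (e.g. chunked prefills),
    # so they are never treated as valid samples.
    valid_mask.index_fill_(0, discard_request_indices, False)
    return valid_mask
```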
Please fix the conflicts with main

This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
…n chunked prefill occurs (vllm-project#26231) Signed-off-by: seven-mile <i@7li.moe> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
…n chunked prefill occurs (vllm-project#26231) Signed-off-by: seven-mile <i@7li.moe> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
I encountered an issue similar to #26198, but within EAGLE3, as they share the same code path.
The root cause is an "optimization" for calculating valid_mask for padded speculation, introduced by #24539:

vllm/vllm/v1/spec_decode/eagle.py, lines 504 to 512 in 5c057e0

max_gen_len == 1 usually means that speculation did not happen during the decode phase, so it seems reasonable to have a fast path that returns a valid_mask full of 1. But note that any prefill input batch also meets this condition. When a prefill request is chunked because it exceeds the token budget of 8192, it contributes to discard_request_indices (it is certainly not supposed to be sampled before all of its prompt tokens are consumed). But the fast path still yields a valid_mask full of 1, which is incorrect and leads to sentinel values leaking and, further on, out-of-bounds accesses.
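To make the failure mode concrete, here is a simplified sketch of the two paths (assumed names, shapes, and sentinel value, not the exact code at the permalink above): the removed fast path returned an all-ones mask, while the general path this PR falls back to derives the mask from a sentinel written into the discarded rows.

```python
import torch

PLACEHOLDER_TOKEN_ID = -1  # assumed sentinel for padded/discarded positions

def compute_valid_mask(sampled_token_ids: torch.Tensor,
                       discard_request_indices: torch.Tensor,
                       max_gen_len: int) -> torch.Tensor:
    # Removed fast path: wrongly assumed max_gen_len == 1 implies a pure
    # decode step, so a chunked prefill in the batch was marked valid too.
    # if max_gen_len == 1:
    #     return torch.ones_like(sampled_token_ids, dtype=torch.bool)

    # General path (kept by this PR): write the sentinel into the rows of
    # discarded requests, then treat every non-sentinel position as valid.
    sampled = sampled_token_ids.clone()
    sampled.index_fill_(0, discard_request_indices, PLACEHOLDER_TOKEN_ID)
    return sampled != PLACEHOLDER_TOKEN_ID
```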
Test Plan

This change removes a problematic optimization and reverts to the default behavior. I believe no complex tests are necessary.
I tested the fix on 2xH100 using the following commands:
vllm serve \
  Qwen/Qwen3-30B-A3B \
  --host 0.0.0.0 \
  --port 7000 \
  --seed 42 \
  -dp 2 \
  --enable-expert-parallel \
  --enforce-eager \
  --max-model-len 4096 \
  --gpu_memory_utilization 0.8 \
  --speculative-config '{"model":"Tengyunw/qwen3_30b_moe_eagle3","num_speculative_tokens":4}'
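A request like the following (illustrative only, not part of the original test plan) can then be sent against the server's OpenAI-compatible endpoint; a prompt long enough to exceed the prefill token budget is what exercises the chunked-prefill path.

```python
from openai import OpenAI

# Points at the server started above; vLLM's OpenAI-compatible API does not
# check the key, so any placeholder works.
client = OpenAI(base_url="http://localhost:7000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen3-30B-A3B",
    prompt="(a prompt long enough to be split into prefill chunks) ...",
    max_tokens=64,
)
print(completion.choices[0].text)
```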
Test Result

It works without any crash.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.