[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceeding draft model length #24662

AlonKejzman · 2025-09-11T13:18:08Z

Purpose

Enable running Eagle Speculative Decoding in environments where the input may exceed the drafter model length but not the verifier's.

Test Plan

Ran with inputs shorter than the drafter's model length, and then with inputs whose length falls between the drafter's limit and the verifier's limit.

Test Result

Successful.

github-actions · 2025-09-11T13:18:18Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request introduces a bugfix to prevent crashes in Eagle Speculative Decoding when the input sequence length exceeds the draft model's capacity. The fix correctly adds a check to early-exit from the propose method.

My review identifies a critical issue with the implementation of this early exit. The returned empty tensor has a hardcoded shape and incorrect data type, which can lead to crashes or incorrect behavior in batched scenarios. I've provided a suggestion to fix this by dynamically creating the tensor with the correct shape and dtype. I've also recommended removing the newly introduced constant, as it becomes obsolete with the suggested fix.

vllm/v1/spec_decode/eagle.py

tomasruizt · 2025-09-14T08:14:48Z

Could you provide commands to reproduce the issue?

AlonKejzman · 2025-09-14T10:30:16Z

Sure!

Serving the model

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 4, "max_model_len": 2048}'

Request that makes it crash

import requests

url = "http://localhost:8000/v1/chat/completions"

headers = {
    "Authorization": "Bearer EMPTY", 
    "Content-Type": "application/json",
}

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!" * 2049}
    ],
    "temperature": 0.0,
    "max_tokens": 1,
}

requests.post(url, headers=headers, json=payload, timeout=60)

tomasruizt · 2025-09-14T17:19:31Z

Is it intended behavior for the draft model to have a shorter model length compared to the target model?

If I understand the use case correctly, it's to use spec decoding only on short sequences and not on longer sequences.

If that is intended, then perhaps we should not even be calling the drafting method in the GPUModelRunner, but rather skipping it altogether.

benchislett · 2025-09-15T13:20:46Z

See also #22935, which is a more restrictive approach that doesn't let the serving engine launch with a max_model_len larger than that of the drafter. I think the proper strategy is to simply skip drafting entirely from gpu_model_runner.propose_draft_token_ids() and not in Eagle.propose(), since logic like prepare_inputs and other setup would still run.

luccafong · 2025-09-15T16:43:01Z

vllm/v1/spec_decode/eagle.py

@@ -159,6 +159,14 @@ def propose(
        sampling_metadata: SamplingMetadata,
        mm_embeds: Optional[list[torch.Tensor]] = None,
    ) -> torch.Tensor:
+        # do not attempt to forward if the input size is too big
+        if common_attn_metadata.seq_lens.max(
+        ) + self.num_speculative_tokens > self.draft_model_config.max_model_len:


This feels more like a general spec decoding issue than an Eagle-specific one. Could we add it to gpu_model_runner?

vllm/vllm/v1/worker/gpu_model_runner.py

Line 2152 in 01413e0

self._draft_token_ids = self.propose_draft_token_ids(
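
The check being discussed can be sketched in isolation. The snippet below is a minimal illustration only, not the code that was merged: `DraftLimits` and `should_skip_drafting` are hypothetical names, and plain Python lists stand in for the tensors used in vLLM. It mirrors the condition in the diff above (longest sequence plus the number of speculative tokens compared against the drafter's `max_model_len`) as a guard that a caller could evaluate before invoking the drafter at all.

```python
from dataclasses import dataclass


@dataclass
class DraftLimits:
    """Hypothetical container for the two values the check needs."""
    draft_max_model_len: int
    num_speculative_tokens: int


def should_skip_drafting(seq_lens: list[int], limits: DraftLimits) -> bool:
    """Return True if drafting would push any sequence past the drafter's limit.

    Mirrors the condition from the diff:
    max(seq_lens) + num_speculative_tokens > draft max_model_len.
    """
    longest = max(seq_lens, default=0)  # an empty batch never triggers a skip
    return longest + limits.num_speculative_tokens > limits.draft_max_model_len


# Values taken from the reproduction command: drafter limit 2048, 4 draft tokens.
limits = DraftLimits(draft_max_model_len=2048, num_speculative_tokens=4)

short_batch = [100, 512]    # every sequence fits in the drafter
mixed_batch = [100, 2045]   # 2045 + 4 = 2049 exceeds 2048

print(should_skip_drafting(short_batch, limits))
print(should_skip_drafting(mixed_batch, limits))
```

Because the condition uses the maximum over the batch, one long sequence disables drafting for every request in that step. That is the behaviour of the condition as written in the diff; whether per-request skipping would be preferable is not settled in this thread.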

mergify · 2025-09-21T15:32:08Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @AlonKejzman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

AlonKejzman · 2025-09-21T15:42:51Z

@tomasruizt @luccafong You are right, I adjusted the fix accordingly
@benchislett WDYT? Maybe this is more flexible than #22935 since it allows for smaller drafter models while bypassing the drafter when needed?

AlonKejzman requested review from benchislett and luccafong as code owners September 11, 2025 13:18

mergify bot added speculative-decoding v1 labels Sep 11, 2025

gemini-code-assist bot reviewed Sep 11, 2025

AlonKejzman force-pushed the main branch 3 times, most recently from 332386b to 58adccd Compare September 11, 2025 14:23

luccafong reviewed Sep 15, 2025

AlonKejzman force-pushed the main branch from 58adccd to db035da Compare September 21, 2025 15:31

AlonKejzman requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners September 21, 2025 15:31

mergify bot added the needs-rebase label Sep 21, 2025

AlonKejzman force-pushed the main branch 2 times, most recently from 2bbc4cf to 3ea00d4 Compare September 21, 2025 15:36

mergify bot removed the needs-rebase label Sep 21, 2025

AlonKejzman force-pushed the main branch from 3ea00d4 to 187dd68 Compare September 21, 2025 15:39

AlonKejzman force-pushed the main branch 2 times, most recently from 54e7eab to abaf634 Compare September 21, 2025 16:13

AlonKejzman requested review from DarkLight1337, NickLucche, ProExpertProg, aarnphm, chaunceyjiang, houseroad and yewentao256 as code owners September 25, 2025 12:19

mergify bot added the documentation, frontend, multi-modality, qwen, and structured-output labels Sep 25, 2025

github-project-automation bot added this to Structured Output Sep 25, 2025

AlonKejzman force-pushed the main branch 4 times, most recently from 90e74e5 to 515a675 Compare September 25, 2025 12:45

Merge branch 'main' into main

93ca8fd

benchislett enabled auto-merge (squash) September 25, 2025 13:17

benchislett merged commit e04a1b6 into vllm-project:main Sep 25, 2025
42 checks passed

github-project-automation bot moved this to Done in Structured Output Sep 25, 2025

lhtin mentioned this pull request Sep 29, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync #25884

Merged

5 tasks

sergiopaniego pushed a commit to sergiopaniego/vllm that referenced this pull request Sep 29, 2025

[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… (

30abe33

vllm-project#24662) Signed-off-by: AlonKejzman <alonkeizman@gmail.com>

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025

[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… (

252a0ff

#24662) Signed-off-by: AlonKejzman <alonkeizman@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>

benchislett mentioned this pull request Oct 6, 2025

[Bug]: EAGLE crashing on Blackwell #22755

Closed

1 task

choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025

[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… (

796ed02

vllm-project#24662) Signed-off-by: AlonKejzman <alonkeizman@gmail.com>

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… (

8e5a09c

vllm-project#24662) Signed-off-by: AlonKejzman <alonkeizman@gmail.com>

4 participants
