
Conversation

@Ronald1995 (Contributor) commented Sep 13, 2025

Purpose

PR #19970 implements async_scheduling, and PR #23569 implements prepare_input overlap based on PR #19970. PR #24539 refactors the eagle spec decode logic so that it no longer relies on the CPU's sampled token ids.

This PR is based on #24539, and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and update input_ids with _draft_token_ids directly for the next step's execute_model.

Because ngram and medusa currently rely on the CPU's sampled token ids, they could be refactored in the future; for now, this PR only supports eagle spec decode with async_scheduling.
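The caching idea above can be sketched as follows. This is a simplified illustration with hypothetical names (ModelRunnerSketch, sample_and_draft, build_next_input are not the actual vLLM internals), showing only the flow of keeping draft tokens on the worker instead of round-tripping them through the scheduler:

```python
# Hypothetical sketch of the approach described above: instead of copying
# draft token ids back to the scheduler process, the model runner caches
# them between steps and splices them into the next step's input ids.

class ModelRunnerSketch:
    def __init__(self):
        # Cached on the worker between execute_model steps.
        self._draft_token_ids = None

    def sample_and_draft(self, sampled, drafts):
        # Cache the draft tokens instead of returning them to the scheduler.
        self._draft_token_ids = drafts
        return sampled

    def build_next_input(self, scheduler_tokens):
        # Next step: append the cached drafts directly to the input ids,
        # so the scheduler never needs to see them.
        if self._draft_token_ids is not None:
            return scheduler_tokens + self._draft_token_ids
        return scheduler_tokens

runner = ModelRunnerSketch()
runner.sample_and_draft(sampled=[11], drafts=[21, 22])
next_input = runner.build_next_input([11])  # [11, 21, 22]
```

The point of the sketch is that the scheduler's view stays unchanged; only the worker-side input assembly knows about the cached drafts.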

Test Plan

We run an e2e test:

  • async_scheduling + EAGLE-LLaMA3-Instruct-8B draft model, verifying that it works correctly.

Test config:

# dataset is prm800k, read the jsonl and make prompts.
sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True, 
    speculative_config={
            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
            "draft_tensor_parallel_size": 1,
            "num_speculative_tokens": 2,
            "method": "eagle",
        },
    seed=1234
)

test device: Nvidia A100

Test Result

performance

| num_prompts | async_scheduling (tps) | sync_scheduling (tps) | speedup |
|---|---|---|---|
| 24 | 2356 | 2314 | 1.8% |
| 48 | 3759 | 3539 | 6.2% |
| 96 | 5110 | 4770 | 7.1% |

precision

I compared the outputs of async_scheduling and sync_scheduling with speculative decoding;
the outputs are identical, so async_scheduling does not introduce precision problems.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 13, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 2 times, most recently from f417e8f to b530bf3 Compare September 13, 2025 07:57
@mergify mergify bot removed the needs-rebase label Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 8172c2b to 163f9ab Compare September 13, 2025 09:42
@robertgshaw2-redhat robertgshaw2-redhat changed the title async_scheduling for sepc code [Core] Async Scheduling X Spec Decoding Compatibility Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 4466156 to f971753 Compare September 15, 2025 01:29

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@Ronald1995 Ronald1995 changed the title [Core] Async Scheduling X Spec Decoding Compatibility [WIP][Core] Async Scheduling X Spec Decoding Compatibility Sep 19, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 13773be to 337aab8 Compare September 20, 2025 11:51
@Ronald1995 Ronald1995 requested a review from ApostaC as a code owner September 20, 2025 11:51
@mergify mergify bot removed the needs-rebase label Sep 20, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 3 times, most recently from 3630428 to 3ad3c1b Compare September 21, 2025 09:20

mergify bot commented Oct 31, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 31, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@mergify mergify bot removed the needs-rebase label Oct 31, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 9885f53 to bc3b8a6 Compare October 31, 2025 02:26
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from bc3b8a6 to c82429d Compare October 31, 2025 02:30
fix typo and grammar

Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>

@benchislett benchislett left a comment


Thanks for the contribution and continued effort. I am satisfied with the state of the PR

@vadiklyutiy (Collaborator):

May I double-check: after this PR is merged, will async scheduling work fine with spec decoding?

@vadiklyutiy (Collaborator):

@benchislett Maybe it's worth running CI against this PR?

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
sfc-gh-yewang added a commit to sfc-gh-yewang/vllm that referenced this pull request Nov 1, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 5667a1c to 409e504 Compare November 1, 2025 02:42
@Ronald1995 (PR author):

May I double-check: after this PR is merged, will async scheduling work fine with spec decoding?

The answer is yes. You can see the e2e test case test_async_scheduling.py; a test for async_scheduling with spec decode has been added there.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@benchislett (Collaborator):

Waiting for final sign-off from @njhill or @WoosukKwon before marking as "ready"


njhill commented Nov 4, 2025

Apologies I was battling a perf regression and more CI issues today but will focus on this tomorrow.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

@njhill njhill left a comment


Thanks @Ronald1995 ... I'm still in the middle of reviewing, but I'm posting a few comments as I go, which you could maybe start to look at in parallel...

    assert _all_logprobs_match(
        results[0][1], other_test_logprobs
    )

def test_with_spec_decoding(self, monkeypatch: pytest.MonkeyPatch):
Member:

It would be good for the baseline run of this to not use spec decoding, so that we compare the output with that too.

Contributor (PR author):

Ok, I will complete the test case.

        results[0][1], other_test_logprobs
    )

    outputs.append((test_config, results))
Member:

It would be good to record the acceptance rate for each test run here and compare that at the end too (for the spec decoding cases of course).

I think this could be done fairly easily via the exposed metrics, you can look at other tests as an example, e.g.

# Collect draft and acceptance stats.
metrics = spec_llm.get_metrics()

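The acceptance-rate comparison suggested above could look roughly like the sketch below. The metric names are assumptions modeled on vLLM's spec-decode counters (check the actual names emitted by your build); the helper simply divides accepted draft tokens by drafted tokens:

```python
# Sketch of computing the acceptance rate from exposed metrics.
# The metric names used here are assumptions, not guaranteed vLLM names.

from dataclasses import dataclass

@dataclass
class Counter:
    # Minimal stand-in for a metric object with a name and a value.
    name: str
    value: int

def acceptance_rate(metrics):
    totals = {m.name: m.value for m in metrics}
    drafted = totals.get("vllm:spec_decode_num_draft_tokens", 0)
    accepted = totals.get("vllm:spec_decode_num_accepted_tokens", 0)
    return accepted / drafted if drafted else 0.0

# Example with made-up counter values: 150 of 200 drafts accepted.
metrics = [
    Counter("vllm:spec_decode_num_draft_tokens", 200),
    Counter("vllm:spec_decode_num_accepted_tokens", 150),
]
rate = acceptance_rate(metrics)  # 0.75
```

In the test, the rates from the async and sync runs would then be compared at the end alongside the outputs.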
@Ronald1995 (PR author) Nov 5, 2025:

Ok, I will add this.

Comment on lines 299 to 303
if (
(sampling_params := request.sampling_params)
and self.use_spec_decode
and self.async_scheduling
):
Member:

We should move these checks to the front-end. I can do this soon in another branch based on this one...

@Ronald1995 (PR author) Nov 6, 2025:

By front-end, do you mean CoreClient? I tried to add the check in the add_request method of CoreClient, but there are several kinds of CoreClient, so I added the check logic in core.py to avoid too many modifications to CoreClient.

Contributor (PR author):

@njhill I have fixed the issues with the e2e test. Thanks.

Member:

Thanks @Ronald1995. I guess it would be good if possible to make these additional changes to the test:

  • Have the baseline for output comparisons be the non-spec decode run
  • Don't run any other non-spec server configs as part of the spec decode test (since we don't really need to test those, and starting the larger model adds significant time)

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 9338a9b to ff928a2 Compare November 5, 2025 09:07
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from ff928a2 to 5aa2e38 Compare November 5, 2025 09:39
Signed-off-by: Nick Hill <nhill@redhat.com>

njhill commented Nov 6, 2025

By front-end, do you mean CoreClient? I tried to add the check in the add_request method of CoreClient, but there are several kinds of CoreClient, so I added the check logic in core.py to avoid too many modifications to CoreClient.

@Ronald1995 I've pushed a commit with this change (moving the validation to processor.py), along with some other minor simplifications and comment cleanup; I hope this is ok.

monkeypatch,
MTP_MODEL,
[{}],
spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
Member:

Could we change this to 2? Perhaps it doesn't make a difference, but I feel it may avoid some >1 edge case.

Suggested change
spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
spec_configs=[{"method": "mtp", "num_speculative_tokens": 2}, None],


njhill commented Nov 6, 2025

@Ronald1995 what will happen if input_fits_in_drafter is False here?

if use_padded_batch_for_eagle and input_fits_in_drafter:
    # EAGLE speculative decoding can use the GPU sampled tokens
    # as inputs, and does not need to wait for bookkeeping to finish.
    propose_draft_token_ids(sampler_output.sampled_token_ids)

I think it won't update self._draft_token_ids, input_batch.prev_sampled_token_ids, or self.valid_sampled_token_count_cpu, and won't call record() on the event, which I expect means that things will hang?

I think we also need to address that case if so.

Comment on lines +1514 to +1527
if self.speculative_config.get("method") not in get_args(
    EagleModelTypes
):
    raise ValueError(
        "Currently, async scheduling is only supported "
        "with EAGLE/MTP kinds of speculative decoding"
    )
elif self.speculative_config.get("disable_padded_drafter_batch"):
    raise ValueError(
        "Async scheduling with EAGLE/MTP speculative decoding "
        "does not support disable_padded_drafter_batch=True. "
        "Please set disable_padded_drafter_batch=False."
Member:

I am moving the async scheduling config validation into VllmConfig in #28250. Hopefully that PR can be merged quickly and we can then move this check along with that.
