
Conversation

@Ronald1995 (Contributor) commented Sep 13, 2025

Purpose

PR #19970 implements async_scheduling, and PR #23569 implements prepare_input overlap based on PR #19970. PR #24539 refactors the eagle spec decode logic so that it no longer relies on the CPU's sampled token ids.

This PR is based on #24539, and aims to support spec decode with async_scheduling. When both async_scheduling and spec decode are enabled, we no longer copy draft token ids back to the scheduler; instead we cache them in gpu_model_runner and update input_ids with _draft_token_ids directly for the next step's execute_model.

Because ngram and medusa currently rely on the CPU's sampled token ids, they could be refactored in the future; for now, this PR only supports eagle spec decode with async_scheduling.
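The caching idea above can be sketched as follows. This is a simplified illustration with hypothetical names (ModelRunnerSketch, sample_and_draft, build_next_input are not the actual vLLM internals), showing only the flow of keeping draft tokens on the worker instead of round-tripping them through the scheduler:

```python
# Hypothetical sketch of the approach described above: instead of copying
# draft token ids back to the scheduler process, the model runner caches
# them between steps and splices them into the next step's input ids.

class ModelRunnerSketch:
    def __init__(self):
        # Cached on the worker between execute_model steps.
        self._draft_token_ids = None

    def sample_and_draft(self, sampled, drafts):
        # Cache the draft tokens instead of returning them to the scheduler.
        self._draft_token_ids = drafts
        return sampled

    def build_next_input(self, scheduler_tokens):
        # Next step: append the cached drafts directly to the input ids,
        # so the scheduler never needs to see them.
        if self._draft_token_ids is not None:
            return scheduler_tokens + self._draft_token_ids
        return scheduler_tokens

runner = ModelRunnerSketch()
runner.sample_and_draft(sampled=[11], drafts=[21, 22])
next_input = runner.build_next_input([11])  # [11, 21, 22]
```

The point of the sketch is that the scheduler's view stays unchanged; only the worker-side input assembly knows about the cached drafts.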

Test Plan

We run an e2e test:

  • async_scheduling + EAGLE-LLaMA3-Instruct-8B draft model, verifying that it works correctly.

Test config:

# dataset is prm800k, read the jsonl and make prompts.
sampling_params = SamplingParams(temperature=0, max_tokens=1024)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.9,
    tensor_parallel_size=1,
    max_model_len=2048,
    max_num_seqs=128,
    max_num_batched_tokens=4096,
    async_scheduling=True, 
    speculative_config={
            "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
            "draft_tensor_parallel_size": 1,
            "num_speculative_tokens": 2,
            "method": "eagle",
        },
    seed=1234
)

test device: Nvidia A100

Test Result

performance

| num_prompts | async_scheduling (tps) | sync_scheduling (tps) | speedup |
|---|---|---|---|
| 24 | 2356 | 2314 | 1.8% |
| 48 | 3759 | 3539 | 6.2% |
| 96 | 5110 | 4770 | 7.1% |

precision

I compared the outputs of async_scheduling and sync_scheduling with speculative decoding;
the outputs are identical, so async_scheduling does not introduce precision problems.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.


mergify bot commented Sep 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 13, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for speculative decoding with asynchronous scheduling, which is a great feature enhancement. The core logic of handling draft tokens within the worker process for async scheduling is sound. However, I've identified a few critical issues in gpu_model_runner.py related to tensor manipulation for scatter operations that will likely cause runtime errors. There's also a minor logic error in how speculative token lists are truncated. The proposed fixes are straightforward. Once these issues are addressed, the implementation should be solid.

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 2 times, most recently from f417e8f to b530bf3 Compare September 13, 2025 07:57
@mergify mergify bot removed the needs-rebase label Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 8172c2b to 163f9ab Compare September 13, 2025 09:42
@robertgshaw2-redhat robertgshaw2-redhat changed the title async_scheduling for sepc code [Core] Async Scheduling X Spec Decoding Compatibility Sep 13, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 4466156 to f971753 Compare September 15, 2025 01:29

mergify bot commented Sep 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 18, 2025
@Ronald1995 Ronald1995 changed the title [Core] Async Scheduling X Spec Decoding Compatibility [WIP][Core] Async Scheduling X Spec Decoding Compatibility Sep 19, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 13773be to 337aab8 Compare September 20, 2025 11:51
@Ronald1995 Ronald1995 requested a review from ApostaC as a code owner September 20, 2025 11:51
@mergify mergify bot removed the needs-rebase label Sep 20, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch 3 times, most recently from 3630428 to 3ad3c1b Compare September 21, 2025 09:20

mergify bot commented Oct 31, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ronald1995.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 31, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@mergify mergify bot removed the needs-rebase label Oct 31, 2025
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 9885f53 to bc3b8a6 Compare October 31, 2025 02:26
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from bc3b8a6 to c82429d Compare October 31, 2025 02:30
fix typo and grammar

Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>

@benchislett benchislett left a comment


Thanks for the contribution and continued effort. I am satisfied with the state of the PR

@vadiklyutiy (Collaborator):

May I double-check: after this PR is merged, will async scheduling work fine with spec decoding?

@vadiklyutiy (Collaborator):

@benchislett Maybe it's worth running CI against this PR?

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
sfc-gh-yewang added a commit to sfc-gh-yewang/vllm that referenced this pull request Nov 1, 2025
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 5667a1c to 409e504 Compare November 1, 2025 02:42
@Ronald1995 (PR author):

May I double-check: after this PR is merged, will async scheduling work fine with spec decoding?

The answer is yes. You can see the e2e test case test_async_scheduling.py; a test for async_scheduling with spec decode has been added there.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@benchislett (Collaborator):

Waiting for final sign-off from @njhill or @WoosukKwon before marking as "ready"


njhill commented Nov 4, 2025

Apologies I was battling a perf regression and more CI issues today but will focus on this tomorrow.

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

@njhill njhill left a comment


Thanks @Ronald1995 ... I'm still in the middle of reviewing, but I'm posting a few comments as I go, which you could maybe start to look at in parallel...

    assert _all_logprobs_match(
        results[0][1], other_test_logprobs
    )

def test_with_spec_decoding(self, monkeypatch: pytest.MonkeyPatch):
Member:

It would be good for the baseline run of this to not use spec decoding, so that we compare the output with that too.

Contributor (PR author):

Ok, I will complete the test case.

        results[0][1], other_test_logprobs
    )

    outputs.append((test_config, results))
Member:

It would be good to record the acceptance rate for each test run here and compare that at the end too (for the spec decoding cases of course).

I think this could be done fairly easily via the exposed metrics, you can look at other tests as an example, e.g.

# Collect draft and acceptance stats.
metrics = spec_llm.get_metrics()

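The acceptance-rate comparison suggested above could look roughly like the sketch below. The metric names are assumptions modeled on vLLM's spec-decode counters (check the actual names emitted by your build); the helper simply divides accepted draft tokens by drafted tokens:

```python
# Sketch of computing the acceptance rate from exposed metrics.
# The metric names used here are assumptions, not guaranteed vLLM names.

from dataclasses import dataclass

@dataclass
class Counter:
    # Minimal stand-in for a metric object with a name and a value.
    name: str
    value: int

def acceptance_rate(metrics):
    totals = {m.name: m.value for m in metrics}
    drafted = totals.get("vllm:spec_decode_num_draft_tokens", 0)
    accepted = totals.get("vllm:spec_decode_num_accepted_tokens", 0)
    return accepted / drafted if drafted else 0.0

# Example with made-up counter values: 150 of 200 drafts accepted.
metrics = [
    Counter("vllm:spec_decode_num_draft_tokens", 200),
    Counter("vllm:spec_decode_num_accepted_tokens", 150),
]
rate = acceptance_rate(metrics)  # 0.75
```

In the test, the rates from the async and sync runs would then be compared at the end alongside the outputs.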
@Ronald1995 (PR author) Nov 5, 2025:

Ok, I will add this.

Comment on lines 299 to 303
if (
(sampling_params := request.sampling_params)
and self.use_spec_decode
and self.async_scheduling
):
Member:

We should move these checks to the front-end. I can do this soon in another branch based on this one...

@Ronald1995 (PR author) Nov 6, 2025:

By front-end, do you mean CoreClient? I tried to add the check in the add_request method of CoreClient, but there are several kinds of CoreClient, so I added the check logic in core.py to avoid too many modifications to CoreClient.

Contributor (PR author):

@njhill I have fixed the issues with the e2e test. Thanks.

Member:

Thanks @Ronald1995. I guess it would be good if possible to make these additional changes to the test:

  • Have the baseline for output comparisons be the non-spec decode run
  • Don't run any other non-spec server configs as part of the spec decode test (since we don't really need to test those, and starting the larger model adds significant time)

@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from 9338a9b to ff928a2 Compare November 5, 2025 09:07
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
@Ronald1995 Ronald1995 force-pushed the async_scheduling_for_spec_decode branch from ff928a2 to 5aa2e38 Compare November 5, 2025 09:39
Signed-off-by: Nick Hill <nhill@redhat.com>

njhill commented Nov 6, 2025

By front-end, do you mean CoreClient? I tried to add the check in the add_request method of CoreClient, but there are several kinds of CoreClient, so I added the check logic in core.py to avoid too many modifications to CoreClient.

@Ronald1995 I've pushed a commit with this change (moving the validation to processor.py), along with some other minor simplifications and comment cleanup; I hope this is ok.

monkeypatch,
MTP_MODEL,
[{}],
spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
Member:

Could we change this to 2? Perhaps it doesn't make a difference, but I feel it may avoid some >1 edge case.

Suggested change
spec_configs=[{"method": "mtp", "num_speculative_tokens": 1}, None],
spec_configs=[{"method": "mtp", "num_speculative_tokens": 2}, None],


njhill commented Nov 6, 2025

@Ronald1995 what will happen if input_fits_in_drafter is False here?

if use_padded_batch_for_eagle and input_fits_in_drafter:
    # EAGLE speculative decoding can use the GPU sampled tokens
    # as inputs, and does not need to wait for bookkeeping to finish.
    propose_draft_token_ids(sampler_output.sampled_token_ids)

I think it won't update self._draft_token_ids, input_batch.prev_sampled_token_ids, or self.valid_sampled_token_count_cpu, and won't call record() on the event, which I expect means that things will hang?

I think we also need to address that case if so.

Comment on lines +1514 to +1527
if self.speculative_config.get("method") not in get_args(
    EagleModelTypes
):
    raise ValueError(
        "Currently, async scheduling is only supported "
        "with EAGLE/MTP kinds of speculative decoding"
    )
elif self.speculative_config.get("disable_padded_drafter_batch"):
    raise ValueError(
        "Async scheduling with EAGLE/MTP speculative decoding "
        "does not support disable_padded_drafter_batch=True. "
        "Please set disable_padded_drafter_batch=False."
Member:

I am moving the async scheduling config validation into VllmConfig in #28250. Hopefully that PR can be merged quickly and we can then move this check along with that.
