[V1][Spec Decode] Async scheduling integration with spec decode #22262

zixi-qi · 2025-08-05T16:38:27Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Support async scheduling with speculative decoding with two changes:

Update number of output placeholders from 1 to 1 + len(spec tokens) and simplified the logic to keep number of output placeholders static.
When async scheduling is enabled, scheduler output is not up to date for speculative tokens and number of tokens rejected from latest model runner execution. To resolve this, this PR added caching within model runner for speculative token ids and number of computed tokens per request and use them to overwrite corresponding information in incoming scheduler output.

Test Plan

Added unit test
Ran e2e tests for acceptance rate, output quality and throughput.

Test Results

existing async scheduler unit tests pass

pytest -v tests/v1/core/test_async_scheduler.py

tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[1] PASSED                                                                                                             [ 12%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[2] PASSED                                                                                                             [ 25%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[3] PASSED                                                                                                             [ 37%]
tests/v1/core/test_async_scheduler.py::test_stop_by_max_tokens[5] PASSED                                                                                                             [ 50%]
tests/v1/core/test_async_scheduler.py::test_abort PASSED                                                                                                                             [ 62%]
tests/v1/core/test_async_scheduler.py::test_preempt PASSED                                                                                                                           [ 75%]
tests/v1/core/test_async_scheduler.py::test_prefix_caching_for_prefill_dedup PASSED                                                                                                  [ 87%]
tests/v1/core/test_async_scheduler.py::test_prefix_caching_for_multi_turn PASSED

sanity checked output quality

VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py --num_spec_tokens 5 --num_prompts 10 --dataset-name hf --dataset-path philschmid/mt-bench --async-scheduling --print-output

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9118.05it/s]
Processed prompts: 100%|███████████████████████████████| 10/10 [00:02<00:00,  4.58it/s, est. speed input: 505.48 toks/s, output: 968.79 toks/s]
--------------------------------------------------
prompt: None
generated text: To evaluate the movie reviews, I will analyze the language and tone used in each review. Here's the evaluation:

1. This review is extremely positive, using words like "phenomenal" and "top-notch" to describe the movie. 
2. This review is extremely negative, using words like "disappointed", "predictable", and "worst" to describe the movie. However, the release year mentioned in the review is 2022, but the movie mentioned in the review was released in 2019. This inconsistency suggests that the reviewer is actually referring to a different movie. I will assume that the reviewer is referring to the movie released in 2019.
3. This review is neutral, using words like "okay" and "ordinary" to describe the movie.

Here's the evaluation as a JSON array of integers:

json
[
  5,
  1,
  3
]

Note: The second review's evaluation is based on the assumption that the reviewer is referring to the movie released in 2019, not 2022.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: As you step into the vibrant marketplace, the cacophony of sounds envelops you - the chatter of vendors, the clanging of pots and pans, and the melodic calls of street performers fill the air. The scent of exotic spices wafts through the crowds, mingling with the sweet aroma of fresh fruit and the savory smell of sizzling street food. The visual feast is just as overwhelming, with vibrant colors and patterns on display: intricately woven textiles, gleaming silver jewelry, and pyramids of juicy produce. The air is thick with the smell of roasting coffee and the sound of vendors hawking their wares - "Fresh coconuts, only 50 cents!" and "Get your handmade crafts here!" The sun beats down on the crowded stalls, casting a warm glow over the scene. As you navigate through the throngs of people, the sensation of the sun on your skin and the sounds of the market create a sensory overload that is both exhilarating and exhausting. Amidst the chaos, a group of musicians begins to play a lively tune on their instruments, drawing a crowd of onlookers who sway to the rhythm, adding to the infectious energy of the marketplace.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: The memories I hold within my ancient, gnarled heart. I've seen generations come and go, seasons rise and fall, and the world around me change in ways both big and small. But nothing could have prepared me for the feeling of being threatened with destruction.

As the deforesters approach, I feel a sense of unease, like a gentle rustling of leaves that grows into a fierce storm. My roots, which have dug deep into the earth for a century, begin to tremble with anxiety. I've seen so many of my friends and companions fall to the axe, their trunks shattered, their limbs torn asunder. The thought of joining them is a prospect too terrible to bear.

As the first blows strike, I feel a searing pain, like a thousand knives slicing through my bark. The sound of the chainsaw is a cacophony of terror, a deafening roar that threatens to consume me whole. I try to stand tall, to resist the onslaught, but it's no use. The weight of the axe, the force of the chainsaw, is too much for me to bear.

I feel my strength waning, my branches weakening, my trunk beginning to splinter. I'm being torn apart, piece by piece,
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Given the seismically active area and the high-magnitude earthquakes experienced in the region, the best approach to constructing the bridge would be to prioritize seismic resilience and safety. Here's a comprehensive plan to ensure the bridge's stability and durability:

1. **Seismic Design and Analysis**: Engage a team of experienced structural engineers and seismologists to conduct a thorough seismic analysis of the site. This will involve:
	* Evaluating the local seismic hazard, including the frequency and magnitude of earthquakes.
	* Conducting a site-specific seismic hazard assessment to determine the expected ground motion.
	* Developing a seismic design that accounts for the site's specific conditions.
2. **Bridge Type and Design**: Choose a bridge type that is well-suited for seismic areas, such as:
	* A cable-stayed or suspension bridge, which can be designed to be more flexible and resistant to seismic forces.
	* A reinforced concrete or steel structure with a robust foundation system.
3. **Foundation Design**: Design a foundation system that can transfer seismic forces to the ground safely. This may include:
	* A deep foundation system, such as piles or caissons, to transfer loads to a stable rock or soil layer.
	* A raft foundation or a mat foundation to distribute loads
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: To identify the company with the highest profit in 2021, we need to compare the profit values of each company.

The profit values are:
- Company X: $3 billion
- Company Y: $6 billion
- Company Z: $7 billion
- Company W: $21 billion
- Company V: $25 billion
- Company U: $20 billion

The company with the highest profit in 2021 is Company V, with a profit of $25 billion. The CEO of Company V is Lisa Brown.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Here's the corrected paragraph:

She didn't remember where her purse was, so I thought it might be in the car, but he said it was on the kitchen table, but he wasn't sure. Then they asked me to look for it. She said, "Can you?" and I responded with, "Maybe, but I'm not sure." He didn't hear me, and he asked, "What?" Then he asked, "Did you find it?"

Corrected errors:

- "remembre" -> "remember"
- "where is" -> "where her purse was" (added "her purse" for clarity)
- "I thinks" -> "I thought" (correct verb form)
- "its" -> "it" (correct possessive form)
- "he's" -> "he said" (correct verb form)
- "he are" -> "he wasn't" (correct verb form)
- "looking for it" -> "look for it" (correct verb form)
- "she's say" -> "she said" (correct verb form)
- "I responds" -> "I responded" (correct verb form)
- "ain't" -> "I'm not" (correct contraction)
- "he not"
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: **Process:**

The reaction between solid calcium carbonate (CaCO3) and hydrochloric acid (HCl) is a type of acid-base reaction, also known as a neutralization reaction. In this reaction, the acid (HCl) reacts with the base (CaCO3) to form a salt (CaCl2), water (H2O), and carbon dioxide (CO2).

Here's a step-by-step description of the process:

1. The solid calcium carbonate (CaCO3) is placed in a container.
2. Hydrochloric acid (HCl) is slowly added to the calcium carbonate while stirring.
3. As the acid reacts with the base, the mixture starts to fizz and bubble, indicating the release of carbon dioxide gas.
4. The reaction mixture becomes warm, and a white precipitate of calcium chloride (CaCl2) may form.
5. The reaction is complete when the acid has been fully neutralized, and no more bubbles are produced.

**Balanced Chemical Equation:**

CaCO3 (s) + 2HCl (aq) → CaCl2 (aq) + H2O (l) + CO2 (g)

**Type of Reaction:**

This reaction is a type of acid-base
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: To solve the inequality |x + 5| < 10, we need to consider two cases:

Case 1: x + 5 ≥ 0
|x + 5| = x + 5
x + 5 < 10
x < 5

Case 2: x + 5 < 0
|x + 5| = -(x + 5)
-(x + 5) < 10
-x - 5 < 10
-x < 15
x > -15

Combining the two cases, we get:
-15 < x < 5

Now, we need to find the integers in this range. The integers in this range are:
-14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4

There are 19 integers in the solution of the inequality |x + 5| < 10.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: If I have just overtaken the second person, that means I have moved up to the second position. 

The person I just overtook is now in the third position.
--------------------------------------------------
--------------------------------------------------
prompt: None
generated text: Hand dryers.  A most intriguing topic.  Now, I've given this considerable thought, and I must say, I find them to be a most... efficient use of technology.  However, I do have some reservations regarding their effectiveness.  You see, I've conducted experiments, and I've found that hand dryers often fail to completely dry one's hands, particularly in colder climates.  This, of course, leads to a higher risk of bacterial and fungal infections.

Furthermore, I've noticed that many hand dryers are not designed with the optimal air flow in mind.  They often blow air at an angle, rather than directly at the hands, which can lead to a less-than-desirable drying experience.  And don't even get me started on the noise level.  Some of these things are as loud as a jet engine taking off.

Now, I know what you're thinking: "Sheldon, why not just use paper towels?"  Well, my friend, paper towels are a far more hygienic option, but they're also a waste of resources.  I mean, think about it: all that paper, just being used once and then discarded.  It's a travesty, really.

In conclusion, while hand dry
--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 2114
num_drafts: 944
num_draft_tokens: 4720
num_accepted_tokens: 1147
mean acceptance length: 2.22
--------------------------------------------------
acceptance at token 0: 0.63
acceptance at token 1: 0.33
acceptance at token 2: 0.15
acceptance at token 3: 0.07
acceptance at token 4: 0.03

acceptance rate & throughput with async scheduling

Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 16683.79it/s]
Processed prompts: 100%|█████████████████████████████| 80/80 [00:02<00:00, 26.88it/s, est. speed input: 2706.33 toks/s, output: 5741.97 toks/s]
--------------------------------------------------
total_num_output_tokens: 17086
num_drafts: 6880
num_draft_tokens: 34400
num_accepted_tokens: 10026
mean acceptance length: 2.46
--------------------------------------------------
acceptance at token 0: 0.69
acceptance at token 1: 0.40
acceptance at token 2: 0.21
acceptance at token 3: 0.11
acceptance at token 4: 0.05

acceptance rate & throughput without async scheduling

Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 16828.54it/s]
Processed prompts: 100%|█████████████████████████████| 80/80 [00:03<00:00, 26.12it/s, est. speed input: 2629.76 toks/s, output: 5567.79 toks/s]
--------------------------------------------------
total_num_output_tokens: 17050
num_drafts: 6980
num_draft_tokens: 34900
num_accepted_tokens: 10102
mean acceptance length: 2.45
--------------------------------------------------
acceptance at token 0: 0.68
acceptance at token 1: 0.40
acceptance at token 2: 0.21
acceptance at token 3: 0.11
acceptance at token 4: 0.05

github-actions · 2025-08-05T16:38:34Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request aims to integrate asynchronous scheduling with speculative decoding. The changes involve updating how output placeholders are handled and caching speculative decoding results in the model runner to cope with the one-step delay in the async scheduler. However, there's a critical issue in vllm/v1/core/sched/async_scheduler.py that leads to an incorrect calculation of the number of tokens to cache, causing an AssertionError as described in the PR description. My review provides a fix for this issue.

mergify · 2025-08-11T16:52:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: qizixi <qizixi@meta.com>

cadedaniel · 2025-08-23T21:59:34Z

I find with multiple API servers, the acceptance rate can drop: Mean acceptance length: 1.89 vs Mean acceptance length: 3.90.

This might be some unbalanced routing in the API server layer, but I would find that surprising?

(ApiServer_0 pid=324041) INFO 08-23 21:53:38 [loggers.py:123] Engine 000: Avg prompt throughput: 6048.5 tokens/s, Avg generation throughput: 8966.3 tokens/s, Running: 12 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 0.0%
(ApiServer_0 pid=324041) INFO 08-23 21:53:38 [metrics.py:87] SpecDecoding metrics: Draft acceptance rate: 74.5%, Mean acceptance length: 3.90, Accepted: 73710 tokens, Drafted: 99002 tokens, Per-position acceptance rate: 0.789, 0.731, 0.699, 0.683
(ApiServer_1 pid=324042) INFO 08-23 21:53:38 [loggers.py:123] Engine 000: Avg prompt throughput: 6346.7 tokens/s, Avg generation throughput: 8558.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(ApiServer_1 pid=324042) INFO 08-23 21:53:38 [metrics.py:87] SpecDecoding metrics: Draft acceptance rate: 22.2%, Mean acceptance length: 1.89, Accepted: 533 tokens, Drafted: 2404 tokens, Per-position acceptance rate: 0.246, 0.241, 0.200, 0.200

Without this PR, the mean acceptance length is ~3.85 on the same fixed dataset.

Startup command, I'm on vllm 79899b6

vllm serve llama-8b \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 1536 \
    --tensor-parallel-size 1 \ 
    --api-server-count 2 \ 
    --speculative-config '{"model": "ngram", "num_speculative_tokens": 7, "prompt_lookup_max": 7, "prompt_lookup_min": 1}' \
    --no-enable-prefix-caching \
    --port 8002 \
    --async-scheduling

njhill · 2025-08-24T01:14:36Z

@cadedaniel unfortunately due to how the metrics are computed they aren't correct when logged from multiple API servers: #21954. I'm guessing that's the reason for the discrepancy you're seeing. The corresponding prometheus metrics however should be correct, since they're aggregated via the prometheus client multi-proc support.

mergify · 2025-10-08T14:28:21Z

Documentation preview: https://vllm--22262.org.readthedocs.build/en/22262/

zixi-qi force-pushed the async-scheduler-with-spec-decode branch from 9a6f640 to 82deff1 Compare August 5, 2025 16:39

mergify bot added documentation Improvements or additions to documentation speculative-decoding v1 labels Aug 5, 2025

gemini-code-assist bot reviewed Aug 5, 2025

View reviewed changes

zixi-qi removed the documentation Improvements or additions to documentation label Aug 8, 2025

mergify bot added the documentation Improvements or additions to documentation label Aug 8, 2025

zixi-qi force-pushed the async-scheduler-with-spec-decode branch 2 times, most recently from 3b9ddec to 6f64741 Compare August 10, 2025 22:39

zixi-qi changed the title ~~[WIP] Async scheduling integration with spec decode~~ [V1][Spec Decode] Async scheduling integration with spec decode Aug 10, 2025

zixi-qi marked this pull request as ready for review August 10, 2025 23:03

zixi-qi requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners August 10, 2025 23:03

zixi-qi requested review from houseroad, luccafong and morgendave August 10, 2025 23:04

mergify bot added the needs-rebase label Aug 11, 2025

zixi-qi added 4 commits August 11, 2025 10:10

refactor async_scheduler to keep num_output_placeholders constant

d8faf45

Signed-off-by: qizixi <qizixi@meta.com>

async scheduling + spec decode

1dee015

Signed-off-by: qizixi <qizixi@meta.com>

further fixes

5fbbc6f

Signed-off-by: qizixi <qizixi@meta.com>

fix issue and add unit test

489c91d

Signed-off-by: qizixi <qizixi@meta.com>

zixi-qi force-pushed the async-scheduler-with-spec-decode branch from 6f64741 to 489c91d Compare August 11, 2025 17:16

mergify bot removed the needs-rebase label Aug 11, 2025

JaheimLee mentioned this pull request Aug 29, 2025

[Perf][V1] Fully overlap model execution #23569

Merged

benchislett mentioned this pull request Sep 3, 2025

[Bugfix] [Performance]Better MTP Support when use flashmla #24045

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

[V1][Spec Decode] Async scheduling integration with spec decode #22262

[V1][Spec Decode] Async scheduling integration with spec decode #22262

zixi-qi commented Aug 5, 2025 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Aug 5, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

mergify bot commented Aug 11, 2025

Uh oh!

cadedaniel commented Aug 23, 2025 •

edited

Loading

Uh oh!

njhill commented Aug 24, 2025

Uh oh!

mergify bot commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Uh oh!

[V1][Spec Decode] Async scheduling integration with spec decode #22262

Are you sure you want to change the base?

[V1][Spec Decode] Async scheduling integration with spec decode #22262

Conversation

zixi-qi commented Aug 5, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Essential Elements of an Effective PR Description Checklist

Purpose

Test Plan

Test Results

Uh oh!

github-actions bot commented Aug 5, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

mergify bot commented Aug 11, 2025

Uh oh!

cadedaniel commented Aug 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

njhill commented Aug 24, 2025

Uh oh!

mergify bot commented Oct 8, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

zixi-qi commented Aug 5, 2025 •

edited by github-actions bot

Loading

cadedaniel commented Aug 23, 2025 •

edited

Loading