[V1][Feature] Enable Speculative Decoding with Structured Outputs #14702

benchislett · 2025-03-12T20:21:44Z

This PR changes the structured outputs behaviour in a few ways:

The generated speculative draft tokens are validated according to the grammar to ensure they match the grammar constraints. This operation should not modify the matcher state.
A bitmask is generated by the scheduler for each speculative draft token, such that the gpu runner can apply the appropriate bitmask to each position in the speculative sequence to ensure the sampled bonus token always adheres to the constraints. The easiest way I saw to accomplish this in the payload is to compact the bitmask and interleave the masks for the speculative tokens for each request. Then the gpu runner will always need to unpack the bitmask (since the compacted bitmask skips the non-structured requests even when speculative decoding is not enabled).

I also tweaked the benchmark file to fix some minor issues I had when running it locally, as well as modifying the prompt slightly to discourage infinite repetition when temperature==0. After this fix, I get 100% correctness when benchmarking with speculative decoding enabled.

NOTE: This PR is now compatible with both xGrammar and Guidance backends in V1.

…her.rollback) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

… decoding Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

github-actions · 2025-03-12T20:21:54Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-03-12T20:22:23Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…-outputs Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

russellb · 2025-04-29T13:32:31Z

Let's get this in.

cc @russellb when you have bandwidth.

Yeah, I'm good with it. I just wanted to give @WoosukKwon a chance to do a final review, given that it touches such sensitive code.

russellb · 2025-04-29T13:33:57Z

also this needs to be updated against main. I tried to do it but don't have access to the branch.

WoosukKwon

LGTM!

WoosukKwon · 2025-04-14T00:03:58Z

benchmarks/benchmark_serving_structured_output.py


        def gen_prompt(index: int):
-            return f"Generate an example of a user profile given the following schema: {json.dumps(get_schema(index))}"  # noqa: E501
+            return f"Generate an example of a brief user profile given the following schema: {json.dumps(get_schema(index))}"  # noqa: E501


What is this change for?

some models are endlessly repeating outputs on the benchmark prompt. this tweak was enough to tip the scales

mergify · 2025-04-29T19:54:34Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

WoosukKwon · 2025-04-29T19:55:11Z

@benchislett Thanks for the PR. Could you please merge from main? That will fix the docker build error.

…-decoding-with-structured-outputs Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

benchislett · 2025-04-29T21:53:37Z

merged again.

russellb · 2025-04-30T01:14:41Z

Thank you for the hard work on this PR!

…lm-project#14702) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>

…lm-project#14702) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>

…lm-project#14702) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

…ar_bitmask` method (#2022) ### What this PR does / why we need it? Fix #2033 Sync vllm-project/vllm#14702 to solve `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by upstream vllm - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@6e599ee Signed-off-by: ApsarasX <apsarax@outlook.com>

…rammar_bitmask` method (#2314) ### What this PR does / why we need it? Fix #2033 Sync vllm-project/vllm#14702 to solve `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method ### Does this PR introduce _any_ user-facing change? No Signed-off-by: shen-shanshan <467638484@qq.com>

…ar_bitmask` method (vllm-project#2022) ### What this PR does / why we need it? Fix vllm-project#2033 Sync vllm-project/vllm#14702 to solve `grammar_bitmask` IndexError caused by outdated `apply_grammar_bitmask` method ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by upstream vllm - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@6e599ee Signed-off-by: ApsarasX <apsarax@outlook.com>

benchislett added 7 commits March 12, 2025 15:16

bump xgrammar dependency version (needed for bugfixes to grammar_matc…

b7c8a8a

…her.rollback) Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

validate and filter draft tokens according to structured output grammar

5bc9ec3

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

remove request validation allowing spec+structured output

09acb1a

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

implement compressed-batch structured output bitmask with speculative…

a5b7521

… decoding Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

fix num spec tokens

24e5e15

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

tweaks to benchmark script

36ed442

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

comment about perf improvement potential

730d81f

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

benchislett requested review from WoosukKwon, alexm-redhat, comaniac, mgoin, njhill, robertgshaw2-redhat, russellb and ywang96 as code owners March 12, 2025 20:21

mergify bot added the ci/build label Mar 12, 2025

mergify bot added v1 needs-rebase labels Mar 12, 2025

Merge branch 'main' into feat-v1-speculative-decoding-with-structured…

a15d5a0

…-outputs Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

mergify bot removed the needs-rebase label Mar 12, 2025

LiuXiaoxuanPKU self-assigned this Mar 12, 2025

WoosukKwon self-assigned this Mar 12, 2025

bugfix

7726ae4

Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

aarnphm self-requested a review March 13, 2025 06:58

aarnphm mentioned this pull request Apr 29, 2025

[V1][Experimental] Jump-forward decoding #15490

Draft

WoosukKwon approved these changes Apr 29, 2025

View reviewed changes

WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 29, 2025

mergify bot added the needs-rebase label Apr 29, 2025

Merge remote-tracking branch 'upstream/main' into feat-v1-speculative…

07292cb

…-decoding-with-structured-outputs Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>

mergify bot removed the needs-rebase label Apr 29, 2025

mgoin approved these changes Apr 29, 2025

View reviewed changes

mgoin enabled auto-merge (squash) April 29, 2025 22:17

mgoin merged commit 34120f5 into vllm-project:main Apr 30, 2025
47 checks passed

github-project-automation bot moved this from In review to Done in Structured Output Apr 30, 2025

russellb mentioned this pull request Apr 30, 2025

[V0][Fix] structured decoding compatibility with speculative decoding #13823

Closed

shen-shanshan mentioned this pull request May 5, 2025

[V1][Structured Output] Enable Speculative Decoding with Structured Outputs vllm-project/vllm-ascend#751

Closed

1 task

ApsarasX mentioned this pull request Jul 25, 2025

[Bugfix] grammar_bitmask IndexError caused by outdated apply_grammar_bitmask method vllm-project/vllm-ascend#2022

Merged

shen-shanshan mentioned this pull request Aug 19, 2025

[Bugfix] Fix grammar_bitmask IndexError caused by outdated apply_grammar_bitmask method vllm-project/vllm-ascend#2314

Merged

Uh oh!

[V1][Feature] Enable Speculative Decoding with Structured Outputs #14702

[V1][Feature] Enable Speculative Decoding with Structured Outputs #14702

Uh oh!

Conversation

benchislett commented Mar 12, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Mar 12, 2025

Uh oh!

mergify bot commented Mar 12, 2025

Uh oh!

russellb commented Apr 29, 2025

Uh oh!

russellb commented Apr 29, 2025

Uh oh!

WoosukKwon left a comment

Choose a reason for hiding this comment

Uh oh!

WoosukKwon Apr 14, 2025

Choose a reason for hiding this comment

Uh oh!

benchislett Apr 29, 2025

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Apr 29, 2025

Uh oh!

WoosukKwon commented Apr 29, 2025

Uh oh!

benchislett commented Apr 29, 2025

Uh oh!

Uh oh!

russellb commented Apr 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

benchislett commented Mar 12, 2025 •

edited by github-actions bot

Loading