
Conversation

@southfreebird (Contributor) commented Feb 25, 2025

This PR was created by the Nebius team.

The main focus of this PR is to fix guided generation for speculative decoding. We found that when using the xGrammar backend with speculative decoding, vLLM crashes here. This PR addresses the issue by using a rollback mechanism in xGrammar.
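
For context, here is a minimal sketch (not the PR's actual code) of how xGrammar's rollback can reconcile the grammar state with speculative tokens. It assumes a GrammarMatcher constructed with max_rollback_tokens at least as large as the number of draft tokens:

import xgrammar as xgr

def advance_with_rollback(matcher: xgr.GrammarMatcher,
                          draft_tokens: list[int],
                          num_accepted: int) -> None:
    # Advance the grammar state over the proposed draft tokens.
    advanced = 0
    for token in draft_tokens:
        if not matcher.accept_token(token):
            break  # token not allowed by the grammar; stop advancing
        advanced += 1
    # The target model verified only num_accepted drafts, so undo the
    # grammar state for the rejected suffix.
    if advanced > num_accepted:
        matcher.rollback(advanced - num_accepted)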

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run the remaining CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hmellor (Member) commented Feb 25, 2025

Thanks for the contribution!

Please make sure that your commits are signed off (instructions here).

Also, some of the entrypoint tests are failing with:

[2025-02-25T12:06:05Z]     def _vocab_size(self) -> int:
[2025-02-25T12:06:05Z]         """Get the vocab size of the model and make sure it's consistent between
[2025-02-25T12:06:05Z]         draft and target workers.
[2025-02-25T12:06:05Z]         """
[2025-02-25T12:06:05Z]         vocab_sizes = [
[2025-02-25T12:06:05Z]             worker.vocab_size
[2025-02-25T12:06:05Z]             for worker in [self.proposer_worker, self.scorer_worker]
[2025-02-25T12:06:05Z]         ]
[2025-02-25T12:06:05Z] >       assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)
[2025-02-25T12:06:05Z] E       AssertionError

Which appears to be relevant.

@mgoin (Member) commented Feb 25, 2025

Oh, this is interesting; I hadn't considered the need for rollback until this case. Thanks for your work! I think it is crucial to add a test that uses guided decoding and speculation together, since AFAIK we haven't exercised these together before.

@mgoin (Member) commented Feb 25, 2025

@russellb @aarnphm

@aarnphm (Collaborator) commented Feb 25, 2025

I will look into this today for v1. But thanks for making the PR.

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch 2 times, most recently from fb8d0f3 to 083ea16 on February 25, 2025 13:35
@southfreebird (Contributor, Author) commented

OK, thank you for pointing out the issues with the tests.

We tried using speculative decoding and guided decoding together and were surprised that it didn't work. In any case, I'm happy if this code helps you; at the very least, it works well for our case.

Review comment (Collaborator):

Can you keep this model? Is there a specific reason for using a larger model for CI?

Reply (Contributor, Author):

Sorry, my bad.
I thought the 7B model was used initially. Reverting to the 1.5B.

Comment on lines 19 to 34
Review comment (Collaborator):

Let's try to reduce the amount of change that is not relevant to the PR.

I don't see the need for the diff here. Given that it is a pytest fixture, there is no need to raise an exception here: if the test uses the wrong params, the test won't even run.

Review comment (Collaborator):

Quick question: why do we need to add 1 to num_lookahead_slots here?

Reply (Contributor, Author):

I double-checked this spot; it's safe to remove the +1. Initially, it was a workaround for the bonus token.
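
For readers unfamiliar with the term: in v0 spec decode, a verification step with k draft tokens can also emit one extra "bonus" token when all k drafts are accepted, which is why some code paths size things as k + 1. A trivial illustration (assumed semantics, not the PR's code):

def max_tokens_per_step(num_speculative_tokens: int) -> int:
    # Up to k accepted draft tokens, plus one bonus token sampled from
    # the target model's final position when all k are accepted.
    return num_speculative_tokens + 1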

Review comment (Collaborator):

Sounds good, thanks for the explanation.

Comment on lines 22 to 37
Review comment (Member):

Change this to an elif mode == "speculative": branch, and raise an exception in the else case saying the mode is unsupported, as sketched below.
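
A sketch of the suggested structure (the mode values come from the discussion; the surrounding code is assumed):

if mode == "autoregressive":
    ...  # existing autoregressive setup
elif mode == "speculative":
    ...  # existing speculative setup
else:
    raise ValueError(f"Unsupported mode: {mode}")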

Reply (Contributor, Author):

The opposite was suggested in this comment: #13823 (comment) :)

Comment on lines 19 to 35
Review comment (Member):

I'm not sure I understand how this works - will this fixture now run all of the tests in this file for each entry in params?

Reply (Contributor, Author):

It first runs all the tests with "autoregressive", then loads the "speculative" model and runs all the tests in the file with it.

Review comment (Collaborator):

Actually, for a parametrized pytest fixture this runs both cases, so there's no need for an exception:

def test_use_llms(llm):
    ...

Then there would be two tests: test_use_llms[llm_autoregressive] and test_use_llms[llm_speculative].

https://docs.pytest.org/en/6.2.x/fixture.html#parametrizing-fixtures
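
A minimal runnable sketch of such a parametrized fixture (illustrative names, not the PR's exact code):

import pytest

@pytest.fixture(scope="module", params=["autoregressive", "speculative"])
def llm(request):
    # pytest instantiates the fixture once per param, so every test that
    # takes llm runs twice, once for each mode.
    return request.param  # stand-in for constructing the real LLM here

def test_use_llms(llm):
    assert llm in ("autoregressive", "speculative")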

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from a299146 to 845a47f on February 26, 2025 21:38
@southfreebird (Contributor, Author) commented

Hi @mgoin @aarnphm
I rebased the PR and resolved the conflicts.

Do I need to do anything else on my end? Are you waiting for me to make any changes to the code?
I'm not sure if I need to fix this: #13823 (comment)

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from 25a877b to dd71c5f on February 26, 2025 23:56
@aarnphm (Collaborator) commented Feb 27, 2025

Hmm, it seems like the test failure is not related?

@aarnphm (Collaborator) commented Feb 27, 2025

Can you rename the PR title accordingly, to [V0][Fix] structured decoding compatibility with speculative decoding?

@aarnphm (Collaborator) left a review:

Seems like the test failure is not related, but the tests for this change pass, so LGTM.

@southfreebird southfreebird changed the title from "Fix xgrammar decoding for speculative decoding" to "[V0][Fix] structured decoding compatibility with speculative decoding" on Feb 27, 2025
@aarnphm aarnphm requested a review from LiuXiaoxuanPKU March 19, 2025 10:49
@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from e7305d0 to a717563 Compare March 20, 2025 19:03
@mergify mergify bot removed the needs-rebase label Mar 20, 2025
@aarnphm (Collaborator) left a review:

I want to hear more of your thoughts on this:

I talked with Woosuk offline, and I think that making structured outputs work properly with spec decode in v0 would require covering the combinations of structured output backends with each spec decode worker (MQA, EAGLE, draft), which requires a significant amount of work.

Given that we are focusing on moving to v1 soon, I'm wondering whether it is better to focus all of the effort there, and in v0 simply say that "structured outputs won't work with spec decode".

I'm also not familiar with spec decode performance in v0, so I don't have much say here (I haven't explored spec decode deeply in v0).

cc @benchislett on this

@southfreebird (Contributor, Author) commented

We are using spec decode in v0, and we really need this feature because there is still a lot of change happening around it, such as the new guided backend.
This PR fixes a lot of problems in v0 and is very helpful for us. Since it resolves many issues and produces correct behaviour rather than a core dump in vLLM, I don't really understand the motivation for not adding it upstream.
Focusing on v1 might be a good idea, but since v1 doesn't fully support speculative decoding yet, we still want to use it in v0.

@aarnphm (Collaborator) commented Mar 21, 2025

I'm also good with supporting a small subset of spec decode features and clearly stating which ones work with structured outputs, given that some of the spec decode APIs are still being worked on in v1.

@southfreebird (Contributor, Author) commented

Based on that, does this mean we're good to merge, or do you want something else from my end?

@southfreebird (Contributor, Author) commented

Hi team,
I just wanted to check whether you are still planning to merge this PR. Thanks!

@ItzAmirreza commented

Hi team :)
Any updates?

@aarnphm aarnphm requested review from benchislett and mgoin April 18, 2025 03:38
@mergify (bot) commented Apr 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @southfreebird.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from b71c816 to 4e40c90 on April 29, 2025 14:56
@mergify mergify bot removed the needs-rebase label Apr 29, 2025
@aarnphm aarnphm self-requested a review April 29, 2025 17:21
@russellb (Member) commented

First of all, thank you very much for your hard work and contribution!

I thought about this and discussed it with some other maintainers; we decided it would be best not to merge this. The reasons are roughly those discussed above: covering all of the V0 spec decode paths would take significant work, and effort is now focused on V1.

Folks are welcome to continue using this in custom builds in the meantime, but I hope it won't be too long before everything needed is supported in V1.

Thank you again for the hard work, and I apologize that the PR has been in limbo for this long.
