
Conversation

@southfreebird (Contributor) commented Feb 25, 2025

This PR was created by the Nebius team.

The main focus of this PR is to fix guided generation for speculative decoding. We found that when using the xGrammar backend with speculative decoding, vLLM crashes here. This PR addresses the issue by using a rollback mechanism in xGrammar.
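
For context, here is a minimal sketch (not the PR's actual code) of how xGrammar's rollback can reconcile the grammar state with speculative tokens. It assumes a GrammarMatcher constructed with max_rollback_tokens at least as large as the number of draft tokens:

import xgrammar as xgr

def advance_with_rollback(matcher: xgr.GrammarMatcher,
                          draft_tokens: list[int],
                          num_accepted: int) -> None:
    # Advance the grammar state over the proposed draft tokens.
    advanced = 0
    for token in draft_tokens:
        if not matcher.accept_token(token):
            break  # token not allowed by the grammar; stop advancing
        advanced += 1
    # The target model verified only num_accepted drafts, so undo the
    # grammar state for the rejected suffix.
    if advanced > num_accepted:
        matcher.rollback(advanced - num_accepted)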

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run the remaining CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@hmellor (Member) commented Feb 25, 2025

Thanks for the contribution!

Please make sure that your commits are signed off (instructions here).

Also, some of the entrypoint tests are failing with:

[2025-02-25T12:06:05Z]     def _vocab_size(self) -> int:
[2025-02-25T12:06:05Z]         """Get the vocab size of the model and make sure it's consistent between
[2025-02-25T12:06:05Z]         draft and target workers.
[2025-02-25T12:06:05Z]         """
[2025-02-25T12:06:05Z]         vocab_sizes = [
[2025-02-25T12:06:05Z]             worker.vocab_size
[2025-02-25T12:06:05Z]             for worker in [self.proposer_worker, self.scorer_worker]
[2025-02-25T12:06:05Z]         ]
[2025-02-25T12:06:05Z] >       assert all(vocab_sizes[0] == vocab_size for vocab_size in vocab_sizes)
[2025-02-25T12:06:05Z] E       AssertionError

Which appears to be relevant.

@mgoin (Member) commented Feb 25, 2025

Oh, this is interesting; I hadn't considered the need for rollback until this case. Thanks for your work! I think it is crucial to add a test that uses guided decoding and speculation together, since AFAIK we haven't exercised these together before.

@mgoin (Member) commented Feb 25, 2025

@russellb @aarnphm

@aarnphm (Collaborator) commented Feb 25, 2025

I will look into this today for v1. But thanks for making the PR.

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch 2 times, most recently from fb8d0f3 to 083ea16 on February 25, 2025 13:35
@southfreebird (Contributor, Author) commented

OK, thank you for pointing out the issues with the tests.

We tried using speculative decoding and guided decoding together and were surprised that it didn't work. In any case, I'm happy if this code helps you; at the very least, it works well for our case.

Review comment (Collaborator):

Can you keep this model? Is there a specific reason for using a larger model for CI?

Reply (Contributor, Author):

Sorry, my bad.
I thought the 7B model was used initially. Reverting to the 1.5B.

Comment on lines 19 to 34
Review comment (Collaborator):

Let's try to reduce the amount of change that is not relevant to the PR.

I don't see the need for the diff here. Given that it is a pytest fixture, there is no need to raise an exception here: if the test uses the wrong params, the test won't even run.

Review comment (Collaborator):

Quick question: why do we need to add 1 to num_lookahead_slots here?

Reply (Contributor, Author):

I double-checked this spot; it's safe to remove the +1. Initially, it was a workaround for the bonus token.
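
For readers unfamiliar with the term: in v0 spec decode, a verification step with k draft tokens can also emit one extra "bonus" token when all k drafts are accepted, which is why some code paths size things as k + 1. A trivial illustration (assumed semantics, not the PR's code):

def max_tokens_per_step(num_speculative_tokens: int) -> int:
    # Up to k accepted draft tokens, plus one bonus token sampled from
    # the target model's final position when all k are accepted.
    return num_speculative_tokens + 1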

Review comment (Collaborator):

Sounds good, thanks for the explanation.

Comment on lines 22 to 37
Review comment (Member):

Change this to an elif mode == "speculative": branch, and raise an exception in the else case saying the mode is unsupported, as sketched below.
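
A sketch of the suggested structure (the mode values come from the discussion; the surrounding code is assumed):

if mode == "autoregressive":
    ...  # existing autoregressive setup
elif mode == "speculative":
    ...  # existing speculative setup
else:
    raise ValueError(f"Unsupported mode: {mode}")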

Reply (Contributor, Author):

The opposite was suggested in this comment: #13823 (comment) :)

Comment on lines 19 to 35
Review comment (Member):

I'm not sure I understand how this works - will this fixture now run all of the tests in this file for each entry in params?

Reply (Contributor, Author):

It first runs all the tests with "autoregressive", then loads the "speculative" model and runs all the tests in the file with it.

Review comment (Collaborator):

Actually, for a parametrized pytest fixture this runs both cases, so there's no need for an exception:

def test_use_llms(llm):
    ...

Then there would be two tests: test_use_llms[llm_autoregressive] and test_use_llms[llm_speculative].

https://docs.pytest.org/en/6.2.x/fixture.html#parametrizing-fixtures
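
A minimal runnable sketch of such a parametrized fixture (illustrative names, not the PR's exact code):

import pytest

@pytest.fixture(scope="module", params=["autoregressive", "speculative"])
def llm(request):
    # pytest instantiates the fixture once per param, so every test that
    # takes llm runs twice, once for each mode.
    return request.param  # stand-in for constructing the real LLM here

def test_use_llms(llm):
    assert llm in ("autoregressive", "speculative")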

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from a299146 to 845a47f on February 26, 2025 21:38
@southfreebird (Contributor, Author) commented

Hi @mgoin @aarnphm
I rebased the PR and resolved the conflicts.

Do I need to do anything else on my end? Are you waiting for me to make any changes to the code?
I'm not sure if I need to fix this: #13823 (comment)

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from 25a877b to dd71c5f on February 26, 2025 23:56
@aarnphm (Collaborator) commented Feb 27, 2025

Hmm, it seems like the test failure is not related?

@aarnphm (Collaborator) commented Feb 27, 2025

Can you rename the PR title accordingly, to [V0][Fix] structured decoding compatibility with speculative decoding?

@aarnphm (Collaborator) left a review:

Seems like the test failure is not related, but the tests for this change pass, so LGTM.

@southfreebird southfreebird changed the title from "Fix xgrammar decoding for speculative decoding" to "[V0][Fix] structured decoding compatibility with speculative decoding" on Feb 27, 2025
@aarnphm aarnphm requested a review from LiuXiaoxuanPKU March 19, 2025 10:49
@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from e7305d0 to a717563 Compare March 20, 2025 19:03
@mergify mergify bot removed the needs-rebase label Mar 20, 2025
@aarnphm (Collaborator) left a review:

I want to hear more of your thoughts on this:

I talked with Woosuk offline, and I think that making structured outputs work properly with spec decode in v0 would require covering the combinations of structured output backends with each spec decode worker (MQA, EAGLE, draft), which requires a significant amount of work.

Given that we are focusing on moving to v1 soon, I'm wondering whether it is better to focus all of the effort there, and in v0 simply say that "structured outputs won't work with spec decode".

I'm also not familiar with spec decode performance in v0, so I don't have much say here (I haven't explored spec decode deeply in v0).

cc @benchislett on this

@southfreebird (Contributor, Author) commented

We are using spec decode in v0, and we really need this feature because there is still a lot of change happening around it, such as the new guided backend.
This PR fixes a lot of problems in v0 and is very helpful for us. Since it resolves many issues and produces correct behaviour rather than a core dump in vLLM, I don't really understand the motivation for not adding it upstream.
Focusing on v1 might be a good idea, but since v1 doesn't fully support speculative decoding yet, we still want to use it in v0.

@aarnphm (Collaborator) commented Mar 21, 2025

I'm also good with supporting a small subset of spec decode features and clearly stating which ones work with structured outputs, given that some of the spec decode APIs are still being worked on in v1.

@southfreebird (Contributor, Author) commented

Based on that, does this mean we're good to merge, or do you want something else from my end?

@southfreebird (Contributor, Author) commented

Hi team,
I just wanted to check whether you are still planning to merge this PR. Thanks!

@ItzAmirreza commented

Hi team :)
Any updates?

@aarnphm aarnphm requested review from benchislett and mgoin April 18, 2025 03:38
@mergify (bot) commented Apr 18, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @southfreebird.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@southfreebird southfreebird force-pushed the feature/speculative-decoding-and-guided-output-fix branch from b71c816 to 4e40c90 on April 29, 2025 14:56
@mergify mergify bot removed the needs-rebase label Apr 29, 2025
@aarnphm aarnphm self-requested a review April 29, 2025 17:21
@russellb (Member) commented

First of all, thank you very much for your hard work and contribution!

I thought about this and discussed it with some other maintainers; we decided it would be best not to merge this. The reasons are roughly those discussed above: covering all of the V0 spec decode paths would take significant work, and effort is now focused on V1.

Folks are welcome to continue using this in custom builds in the meantime, but I hope it won't be too long before everything needed is supported in V1.

Thank you again for the hard work, and I apologize that the PR has been in limbo for this long.
