[V0][Fix] structured decoding compatibility with speculative decoding #13823
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Thanks for the contribution! Please make sure that your commits are signed off (instructions here). Also, some of the entrypoint tests are failing with an error that appears to be relevant.
Oh, this is interesting; I had not considered the need for rollback until this case. Thanks for your work. I think it is crucial to add a test that uses guided decoding and speculation together, since AFAIK we haven't exercised these together.
I will look into this for v1. But thanks for making the PR.
Force-pushed from fb8d0f3 to 083ea16
Ok, thank you for pointing out the issues with the tests. We tried using speculative decoding and guided decoding together and were surprised that it didn't work. Anyway, I'm happy if this code helps you; at least, it works well for our case.
Can you keep this model? Is there a specific reason for using a larger model for CI?
Sorry, my bad.
I thought the 7B model was used initially. Reverting to the 1.5B.
Let's try to reduce the amount of change that is not relevant to the PR.
I don't see the diff here. Given that it is a pytest fixture, there is no need to raise an exception here; if the test uses wrong params, it won't even run.
qq: why do we need to add 1 to num_lookahead_slots here?
I double-checked this place; it's safe to remove this +1
Initially, it was a workaround for the bonus token
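For context, a small, hypothetical illustration of where a "+1" tends to come from in spec decode bookkeeping (the function below is an assumption for illustration, not code from this PR): during verification the target model appends one extra token of its own, the "bonus" token on full acceptance or the corrected token at the first rejection, so a step can emit up to one more token than the number of draft slots.

```python
def tokens_appended_this_step(num_draft_tokens: int, num_accepted: int) -> int:
    """Hypothetical illustration, not code from this PR.

    On top of the accepted draft tokens, the target model appends one token
    of its own: a "bonus" token when every draft token is accepted, or the
    corrected token at the first rejection. Either way, a step can emit up
    to num_draft_tokens + 1 tokens, which is the slot a "+1" on
    num_lookahead_slots would be guarding.
    """
    assert 0 <= num_accepted <= num_draft_tokens
    return num_accepted + 1
```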
Sounds good, thanks for the explanation.
Change this to be an `elif mode == "speculative":` and raise an exception in the else case saying the mode is unsupported (see the sketch below).
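A minimal sketch of the suggested control flow; the helper name, mode strings, and kwargs are assumptions for illustration, not the fixture's actual code:

```python
def build_llm_kwargs(mode: str) -> dict:
    # Hypothetical helper mirroring the suggested if/elif/else structure.
    if mode == "autoregressive":
        return {}
    elif mode == "speculative":
        # Illustrative speculative-decoding setting; the real fixture may differ.
        return {"num_speculative_tokens": 5}
    else:
        raise ValueError(f"Unsupported mode: {mode}")
```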
The opposite suggestion was made in this comment: #13823 (comment) :)
I'm not sure I understand how this works - will this fixture now run all of the tests in this file for each entry in params?
It first runs all the tests with "autoregressive", and then loads the "speculative" model and runs all the tests in the file with it.
Actually, for a pytest fixture this would run both cases, so there is no need for an exception. With `def test_use_llms(llm): ...`, there would be two tests: `test_use_llms[llm_autoregressive]` and `test_use_llms[llm_speculative]`. See https://docs.pytest.org/en/6.2.x/fixture.html#parametrizing-fixtures and the sketch below.
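A minimal, self-contained sketch of a parametrized fixture, assuming the two modes from this discussion; the `FakeLLM` stand-in and the param names are purely illustrative:

```python
import pytest


class FakeLLM:
    """Illustrative stand-in for the real LLM object built by the fixture."""

    def __init__(self, mode: str):
        self.mode = mode


@pytest.fixture(scope="module", params=["llm_autoregressive", "llm_speculative"])
def llm(request):
    # pytest instantiates this fixture once per entry in `params`, so every
    # test in the module runs twice, once against each fixture instance.
    return FakeLLM(request.param)


def test_use_llms(llm):
    # Collected as test_use_llms[llm_autoregressive] and
    # test_use_llms[llm_speculative].
    assert llm.mode in ("llm_autoregressive", "llm_speculative")
```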
Force-pushed from a299146 to 845a47f
Hi @mgoin @aarnphm, do I need to do anything else on my end? Are you waiting for me to make any changes to the code?
Force-pushed from 25a877b to dd71c5f
Hmm, it seems like the test failure is not related?
Can you rename the PR title accordingly?
Seems like the test failure is not related, but the tests for this pass, so LGTM.
Force-pushed from e7305d0 to a717563
I want to hear more of your thoughts on this:
I talked with Woosuk offline, and I think that making structured outputs work properly with spec decode in v0 would require covering a small subset of backends across the spec decode workers (MQA, EAGLE, draft), which is a significant amount of work.
Given that we are focusing on moving to v1 soon, I'm wondering whether it is better to focus all of the effort there, while in v0 we simply say that "structured outputs won't work with spec decode".
I'm also not familiar with spec decode performance in v0, so I don't have much of a say here (I haven't explored spec decode deeply in v0).
cc @benchislett on this
We are using spec decode in v0, and we really need this feature because there are still many changes happening around it, such as the new guided backend.
I'm also fine with supporting a small subset of spec decode features and clearly stating which ones work with structured outputs, given that some of the spec decode APIs are still being worked on in v1.
Based on that, does this mean we're good to merge, or do you want something else from my end?
Hi team,
Hi team :)
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from b71c816 to 4e40c90
First of all, thank you very much for your hard work and contribution! I thought about this and discussed it with some other maintainers; we decided it would be best not to merge this. The reasons are roughly:
Folks are welcome to continue using this in custom builds in the meantime, but I hope it won't be too long before everything needed is supported in V1. Thank you again for the hard work, and I apologize that the PR has been in limbo for this long.
This PR was created by the Nebius team.
The main focus of this PR is to fix guided generation for speculative decoding. We found that when using the xGrammar backend with speculative decoding, vLLM crashes here. This PR addresses the issue by using a rollback mechanism in xGrammar.
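To illustrate the rollback idea, here is a minimal, self-contained sketch. It assumes an xGrammar-style matcher that exposes accept_token() and rollback(); the stub class and helper below are illustrative stand-ins, not the code added by this PR:

```python
class StubMatcher:
    """Stand-in for an xGrammar-style grammar matcher (illustrative only)."""

    def __init__(self):
        self.accepted: list[int] = []

    def accept_token(self, token_id: int) -> None:
        self.accepted.append(token_id)

    def rollback(self, num_tokens: int) -> None:
        # Undo the last `num_tokens` accepted tokens.
        if num_tokens:
            del self.accepted[-num_tokens:]


def advance_with_spec_decode(matcher, draft_tokens, num_accepted):
    # Feed the draft tokens to the grammar matcher so each draft position can
    # be constrained, then roll back the ones the target model rejected so the
    # matcher state matches the sequence that was actually kept.
    for tok in draft_tokens:
        matcher.accept_token(tok)
    rejected = len(draft_tokens) - num_accepted
    if rejected > 0:
        matcher.rollback(rejected)


matcher = StubMatcher()
advance_with_spec_decode(matcher, draft_tokens=[11, 42, 7], num_accepted=2)
print(matcher.accepted)  # -> [11, 42]
```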