[V1][Experimental] Jump-forward decoding #15490
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Will ping once it is ready; discussing with Yixin at the moment to clear up some confusion.
This pull request has merge conflicts that must be resolved before it can be merged.
Generally, the re-tokenization works as follows:
- assume the last two tokens in output are "h" and "ell"
- the grammar wants string "hello", so it would normally force token "o"
- but you retokenize the last few tokens and realize there is token "hello" at the end
- so you remove the tokens "h" and "ell" from the KV cache and replace them with token "hello"
- I'm not sure vLLM is set up for sequence editing (it wasn't last time I looked into this)
In other words, the re-tokenization doesn't typically affect the newly forced tokens so much as the previous tokens that are already part of the sequence.
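To make that sequence-editing step concrete, here is a minimal sketch, using a hypothetical per-request data structure rather than vLLM's actual request/KV-cache code, of what rolling back and replacing the tail of a sequence would involve:

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """Hypothetical per-request state; vLLM's real structures differ."""
    output_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose KV-cache entries already exist

    def rollback(self, n: int) -> list[int]:
        """Drop the last n tokens; their KV-cache slots must be freed or
        overwritten before the next forward pass."""
        dropped = self.output_token_ids[len(self.output_token_ids) - n:]
        del self.output_token_ids[len(self.output_token_ids) - n:]
        self.num_computed_tokens -= n
        return dropped

    def append(self, token_ids: list[int]) -> None:
        self.output_token_ids.extend(token_ids)


# e.g. the ids for "h" and "ell" get rolled back (n=2) and replaced with the
# single id for "hello" once retokenization discovers the merge.
```

The hard part, as noted above, is that the engine has to support this kind of in-place sequence editing, including the corresponding KV-cache entries.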
The Outlines blog post you're referencing is just stating the problem, it doesn't describe a solution. I think it's an explanation of why they're not doing jump-forward. Correct handling of jump-forward requires some notion of canonical tokenization, which that blog post doesn't mention. Models will not generate "h" "ell" "o" when they can generate "hello" as a single token. There is some regularization in training that may make them understand that these are in fact the same, but AFAIK it doesn't affect generation.
Here's an explanation of how to safely convert a forced string to forced tokens: https://github.com/guidance-ai/llguidance/blob/main/docs/fast_forward.md#safely-converting-ff-strings-to-ff-tokens
I would also suggest letting the grammar implementation provide forced tokens, not only forced bytes, at least as an option.
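As an illustration of that suggestion, here is a rough sketch of what such an interface could look like; the names are hypothetical, not an existing vLLM or llguidance API:

```python
from typing import Protocol


class StructuredOutputGrammar(Protocol):
    """Hypothetical grammar-backend interface for jump-forward decoding."""

    def find_jump_forward_string(self) -> str:
        """Longest string forced by the grammar from the current state
        (may still require retokenization before it can be appended)."""
        ...

    def find_jump_forward_tokens(self) -> list[int]:
        """Forced token ids that are already safe to append, computed by the
        grammar library itself (cf. the fast-forward doc linked above)."""
        ...
```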
I would appreciate a forced-tokens option and not just a forced-bytes option. I can attempt to help as well, let me know.
This PR aims to bring support for jump-forward decoding to vLLM.
Jump-forward decoding is a technique where we prefill the next m tokens based on the grammar's machine state.

Let's say we have the following JSON grammar: {"nameID": ["value"]}, and the machine state is currently at {". The forced string nameID can be produced by several different token sequences (given by the LLM), e.g. n, na, nam, and so on.

Heuristically, one could fill in the longest token string into the output_ids from tokenizer.decode. However, this will inadvertently affect the model outputs. This phenomenon is often known as coalescence in structured generation.

From the SGLang blog post: to ensure jump-forward decoding doesn't affect the sampling distribution, one can implement a retokenization strategy that minimizes the amount of interference.
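As a small illustration of coalescence (using GPT-2's tokenizer purely as an example): the same text admits many tokenizations, and the model only ever sees the canonical one during training, so forcing a non-canonical split changes the conditional distribution for the tokens that follow.

```python
from transformers import AutoTokenizer  # example tokenizer; any BPE model works

tok = AutoTokenizer.from_pretrained("gpt2")

text = '{"nameID": ["value"]}'
canonical = tok.encode(text)                             # the merged tokenization the model would naturally produce
fragmented = [i for ch in text for i in tok.encode(ch)]  # one (or more) token ids per character

print(canonical)   # a short list of merged BPE tokens
print(fragmented)  # a much longer list: same text, different token ids
assert tok.decode(canonical) == tok.decode(fragmented)
```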
Another possible implementation is to only jump when the bitmask for the set of tokens we want to jump over is unique, i.e. it contains exactly one valid token. In the first case of n, na, nam, the bitmask for this pass contains multiple valid tokens, so we can't jump. However, for the next token e, it is the only valid token in the bitmask, so we can fill it in directly. Single-token bitmasks are relatively rare, though, meaning that in most cases we wouldn't be able to utilize the jump mechanism this way.
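A minimal sketch of that check, assuming a boolean per-token bitmask (the real grammar backends use packed bitmasks, so this is illustrative only):

```python
from typing import Optional

import torch


def try_single_token_jump(bitmask: torch.Tensor) -> Optional[int]:
    """bitmask: bool tensor of shape [vocab_size], True for grammar-valid tokens.

    If exactly one token is valid, the distribution over valid tokens is
    degenerate, so that token can be appended without sampling; otherwise we
    fall back to normal (masked) sampling.
    """
    allowed = torch.nonzero(bitmask, as_tuple=False).flatten()
    if allowed.numel() == 1:
        return int(allowed.item())  # safe to jump: the only valid token
    return None  # multiple valid tokens (e.g. n, na, nam, ...): must sample
```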
This PR includes an implementation of retokenization where we define a rollback window to compare against and obtain the list of tokens we can jump forward with. Currently, this window is set to max_rollback_window = 10.
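A minimal sketch of this retokenization step, assuming a Hugging Face-style tokenizer; the helper name and exact mechanics are illustrative, not the PR's actual code:

```python
def retokenize_with_rollback(tokenizer, output_ids: list[int], jump_str: str,
                             max_rollback_window: int = 10) -> tuple[int, list[int]]:
    """Re-encode the decoded tail of the sequence together with the jump string.

    Returns (num_tokens_to_roll_back, token_ids_to_append). The caller must
    also roll back the corresponding KV-cache entries before appending.
    """
    tail = output_ids[-max_rollback_window:] if max_rollback_window else []
    merged = tokenizer.encode(tokenizer.decode(tail) + jump_str,
                              add_special_tokens=False)
    # Keep the longest common prefix; everything after it diverged because of
    # coalescence and has to be replaced.
    k = 0
    while k < min(len(tail), len(merged)) and tail[k] == merged[k]:
        k += 1
    return len(tail) - k, merged[k:]
```

The rollback count tells the scheduler how many already-computed tokens (and their KV-cache entries) must be discarded before the new ids are appended.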
I have only tested with r1-distill-qwen-32b (reasoning disabled) on 2 A100s, with the flashinfer backend on V1, using the following command:
benchmark command:
initial results for
We see a ~24.08% improvement in mean TTFT, which is neat.
Signed-off-by: Aaron Pham contact@aarnphm.xyz