
Conversation

@aarnphm
Collaborator

@aarnphm aarnphm commented Mar 25, 2025

This PR aims to bring support for jump-forward decoding to vLLM.
Jump-forward decoding is a technique where we prefill the next m tokens based on the grammar's machine state.

Let's say we have the following JSON grammar: {"nameID": ["value"]}, and the machine state is currently at {"

nameID can be produced through several possible tokenizations, each assigned a different probability by the LLM (see the sketch after this list):

n am e ID
na m e I D
nam e ID
...
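
For illustration, here is how one might inspect the canonical tokenization versus alternative splits that decode to the same text (a minimal sketch assuming a Hugging Face tokenizer; the model name is just the one benchmarked below):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")

# The tokenizer's own (canonical) split of the key.
canonical_ids = tok.encode("nameID", add_special_tokens=False)
print([tok.decode([i]) for i in canonical_ids])

# Any other split (e.g. "n" + "am" + "e" + "ID") decodes back to the same
# string, but the model assigns that token sequence a very different probability.
```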

Heuristically, one could fill in the longest token string into output_ids via tokenizer.decode. However, this will inadvertently affect the model's outputs. This phenomenon is often known as coalescence in structured generation.

Per the SGLang blog post, to ensure jump-forward decoding doesn't affect the sampling distribution, one can implement a retokenization strategy that introduces the least amount of interference.

Another implementation one can think of is that, for a token to be jumpable, its bitmask has to be unique, i.e. contain exactly one valid token. (In the first step above, n, na, and nam are all valid, so the bitmask contains multiple tokens and we cannot jump. For the following token e, however, it is the only valid entry in the bitmask, so we can fill it in directly; see the sketch below.)
However, single-token bitmasks are relatively rare, meaning in most cases we won't be able to utilize the jump mechanism here.
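
A minimal sketch of that check (hypothetical helper; xgrammar's actual bitmask is a packed tensor, simplified here to a boolean list over the vocabulary):

```python
def try_jump_token(allowed: list[bool]) -> int | None:
    """Return the only allowed token id if the bitmask contains exactly one
    valid token, otherwise None (multiple candidates such as "n"/"na"/"nam"
    mean the model has to sample this step)."""
    valid = [tok_id for tok_id, ok in enumerate(allowed) if ok]
    return valid[0] if len(valid) == 1 else None
```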

This PR includes an implementation of retokenization where we define a rollback_window to compare against and derive the list of tokens we can jump. Currently, this window is set to max_rollback_window = 10.
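
Roughly, the retokenization step looks like the following sketch (hypothetical helper names; the actual PR wires this into the structured-output manager and also has to handle the KV-cache rollback):

```python
MAX_ROLLBACK_WINDOW = 10

def retokenize_with_rollback(tokenizer, output_ids: list[int], jump_str: str,
                             window: int = MAX_ROLLBACK_WINDOW) -> tuple[int, list[int]]:
    """Re-encode the last `window` output tokens together with the jump-forward
    string, and report how many existing tokens must be rolled back plus the
    replacement token ids."""
    k = min(window, len(output_ids))
    tail_ids = output_ids[len(output_ids) - k:]
    new_ids = tokenizer.encode(tokenizer.decode(tail_ids) + jump_str,
                               add_special_tokens=False)
    # Keep the shared prefix; only the divergent suffix is replaced/appended.
    common = 0
    while common < k and common < len(new_ids) and new_ids[common] == tail_ids[common]:
        common += 1
    return k - common, new_ids[common:]
```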

I have only tested with r1-distill-qwen-32b (reasoning disabled) on 2 A100s, using the FlashInfer backend on v1, with the following command:

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --guided-decoding-backend xgrammar

benchmark command:

python benchmark_serving_structured_output.py --dataset xgrammar_bench --structured-output-backend xgrammar --structured-output-ratio 0.7 --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --request-rate 10 --backend vllm

initial results for

  • this branch:
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  105.27
Total input tokens:                      270636
Total generated tokens:                  71868
Request throughput (req/s):              9.50
Output token throughput (tok/s):         682.69
Total Token throughput (tok/s):          3253.53
---------------Time to First Token----------------
Mean TTFT (ms):                          82.47
Median TTFT (ms):                        78.53
P99 TTFT (ms):                           145.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.87
Median TPOT (ms):                        45.11
P99 TPOT (ms):                           49.27
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.02
Median ITL (ms):                         43.97
P99 ITL (ms):                            52.27
==================================================
correct_rate(%) 86.2
  • main branch:
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  105.22
Total input tokens:                      270636
Total generated tokens:                  71620
Request throughput (req/s):              9.50
Output token throughput (tok/s):         680.68
Total Token throughput (tok/s):          3252.80
---------------Time to First Token----------------
Mean TTFT (ms):                          108.63
Median TTFT (ms):                        83.07
P99 TTFT (ms):                           635.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          47.44
Median TPOT (ms):                        45.29
P99 TPOT (ms):                           96.95
---------------Inter-token Latency----------------
Mean ITL (ms):                           46.28
Median ITL (ms):                         44.26
P99 ITL (ms):                            101.32
==================================================
correct_rate(%) 85.7

We see a ~24% improvement in mean TTFT, which is neat.

Signed-off-by: Aaron Pham contact@aarnphm.xyz

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@aarnphm
Collaborator Author

aarnphm commented Mar 26, 2025

Will ping once it is ready; discussing with Yixin atm to clear up some confusion.

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Mar 27, 2025
@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from a1048e6 to 236830d Compare April 13, 2025 02:01
@mergify mergify bot removed the needs-rebase label Apr 13, 2025
aarnphm added 2 commits April 14, 2025 15:43
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from 236830d to a97b172 Compare April 17, 2025 19:04
@mergify

mergify bot commented Apr 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 17, 2025
…orward-structured-outputs

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@mergify

mergify bot commented Apr 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 24, 2025
aarnphm added 2 commits April 26, 2025 13:05
…orward-structured-outputs

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@mergify mergify bot removed the needs-rebase label Apr 26, 2025
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from 0ed5156 to ece4d4e Compare April 26, 2025 13:22
@aarnphm aarnphm requested a review from WoosukKwon April 29, 2025 11:37
Contributor

@mmoskal mmoskal left a comment


Generally, the re-tokenization works as follows:

  • assume the last two tokens in output are "h" and "ell"
  • the grammar wants string "hello", so it would normally force token "o"
  • but you retokenize the last few tokens and realize there is token "hello" at the end
  • so you remove the tokens "h" and "ell" from the KV cache and replace them with token "hello"
  • I'm not sure vLLM is set up for sequence editing (it wasn't last time I looked into this)

In other words, the re-tokenization doesn't typically affect the newly forced tokens so much as the previous tokens that are already part of the sequence.
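
For concreteness, the tail-rewrite step might look like this minimal sketch (hypothetical helper; it deliberately ignores the KV-cache editing, which is the hard part in vLLM):

```python
def splice_forced_string(tokenizer, tail_ids: list[int], forced: str) -> list[int]:
    """Re-encode the decoded tail plus the forced string.

    E.g. a tail of ["h", "ell"] with forced "o" may come back as the single
    token "hello"; the engine would then need to evict the old tail from the
    KV cache and append the new ids instead."""
    text = tokenizer.decode(tail_ids) + forced
    return tokenizer.encode(text, add_special_tokens=False)
```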

The Outlines blog post you're referencing is just stating the problem; it doesn't describe a solution. I think it's an explanation of why they're not doing jump forward. Correct handling of jump forward requires some notion of canonical tokenization, which that blog post doesn't mention. Models will not generate "h" "ell" "o" when they can generate "hello" as a single token. There is some regularization in training that may make them understand that these are in fact the same, but AFAIK it doesn't affect generation.

Here's an explanation of how to safely convert forced strings to forced tokens: https://github.com/guidance-ai/llguidance/blob/main/docs/fast_forward.md#safely-converting-ff-strings-to-ff-tokens

I would also suggest letting the grammar implementation provide forced tokens, not only forced bytes, at least as an option.
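
As a rough sketch of what that option could look like (a hypothetical interface, not an existing vLLM API; the string method is modeled on xgrammar's find_jump_forward_string, while the token variant is the suggested addition):

```python
from typing import Protocol

class ForcedOutputProvider(Protocol):
    def find_jump_forward_string(self) -> str:
        """Forced bytes: the longest string the grammar accepts unconditionally."""
        ...

    def find_jump_forward_tokens(self) -> list[int]:
        """Forced tokens: ids that are already safe to append as-is, so the
        engine does not need to retokenize or roll anything back."""
        ...
```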

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@mergify

mergify bot commented Apr 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 29, 2025
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@mergify mergify bot removed the needs-rebase label Apr 29, 2025
…orward-structured-outputs

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from d04901c to 7d26f48 Compare April 29, 2025 22:11
@mergify

mergify bot commented Apr 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 30, 2025
@mergify mergify bot removed the needs-rebase label Apr 30, 2025
@mergify

mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
@aarnphm aarnphm marked this pull request as draft May 7, 2025 18:38
@war5

war5 commented May 13, 2025

I would appreciate a forced-tokens option and not just a forced-bytes option. I can attempt to help as well; let me know.

@mergify mergify bot removed the needs-rebase label May 14, 2025
@mergify

mergify bot commented May 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 14, 2025
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from 873c485 to 1262acc Compare May 15, 2025 00:21
@mergify mergify bot removed the needs-rebase label May 15, 2025
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@aarnphm aarnphm force-pushed the feat/jump-forward-structured-outputs branch from 46db5c4 to 93cd93f Compare May 15, 2025 00:26
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
@mergify mergify bot added the qwen Related to Qwen models label Jun 19, 2025
@mergify

mergify bot commented Jun 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aarnphm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 23, 2025
