[V1][Experimental] Jump-forward decoding #15490
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Will ping once it is ready; discussing with Yixin at the moment to clear up some confusion.
This pull request has merge conflicts that must be resolved before it can be merged.
Generally, the re-tokenization works as follows:
- assume the last two tokens in output are "h" and "ell"
- the grammar wants string "hello", so it would normally force token "o"
- but you retokenize the last few tokens and realize there is token "hello" at the end
- so you remove the tokens "h" and "ell" from the KV cache and replace them with token "hello"
- I'm not sure vLLM is set up for sequence editing (it wasn't last time I looked into this)
In other words, the re-tokenization doesn't typically affect the newly forced tokens so much as the previous tokens that are already part of the sequence.
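To make that sequence-editing step concrete, here is a minimal sketch, using a hypothetical per-request data structure rather than vLLM's actual request/KV-cache code, of what rolling back and replacing the tail of a sequence would involve:

```python
from dataclasses import dataclass, field


@dataclass
class RequestState:
    """Hypothetical per-request state; vLLM's real structures differ."""
    output_token_ids: list[int] = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose KV-cache entries already exist

    def rollback(self, n: int) -> list[int]:
        """Drop the last n tokens; their KV-cache slots must be freed or
        overwritten before the next forward pass."""
        dropped = self.output_token_ids[len(self.output_token_ids) - n:]
        del self.output_token_ids[len(self.output_token_ids) - n:]
        self.num_computed_tokens -= n
        return dropped

    def append(self, token_ids: list[int]) -> None:
        self.output_token_ids.extend(token_ids)


# e.g. the ids for "h" and "ell" get rolled back (n=2) and replaced with the
# single id for "hello" once retokenization discovers the merge.
```

The hard part, as noted above, is that the engine has to support this kind of in-place sequence editing, including the corresponding KV-cache entries.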
The Outlines blog post you're referencing is just stating the problem, it doesn't describe a solution. I think it's an explanation of why they're not doing jump-forward. Correct handling of jump-forward requires some notion of canonical tokenization, which that blog post doesn't mention. Models will not generate "h" "ell" "o" when they can generate "hello" as a single token. There is some regularization in training that may make them understand that these are in fact the same, but AFAIK it doesn't affect generation.
Here's an explanation of how to safely convert a forced string to forced tokens: https://github.com/guidance-ai/llguidance/blob/main/docs/fast_forward.md#safely-converting-ff-strings-to-ff-tokens
I would also suggest letting the grammar implementation provide forced tokens, not only forced bytes, at least as an option.
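As an illustration of that suggestion, here is a rough sketch of what such an interface could look like; the names are hypothetical, not an existing vLLM or llguidance API:

```python
from typing import Protocol


class StructuredOutputGrammar(Protocol):
    """Hypothetical grammar-backend interface for jump-forward decoding."""

    def find_jump_forward_string(self) -> str:
        """Longest string forced by the grammar from the current state
        (may still require retokenization before it can be appended)."""
        ...

    def find_jump_forward_tokens(self) -> list[int]:
        """Forced token ids that are already safe to append, computed by the
        grammar library itself (cf. the fast-forward doc linked above)."""
        ...
```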
I would appreciate a forced-tokens option and not just a forced-bytes option. I can attempt to help as well, let me know.
This PR aims to bring support for jump-forward decoding to vLLM.
Jump-forward decoding is a technique where we prefill the next m tokens based on the grammar's machine state.

Let's say we have the following JSON grammar: {"nameID": ["value"]}, and the machine state is currently at {". The forced string nameID can be produced by several different token sequences (given by the LLM), e.g. n, na, nam, and so on.

Heuristically, one could fill in the longest token string into the output_ids from tokenizer.decode. However, this will inadvertently affect the model outputs. This phenomenon is often known as coalescence in structured generation.

From the SGLang blog post: to ensure jump-forward decoding doesn't affect the sampling distribution, one can implement a retokenization strategy that minimizes the amount of interference.
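As a small illustration of coalescence (using GPT-2's tokenizer purely as an example): the same text admits many tokenizations, and the model only ever sees the canonical one during training, so forcing a non-canonical split changes the conditional distribution for the tokens that follow.

```python
from transformers import AutoTokenizer  # example tokenizer; any BPE model works

tok = AutoTokenizer.from_pretrained("gpt2")

text = '{"nameID": ["value"]}'
canonical = tok.encode(text)                             # the merged tokenization the model would naturally produce
fragmented = [i for ch in text for i in tok.encode(ch)]  # one (or more) token ids per character

print(canonical)   # a short list of merged BPE tokens
print(fragmented)  # a much longer list: same text, different token ids
assert tok.decode(canonical) == tok.decode(fragmented)
```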
Another possible implementation is to only jump when the bitmask for the set of tokens we want to jump over is unique, i.e. it contains exactly one valid token. In the first case of n, na, nam, the bitmask for this pass contains multiple valid tokens, so we can't jump. However, for the next token e, it is the only valid token in the bitmask, so we can fill it in directly. Single-token bitmasks are relatively rare, though, meaning that in most cases we wouldn't be able to utilize the jump mechanism this way.
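A minimal sketch of that check, assuming a boolean per-token bitmask (the real grammar backends use packed bitmasks, so this is illustrative only):

```python
from typing import Optional

import torch


def try_single_token_jump(bitmask: torch.Tensor) -> Optional[int]:
    """bitmask: bool tensor of shape [vocab_size], True for grammar-valid tokens.

    If exactly one token is valid, the distribution over valid tokens is
    degenerate, so that token can be appended without sampling; otherwise we
    fall back to normal (masked) sampling.
    """
    allowed = torch.nonzero(bitmask, as_tuple=False).flatten()
    if allowed.numel() == 1:
        return int(allowed.item())  # safe to jump: the only valid token
    return None  # multiple valid tokens (e.g. n, na, nam, ...): must sample
```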
This PR includes an implementation of retokenization where we define a rollback window to compare against and obtain the list of tokens we can jump forward with. Currently, this window is set to max_rollback_window = 10.
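A minimal sketch of this retokenization step, assuming a Hugging Face-style tokenizer; the helper name and exact mechanics are illustrative, not the PR's actual code:

```python
def retokenize_with_rollback(tokenizer, output_ids: list[int], jump_str: str,
                             max_rollback_window: int = 10) -> tuple[int, list[int]]:
    """Re-encode the decoded tail of the sequence together with the jump string.

    Returns (num_tokens_to_roll_back, token_ids_to_append). The caller must
    also roll back the corresponding KV-cache entries before appending.
    """
    tail = output_ids[-max_rollback_window:] if max_rollback_window else []
    merged = tokenizer.encode(tokenizer.decode(tail) + jump_str,
                              add_special_tokens=False)
    # Keep the longest common prefix; everything after it diverged because of
    # coalescence and has to be replaced.
    k = 0
    while k < min(len(tail), len(merged)) and tail[k] == merged[k]:
        k += 1
    return len(tail) - k, merged[k:]
```

The rollback count tells the scheduler how many already-computed tokens (and their KV-cache entries) must be discarded before the new ids are appended.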
I have only tested with r1-distill-qwen-32b (reasoning disabled) on 2 A100s, with the flashinfer backend on V1, using the following command:
benchmark command:
initial results for
We see a ~24.08% improvement in mean TTFT, which is neat.
Signed-off-by: Aaron Pham contact@aarnphm.xyz