
Conversation


@cjsdurj cjsdurj commented Mar 17, 2025

Introduction

This PR introduces a new sampling algorithm (SampleV2) to improve guided-decoding performance.

It currently supports the xgrammar backend and the Qwen and Llama models.

Refer to the interface vllm.model_executor.models.interfaces.SupportsSampleV2.

In my tests, this PR increased EBNF guided-decoding throughput by more than 1000%.

Background

  1. Guided-decoding backends like xgrammar and Outlines only optimize the speed of JSON guided decoding; when other GBNF/EBNF grammars are used, token throughput can be very slow.

In my test case, the grammar is as follows:

root        ::= en-char+ ([ \t\n] en-char+)*
en-char     ::= letter | digit | punctuation
letter      ::= [a-zA-Z]
digit       ::= [0-9]
punctuation ::= [!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~]

When vLLM serves a Qwen 72B AWQ 4-bit model on 4x L40S GPUs, the decode speed is 50 tokens/s, but with guided_grammar applied it drops to 2.5 tokens/s.

After I rewrote the grammar as shown below, it reached 12 tokens/s:

root        ::= (en-char+ [ \t\n])*  en-char+
en-char     ::= letter | digit | punctuation
letter      ::= [a-zA-Z]
digit       ::= [0-9]
punctuation ::= [!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~]
  2. The bottleneck of guided decoding is matcher.FillNextTokenBitmask (see the issue "[performance] EBNF grammar" mlc-ai/xgrammar#235). Previously, FillNextTokenBitmask was called at every decode step, so this PR aims to reduce the number of calls to matcher.FillNextTokenBitmask; a sketch of the per-step pattern it avoids is shown below.
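
For context, this is roughly the conventional per-step pattern that makes FillNextTokenBitmask the bottleneck. It is an illustrative sketch, not vLLM's actual code; the xgrammar Python API names used here reflect my understanding and may differ between versions.

    import torch
    import xgrammar as xgr

    def constrained_decode_step(matcher: xgr.GrammarMatcher,
                                logits: torch.Tensor,
                                vocab_size: int) -> int:
        # The expensive call: executed on every decode step, even when the
        # unconstrained sample would already satisfy the grammar.
        bitmask = xgr.allocate_token_bitmask(1, vocab_size)
        matcher.fill_next_token_bitmask(bitmask)
        xgr.apply_token_bitmask_inplace(logits, bitmask)

        # greedy choice for simplicity; real sampling applies temperature etc.
        token_id = int(torch.argmax(logits, dim=-1))
        matcher.accept_token(token_id)
        return token_id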

How it works

    def samplev2(
            self,
            logits: torch.Tensor,
            sampling_metadata: SamplingMetadata,
    ) -> Optional[SamplerOutput]:
        # first round: sample without applying the grammar bitmask
        next_tokens: SamplerOutput = self.sampler(logits, sampling_metadata)

        # check whether the sampled tokens are accepted by the grammar FSMs
        tks = torch.tensor([o.samples[0].output_token for o in next_tokens.outputs])
        accepted = accept_grammar(tks, sampling_metadata)
        need_resample = torch.logical_not(accepted)
        if accepted.all():
            # fast path: no FillNextTokenBitmask call was needed
            return next_tokens

        # second round: for rejected rows, apply the grammar bitmask to the logits
        # (the logits processors run only where need_resample is True), then resample
        logits = _apply_logits_processors(logits, sampling_metadata, need_resample, False)
        new_next_tokens: SamplerOutput = self.sampler(logits, sampling_metadata)

        for i, replace in enumerate(need_resample.tolist()):
            if replace:
                next_tokens.outputs[i] = new_next_tokens.outputs[i]

        tks = torch.tensor([o.samples[0].output_token for o in next_tokens.outputs])
        # the matchers only advance on the new tokens for rows that were resampled
        accepted = accept_grammar(tks, sampling_metadata, need_resample)
        assert accepted.all()
        return next_tokens

The new sampling method (samplev2) consists of the following steps:

a. compute the logits from hidden_states.
b. sample the batched output without the grammar guide.
c. let the FSM try to accept the output tokens; if every token is accepted, return directly (FillNextTokenBitmask is never called).
d. if the FSM cannot accept a token, apply the grammar bitmask to the logits for that row and resample; a sketch of a possible accept_grammar helper follows this list.
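
For illustration only, here is a minimal sketch of what the accept_grammar helper used above could look like. The actual helper in this PR is not reproduced here, and get_grammar_matcher is a hypothetical lookup standing in for however the per-request matchers are attached to sampling_metadata.

    def accept_grammar(tokens: torch.Tensor,
                       sampling_metadata: SamplingMetadata,
                       mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # Returns a bool tensor: True where the request's grammar matcher accepts
        # (and advances on) the sampled token, False where a resample is needed.
        accepted = torch.ones(len(tokens), dtype=torch.bool)
        for i, token_id in enumerate(tokens.tolist()):
            if mask is not None and not bool(mask[i]):
                # this row was already accepted in the first round; skip it
                continue
            matcher = get_grammar_matcher(sampling_metadata, i)  # hypothetical lookup
            if matcher is None:
                # this request has no guided-decoding constraint
                continue
            accepted[i] = matcher.accept_token(token_id)
        return accepted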

Test

ebnfstr = '''
root        ::= en-char+ ([ \t\n] en-char+)*
en-char     ::= letter | digit | punctuation
letter      ::= [a-zA-Z]
digit       ::= [0-9]
punctuation ::= [!"#$%&'()*+,-./:;<=>?@[\\\]^_`{|}~]
'''



payload = {
        "messages": [
            {
                "content": "tell a story about Spring ,at least 1024 words",
                "role": "user"
            }
        ],
        "max_tokens": 1024,
        "model": "llama2",
        "stream": False,
        "guided_grammar":  ebnfstr
    }
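
For reference, the payload above can be sent to a locally running vLLM OpenAI-compatible server roughly as follows (the URL and port are assumptions, not part of this PR):

    import requests

    # assumes a server started with something like:
    #   python -m vllm.entrypoints.openai.api_server --model <model> --port 8000
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])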

Single-client test results (applying guided_grammar adds almost no cost):

| model | not guided (tokens/s) | before (tokens/s) | this PR (tokens/s) |
|---|---|---|---|
| Qwen2.5 1.5B FP16, 1x L40S | 140 | 2 | 136 |
| Qwen2.5 72B AWQ4, 4x L40S | 50 | 2 | 48 |

Currently this PR only supports the xgrammar backend and models such as Qwen2 and Llama. More models can be supported by extending the SupportsSampleV2 interface.

@cjsdurj cjsdurj requested review from mgoin and russellb as code owners March 17, 2025 14:28
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@DarkLight1337
Member

Sorry, can you merge from latest main to fix the pre-commit failures?

@mergify mergify bot added the ci/build label Mar 21, 2025
@cjsdurj cjsdurj changed the title [draft][ sample ] improve sample performance when using guide decoding [ V0 ][ sample ] improve sample performance when using guide decoding Mar 21, 2025
@cjsdurj
Author

cjsdurj commented Mar 21, 2025

> Sorry, can you merge from latest main to fix the pre-commit failures?

OK, I have fixed the pre-commit failures.

@russellb
Member

> Sorry, can you merge from latest main to fix the pre-commit failures?

> OK, I have fixed the pre-commit failures.

All commit messages also need Signed-off-by headers to make the DCO check pass.

About the change -- the results are impressive, but I'm a bit concerned about what we'd do with this for V1. As written, this will only work with the feature for V0, but we're trying to focus our enhancements on V1 as much as possible.

Have you compared this to V1? That would be good as another column in your comparison. In other words, does V0 + this enhancement beat V1 with structured output in use? Or do the other enhancements in V1 already make V1 faster without this optimization in place?

@mergify

mergify bot commented Mar 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cjsdurj.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb
Member

@cjsdurj did you mean to close this?

@cjsdurj cjsdurj reopened this Mar 22, 2025
@mergify mergify bot removed the needs-rebase label Mar 22, 2025

@cjsdurj
Author

cjsdurj commented Mar 22, 2025

> @cjsdurj did you mean to close this?

Sorry, I made a mistake last night while addressing the DCO check.

  • This feature also works on V1, and I am working on it.
  • I have tested the V1 engine; its GBNF decode performance is as slow as V0's.

Because it changes too many Python files, this currently only works on V0 with the Qwen and Llama models, as a preview feature.

If this works OK, I will implement this feature in the V1 engine for more models in subsequent submissions.

@russellb
Member

> @cjsdurj did you mean to close this?
>
> Sorry, I made a mistake last night while addressing the DCO check.
>
> • This feature also works on V1, and I am working on it.
> • I have tested the V1 engine; its GBNF decode performance is as slow as V0's.
>
> Because it changes too many Python files, this currently only works on V0 with the Qwen and Llama models, as a preview feature.
>
> If this works OK, I will implement this feature in the V1 engine for more models in subsequent submissions.

It will not work with V1 by design right now. In V1, advancing the grammar's FSM and applying the bitmask are in separate processes.

I'd like to see more performance numbers, in particular with large batches of requests and not just a single request. I'll do some benchmarking at some point if you don't have the hardware for it (I'll want to see some H100 results).

@lk-chen
Collaborator

lk-chen commented Apr 29, 2025

#17084 removed the sampler from the model, so this PR needs a rebase.

Let me see if I can help cjsdurj#1


@mergify

mergify bot commented Apr 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @cjsdurj.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@lk-chen
Collaborator

lk-chen commented Apr 30, 2025

Hi @cjsdurj, may I ask how you produced the before throughput of 2 tk/s and the after throughput of 136 tk/s?

I'm using lk-chen#2 on L40S, forcing vLLM V0, model=Qwen/Qwen2.5-1.5B-Instruct, async mode, and got:

| num_prompts | output throughput before this PR (tk/s) | output throughput after this PR (tk/s) |
|---|---|---|
| 1 | 12.94 | 12.95 |
| 100 | 69.41 | 79.33 |

@russellb
Member

russellb commented May 5, 2025

Since this only applies to V0 as written, I'm going to close this out for now. V0 is deprecated and we're not adding major changes at this point. I think significant changes are necessary to come up with something that works with V1. Please feel free to reopen this or open a new PR if you come up with an approach that works in V1. Thank you!
