[V0][Sample] Improve sample performance when using guided decoding #14962
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Sorry, can you merge from latest main to fix the pre-commit failures?
OK, I have fixed the pre-commit failures.
All commit messages also need to be signed off.

About the change -- the results are impressive, but I'm a bit concerned about what we'd do with this for V1. As written, this will only work with the feature for V0, but we're trying to focus our enhancements on V1 as much as possible. Have you compared this to V1? That would be good as another column in your comparison. In other words, does V0 + this enhancement beat V1 with structured output in use? Or do the other enhancements in V1 already make V1 faster without this optimization in place?
This pull request has merge conflicts that must be resolved before it can be merged.
@cjsdurj did you mean to close this?
Signed-off-by: cj <2465188464@qq.com>
Because it changes too many Python files, it currently only works on V0 and the Qwen/Llama models as a preview feature.
If this works OK, I will implement this feature in the V1 engine for more models in subsequent submissions.
It will not work with V1 by design right now. In V1, advancing the grammar's FSM and applying the bitmask are in separate processes. I'd like to see more performance numbers, in particular with large batches of requests and not just a single request. I'll do some benchmarking at some point if you don't have the hardware for it (I'll want to see some H100 results).
#17084 removed the sampler from the model, so this PR needs a rebase.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @cjsdurj, may I ask how you produced the before throughput of 2 tk/s and the after throughput of 136 tk/s? I'm using lk-chen#2 on an L40S, forcing vLLM V0, model=Qwen/Qwen2.5-1.5B-Instruct, async mode, and got:
Since this only applies to V0 as written, I'm going to close this out for now. V0 is deprecated and we're not adding major changes at this point. I think significant changes are necessary to come up with something that works with V1. Please feel free to reopen this or open a new PR if you come up with an approach that works in V1. Thank you!
Introduction
This PR introduces a new sampling algorithm (SampleV2) to improve guided decoding performance.
It currently supports the xgrammar backend and the Qwen and Llama models; refer to the interface `vllm.model_executor.models.interfaces.SupportsSampleV2`.
This PR increases EBNF guided decoding throughput by more than 1000%.
Background
In my test case, the grammar is shown below:
When vLLM serves a Qwen 72B AWQ 4-bit model on 4x L40S GPUs, the decode speed is 50 tokens/s, but with guided_grammar applied it drops to 2.5 tokens/s.
I optimized the grammar as shown below, and it then outputs 12 tokens/s:
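For context, a guided-decoding request of the kind measured above can be reproduced roughly as follows. This is only a minimal sketch: the EBNF grammar is an illustrative placeholder (not the grammar from this report), and the model is the small Qwen checkpoint mentioned elsewhere in this thread rather than the 72B AWQ model used for the numbers.

```python
# Illustrative only: minimal offline guided (grammar-constrained) decoding in vLLM.
# The grammar and model below are placeholders, not the ones benchmarked above.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# A tiny xgrammar-style EBNF grammar that restricts output to "yes" or "no".
toy_grammar = 'root ::= "yes" | "no"'

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # example model, not the 72B AWQ one
params = SamplingParams(
    max_tokens=8,
    guided_decoding=GuidedDecodingParams(grammar=toy_grammar),
)
outputs = llm.generate(["Is the sky blue? Answer yes or no."], params)
print(outputs[0].outputs[0].text)
```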
How it works
The new sampling method (SampleV2) consists of the following steps (see the sketch after this list):
a. Compute logits from the hidden states.
b. Sample the batched output without the grammar guide.
c. Let the FSM try to accept the output tokens; if it accepts, return directly (FillNextTokenBitmask is never called).
d. If the FSM cannot accept, apply the grammar guide to the logits and resample.
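A minimal sketch of that control flow, assuming hypothetical helper names (`lm_head`, `sampler`, `matcher.accept_token`, `matcher.fill_next_token_bitmask`); the PR implements this inside the model's sampler and the real xgrammar API differs in details, but the steps a-d map onto the code below.

```python
import torch

def sample_v2(hidden_states, matchers, sampler, lm_head):
    """Sketch of the SampleV2 flow (steps a-d above).

    `matchers` are per-request grammar FSMs (xgrammar GrammarMatcher-like
    objects); the names here are illustrative, not the PR's actual API.
    """
    # a. Compute logits from the hidden states.
    logits = lm_head(hidden_states)            # [batch, vocab]

    # b. Sample the whole batch once *without* applying the grammar mask.
    tokens = sampler(logits)                   # [batch]

    for i, matcher in enumerate(matchers):
        # c. Ask the FSM to accept the unconstrained token. If it accepts,
        #    this request is done and the expensive FillNextTokenBitmask
        #    call is skipped entirely.
        if matcher.accept_token(int(tokens[i])):
            continue

        # d. Otherwise fall back to the usual guided path: build the bitmask,
        #    mask the logits, and resample just this request.
        allowed = matcher.fill_next_token_bitmask()        # bool mask over vocab
        masked = logits[i].masked_fill(~allowed, float("-inf"))
        tokens[i] = sampler(masked.unsqueeze(0))[0]
        matcher.accept_token(int(tokens[i]))               # advance the FSM

    return tokens
```

The win comes from step c: for well-behaved grammars the unconstrained sample is usually already valid, so the bitmask construction and masked resample only run on the rare rejections.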
Test
Single-client test result (nearly no overhead when guided_grammar is applied):
Currently this PR only supports the xgrammar backend and models such as Qwen2 and Llama. More models can be supported by extending the `SupportsSampleV2` interface, as sketched below.
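As a rough illustration of how another model could opt in, here is a sketch assuming the interface is a runtime-checkable protocol with a `sample_v2` method; the actual interface in `vllm.model_executor.models.interfaces.SupportsSampleV2` may use a different signature.

```python
# Hypothetical sketch of opting a model in to the SampleV2 path.
# Names and signatures are assumptions, not the PR's exact API.
from typing import ClassVar, Literal, Protocol, runtime_checkable

import torch


@runtime_checkable
class SupportsSampleV2(Protocol):
    """Models implementing the grammar-aware SampleV2 sampling path."""

    supports_sample_v2: ClassVar[Literal[True]] = True

    def sample_v2(self, hidden_states: torch.Tensor,
                  sampling_metadata) -> torch.Tensor:
        ...


class MyQwenLikeModel:
    """Example model opting in (implementation elided)."""

    supports_sample_v2: ClassVar[Literal[True]] = True

    def sample_v2(self, hidden_states, sampling_metadata):
        # Steps a-d from "How it works": sample unconstrained first, then
        # fall back to bitmask + resample only when the FSM rejects a token.
        raise NotImplementedError("model-specific implementation")
```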