[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding #21862
Conversation
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
```diff
 # Serialization of np.ndarray is much more efficient than a tensor,
 # so we receive it in that format.
-grammar_bitmask = torch.from_numpy(grammar_bitmask)
+grammar_bitmask = torch.from_numpy(grammar_bitmask).contiguous()
```
This is just for sanity: could someone confirm that the above numpy-indexing logic with `torch.from_numpy` will always create a contiguous array and not a row-wise view?
afaik torch.from_numpy creates a view, so as long as the original numpy array is contiguous, it should be ok?
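Not part of the PR, just a quick sanity sketch of the behaviour discussed above, assuming standard NumPy/PyTorch semantics (the array names are illustrative): fancy row indexing in NumPy copies into a contiguous array, while a strided slice stays a view, and `torch.from_numpy` inherits whichever layout it is given.

```python
import numpy as np
import torch

arr = np.arange(12, dtype=np.int32).reshape(3, 4)

# Fancy (integer-array) indexing copies, so the result is C-contiguous,
# and torch.from_numpy inherits that contiguous layout.
picked = arr[[0, 2]]
print(picked.flags["C_CONTIGUOUS"])               # True
print(torch.from_numpy(picked).is_contiguous())   # True

# A strided slice is a non-contiguous view; from_numpy preserves the strides,
# so the resulting tensor is not contiguous either.
sliced = arr[:, ::2]
print(torch.from_numpy(sliced).is_contiguous())   # False
```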
Code Review
This pull request introduces performance improvements for guided decoding at high throughput by parallelizing fill_bitmask. The changes also include caching grammar.is_terminated() and optimizing an xGrammar function call. The performance gains demonstrated in the description are impressive.
My review has identified two critical correctness issues that need to be addressed:
- The logic for advancing the FSM state during speculative decoding appears to be incorrect, which could lead to invalid bitmasks.
- The newly introduced caching for the grammar's terminated state is not correctly updated when the FSM state is rolled back, which can cause incorrect behavior.
Addressing these issues is crucial for the stability and correctness of the new implementation.
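To illustrate the second point, here is a minimal, hypothetical sketch (not vLLM's actual classes) of caching the terminated flag: the cache has to be refreshed on every FSM state change, including rollback, or rejected speculative tokens leave it stale.

```python
class CachedGrammar:
    """Hypothetical wrapper around a matcher such as xgrammar's GrammarMatcher."""

    def __init__(self, matcher):
        self.matcher = matcher
        self._terminated = matcher.is_terminated()  # cache the initial state

    def accept_token(self, token_id: int) -> bool:
        accepted = self.matcher.accept_token(token_id)
        self._terminated = self.matcher.is_terminated()  # refresh after advancing
        return accepted

    def rollback(self, num_tokens: int) -> None:
        self.matcher.rollback(num_tokens)
        # The pitfall flagged above: omitting this refresh leaves a stale
        # "terminated" value after speculative tokens are rolled back.
        self._terminated = self.matcher.is_terminated()

    def is_terminated(self) -> bool:
        return self._terminated  # cheap cached read on the hot path
```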
One tiny comment about np; otherwise, we might want to do this for Guidance as well.
@aarnphm The parallelization code is independent of backend. I tested that it works for both xGrammar and Guidance, contributing a significant speedup to each. The termination caching trick is only for xGrammar, as I noticed it being slow there; I assume the other backends handle this internally in one way or another. We can add it in a follow-up if it still shows up as a slowdown.
LGTM, and the optimization is compelling. I do wish we had a throughput test for structured output; I don't think we exercise that anywhere in CI.
thank you!
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Purpose
At large batch sizes, `fill_bitmask` becomes a bottleneck and should be parallelized. Since the libraries we use (xGrammar, Guidance, Outlines-core) make low-level calls, they release the GIL and can be trivially parallelized using threads. This PR adds shortcut logic to dispatch the `fill_bitmask` calls to a thread pool when:

This PR also caches the `grammar.is_terminated()` value in the xGrammar backend, as it is accessed more often than it is changed.

Also, the xGrammar `apply_token_bitmask_inplace_torch_compile` function is not being compiled in some cases because the input indices are a `List[int]`. This is unavoidable in some cases, but when all requests in the batch are using structured outputs, we can skip this and use a slightly simpler path that will always compile properly.
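For illustration, a minimal sketch of the dispatch idea (hypothetical names and threshold, not the PR's actual code): because the backend's fill call releases the GIL, submitting the per-request fills to a thread pool gives real parallelism, with a serial fallback for small batches.

```python
from concurrent.futures import ThreadPoolExecutor

_MIN_PARALLEL_REQUESTS = 16                    # assumed cutoff; small batches stay serial
_executor = ThreadPoolExecutor(max_workers=8)  # assumed pool size

def fill_bitmasks(grammars, bitmask, indices):
    """Fill one bitmask row per structured-output request."""
    if len(indices) < _MIN_PARALLEL_REQUESTS:
        # Serial path: thread overhead would dominate for a handful of requests.
        for grammar, idx in zip(grammars, indices):
            grammar.fill_bitmask(bitmask, idx)
        return
    # Each fill_bitmask call drops into native code and releases the GIL,
    # so the worker threads genuinely overlap.
    futures = [
        _executor.submit(grammar.fill_bitmask, bitmask, idx)
        for grammar, idx in zip(grammars, indices)
    ]
    for future in futures:
        future.result()  # propagate any exception raised in a worker
```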
Test Plan
No additional testing is added, as no new functionality is introduced.
Perf Results
Benchmarks indicate a strong performance gain with small models and large batch sizes; internal experiments show a 1.5–1.8× speedup. Because of the fallback condition, other workloads should not experience any disruption.
Below are reference outputs using:
with thinking disabled.
Benchmark TL;DR:
- This PR
- This PR, parallel disabled
- Previous main
- Guided decoding disabled