[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model #4632
Comments
I can work on this once the major refactor of the distributed code (#4591) has landed.
@cadedaniel I'm aware that #4933 is in progress, so I want to confirm that it's okay to go ahead with this.
@wooyeonlee0
Yep, my policy is to review PRs in the order that they're initially ready for review. Go ahead, @wooyeonlee0.
Thanks for the answer :)
Overview
Speculative decoding speeds up memory-bound LLMs by using a fast proposal method to propose tokens that the larger LLM then verifies in a single forward pass. Papers report a 2-3x speedup for bs=1; in Anyscale's fork we see up to a 2x speedup with a small draft model for bs=8 (30% for bs=16) (we can improve this! see #4630 if you want to help).
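For readers new to the idea, here is a minimal sketch of one propose-then-verify step, assuming greedy decoding and toy "models" that are plain Python functions. The names `speculative_step`, `draft`, and `target` are made up for illustration and are not vLLM APIs. A real system scores all proposed positions in one batched forward pass of the target model; the toy loops over them instead.

```python
from typing import Callable, List

# A toy "model": maps a token prefix to the next token (greedy decoding).
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Propose k tokens with the draft, keep those the target agrees with."""
    # 1. The draft proposes k tokens autoregressively (cheap but sequential).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target scores k + 1 positions. In a real system this is a single
    #    batched forward pass; here it is a loop for clarity.
    target_tokens = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept proposals for as long as they match the target's own choice.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_tokens[i]:
            break
        accepted.append(proposed[i])

    # 4. Always emit one target token: the correction after a mismatch,
    #    or a bonus token if every proposal was accepted.
    accepted.append(target_tokens[len(accepted)])
    return accepted


def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    # Agrees with the target except right after token 3.
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10
```

With these toy models, `speculative_step([0], toy_draft, toy_target, k=4)` emits the same tokens the target alone would have produced greedily, but the target only needs to run once per step rather than once per token.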
A key optimization for small draft models (in the 68m/160m range) is to run them at tensor-parallel degree 1, even if the target model uses tensor-parallel degree 4 or 8. In our fork, this reduces proposal time from 5ms/tok to 1.5ms/tok. This will allow a well-aligned 68m draft model to get a 2x per-user throughput improvement on a 70B target model.
Furthermore, a 1B/7B proposer model may ideally be placed on TP=2 or TP=4 while the larger model is placed on TP=8. vLLM should support these configurations so the community can use the configuration that works best for their draft model.
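To make the requested configurations concrete, here is a rough sketch of how a draft TP size could be validated and mapped onto ranks. Everything here is hypothetical: `SpecParallelConfig`, its fields, and the choice to host the draft model on the first `draft_tp` ranks of the target group are assumptions for illustration, not vLLM's actual config or placement policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpecParallelConfig:
    """Hypothetical config pairing a target TP size with a smaller draft TP size."""
    target_tp: int
    draft_tp: int = 1

    def __post_init__(self) -> None:
        if self.target_tp < 1 or self.draft_tp < 1:
            raise ValueError("tensor-parallel sizes must be >= 1")
        if self.draft_tp > self.target_tp:
            raise ValueError("draft_tp must not exceed target_tp")

    def draft_ranks(self) -> List[int]:
        # Assumption: the draft model lives on the first draft_tp target ranks.
        return list(range(self.draft_tp))

    def runs_draft(self, rank: int) -> bool:
        # Ranks outside the draft group skip the proposal phase.
        return rank < self.draft_tp
```

For example, `SpecParallelConfig(target_tp=8, draft_tp=2)` would cover the 1B/7B case above, while the default `draft_tp=1` covers the 68m/160m case.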
Design suggestions
I implemented a Worker in our fork which patches the tensor-parallel group to TP1. The code is dumped here. We should use this approach in vLLM; however, we can improve it by using @youkaichao's tensor-parallel group improvements.
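The general shape of the patching idea can be sketched as a context manager that temporarily swaps the process-wide tensor-parallel state while the draft model runs. This is only an illustration of the pattern, not the code from the fork: `_tp_ranks`, `init_tp_group`, `get_tp_world_size`, and `patched_tp_group` are hypothetical names, and a plain list of ranks stands in for a real communication group.

```python
from contextlib import contextmanager
from typing import Iterator, List

# Hypothetical stand-in for process-wide tensor-parallel state.
_tp_ranks: List[int] = [0]


def init_tp_group(ranks: List[int]) -> None:
    """Set the global tensor-parallel group (here just a list of ranks)."""
    global _tp_ranks
    _tp_ranks = list(ranks)


def get_tp_world_size() -> int:
    """What model code would query to decide how to shard its weights."""
    return len(_tp_ranks)


@contextmanager
def patched_tp_group(ranks: List[int]) -> Iterator[None]:
    """Temporarily replace the TP group, restoring it even on error."""
    global _tp_ranks
    saved = _tp_ranks
    _tp_ranks = list(ranks)
    try:
        yield
    finally:
        _tp_ranks = saved
```

A draft worker could then wrap its model load and forward calls in `with patched_tp_group([0]):` so that code querying `get_tp_world_size()` sees TP1, while the target model continues to see the full group outside the block.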