[WIP] Speculative decoding using a draft model #2188
I'm very fortunate to witness such great work from you. How is the current progress? May I use your method to accelerate the llama2 70b on vLLM for now? |
Hi @Lvjinhong. The current PR requires some work to get into a working state on the public vLLM repo. I will start on this later this week, but given the US holidays I expect to finish in early January. |
Very glad to see this work. I also have a wip version of speculative decoding and looking forward to using this feature. |
Created a plan to break this PR into separate pieces. Pending review from vLLM original authors, I will start on it this week. https://docs.google.com/document/d/1rE4pr3IdspRw97XbImY4fS9IWYuJJ3HGtL7AdIKGrw8/edit |
So from the little I understood, I think there is a way to avoid having a draft model for your vocabulary by directly using the model's n-grams? It should probably be the subject of a separate PR, but I think this is the way forward for everyone to easily enjoy the benefits of speculative decoding. What do you think, @cadedaniel? Do you think it's as straightforward as @simon-mo puts it?
|
Yep. This PR builds the framework for scoring and verifying draft tokens, independent of whether they come from a draft model or something like Medusa, or lookup ngrams like prompt lookup or RAG lookup. |
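To illustrate the n-gram (prompt lookup) idea mentioned above, here is a minimal sketch of a proposer that needs no draft model: it looks for the most recent earlier occurrence of the last few tokens and proposes whatever followed them. The function name `ngram_propose` and its parameters are hypothetical and chosen for illustration only; this is not the interface used in this PR.

```python
from typing import List


def ngram_propose(tokens: List[int], ngram_size: int = 2, k: int = 3) -> List[int]:
    """Hypothetical prompt-lookup proposer (illustrative, not vLLM's API).

    Finds the most recent earlier occurrence of the last `ngram_size` tokens
    and returns up to `k` tokens that followed that occurrence.
    Returns an empty list when there is no earlier match.
    """
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards, starting just before the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow_from = start + ngram_size
            return tokens[follow_from:follow_from + k]
    return []


# The sequence ends with [5, 6], which also appears at the start,
# so the tokens that followed it there become the draft proposal.
draft_tokens = ngram_propose([5, 6, 7, 8, 9, 5, 6], ngram_size=2, k=3)
print(draft_tokens)
```

Whatever produces the draft tokens, the target model still has to score and verify them, which is why the proposer can be swapped without touching the verification step.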
I think I understand a bit better now thank you, @cadedaniel , again for this incredible work! |
@cadedaniel Hi! Thanks for your great work! As I understand it, during the prefill stage, we are going to run the draft model once and the target model once. Will this increase the first token latency? |
Yes, the time to first token is a few milliseconds higher for a draft model of size 68m. It can be optimized in future versions, e.g. with Medusa/EAGLE where draft tokens are generated without independent kv cache. |
@cadedaniel Thanks for your reply! I also want to confirm my understanding regarding the decoding step's impact on first token latency:
I'm wondering if it helps to break the decoding step into 2 sub-steps, drafting and verifying. In this way, newly arriving prefill requests wait up to What do you think? |
@UranusSeven could you ask your question in a discussion post? happy to answer there |
Hi, thanks for the great work! Do you have any throughput (tokens per second) benchmarks? I'd like to use speculative decoding with DeepSeek's 33B and 1B models. Also, does this PR support a draft model with a different tokenizer than the main model (for example, Llama with a DeepSeek draft model)? |
Out of curiosity, does the proposal support separate deployment for the draft and target model? Asking because in production these two likely have different QPS and computing resource requirements. |
vllm-public/vllm/config.py", line 62
def __init__(self,
^
SyntaxError: '(' was never closed
cuda_graph_max_context_len: int = 5000,
cuda_graph_cache_size: int = 10,
flash_style: bool = False,
max_chunked_prefill_len: int = -1,
) ?
vllm-public/vllm/engine/async_llm_engine.py", line 7, in <module>
from vllm.anyscale.lora.utils import LoRARequest
ModuleNotFoundError: No module named 'vllm.anyscale'
Having the same issue. Any update on this?
Hi everyone, want to provide a quick update here:
|
Thank you for your update and your efforts!
On Tue, Feb 20, 2024 at 10:49 AM Cade Daniel wrote:
Hi everyone, want to provide a quick update here:
- Last few weeks I've prioritized optimizing Mixtral latency (#2913).
- Now I will focus on getting this merged full-time. I am aiming to finish merges by February with @LiuXiaoxuanPKU's help reviewing.
- After the correctness tests are merged, I will accept any optimizations that pass correctness tests. I will list out some major optimizations that people can take on (and already some tech discussions happening on MQA, cc @ymwangg @robertgshaw2-neuralmagic).
Thanks for the update! Looking forward to seeing how to incorporate our work once your PRs are out. |
Thanks @cadedaniel for truly inspiring work! 🙏 Would Speculative decoding work with vLLM's continuous batching as well? Would that be the step |
Yep! Once the e2e correctness tests pass you can use it with continuous batching. |
Hi @cadedaniel , |
Thanks! It would be helpful if you could test it out and add features or optimizations. Once #3951 is merged it will be correct but not yet fast; I'll post more about optimizing it this week. |
Great, I'll start tomorrow (CET) |
will |
Hi guys, |
Hi @HimanshuJanbandhu, I would also like to mention my recent work, S3D (https://arxiv.org/abs/2405.20314). It is very similar to the Self-Speculative Decoding work you mentioned: simple to implement and easy to integrate into existing stacks. Compared to Self-Spec, we achieve better efficiency in general, because our method combines layer skipping with multiple next-token generation/unmasking. Ours does require a bit of training, but it should be straightforward, much like training a Transformer encoder such as BERT. |
We welcome a self-speculative implementation! |
by the way: sorry for volunteering and never coming back to you. Unfortunately, I am and will be reaaally busy until beginning of August ✌🏻 But thank you very much for your efforts!!! |
Hello, Teacher, it is a great honor to witness your magnificent work. I have been studying your SpecDecode work recently, and I have a question that I hope you can guide me on. Why does SpecDecodeWorker inherit from LoraNotSupportedWorkerBase? Why doesn't the current SpecDecode support LoRA? @cadedaniel |
Hi @skylee-01 . See #6912 for the work required to add LoRA + spec decode. |
Thanks for the interest everyone! With #4630 (comment), all the work in this PR has been merged. |
Speculative decoding
This PR adds speculative decoding to vLLM using the draft model approach. This was first explored in these papers:
As a simplified overview, this type of speculative decoding runs a smaller draft model to guess what the larger target model will emit. The target model then verifies the guesses, and emits them if they pass verification. This yields a latency reduction as the verification of many tokens can happen in a single forward pass of the target model.
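The draft-then-verify loop described above can be sketched with toy models. This is a simplified illustration assuming greedy decoding, where a draft token passes verification only if it equals the target model's own greedy choice; all names (`speculative_step`, `toy_draft`, `toy_target`) are hypothetical and are not taken from this PR. In a real system the target scores all proposed positions in one batched forward pass; the per-position calls below only simulate that.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens to emit."""
    # 1. The draft model guesses k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    # 2. The target model checks each guess (one forward pass in practice).
    emitted: List[int] = []
    for i in range(k):
        expected = target(prefix + proposal[:i])
        emitted.append(expected)
        if proposal[i] != expected:
            # First mismatch: keep the target's token and drop the rest.
            return emitted
    # 3. All guesses accepted: the target also yields one extra token.
    emitted.append(target(prefix + proposal))
    return emitted


def speculative_generate(prefix: List[int], draft: NextToken,
                         target: NextToken, num_tokens: int, k: int) -> List[int]:
    new_tokens: List[int] = []
    while len(new_tokens) < num_tokens:
        new_tokens += speculative_step(prefix + new_tokens, draft, target, k)
    return new_tokens[:num_tokens]


def greedy_generate(prefix: List[int], target: NextToken, num_tokens: int) -> List[int]:
    new_tokens: List[int] = []
    for _ in range(num_tokens):
        new_tokens.append(target(prefix + new_tokens))
    return new_tokens


# Toy models: the target counts upward modulo 10; the draft agrees
# except that it guesses wrong whenever the last token is 3.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


spec_out = speculative_generate([0], toy_draft, toy_target, num_tokens=12, k=4)
base_out = greedy_generate([0], toy_target, num_tokens=12)
print(spec_out == base_out)
```

Under greedy verification the output matches target-only decoding token for token; the gain comes from emitting several tokens per target step whenever the draft guesses well.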
Running this at Anyscale, I see a 30-50% latency reduction depending on the draft model, target model and dataset.
Usage
The usage looks like this:
Feature list
Future work (not implemented)
This PR is marked as draft because nontrivial work is required to get it into a mergeable state.
Guide for reviewers
The following are key files for understanding this speculative decoding implementation: