Add support for Speculative Decoding #729
Comments
Yes, that's something that we want to explore.
What would be the best way to get this conversation moving? I have started looking at implementing this myself, but I am not yet sure how it would best fit together. For performance reasons it would be ideal if we could keep this step in the Rust code, and I suspect it would best be placed somewhere in the existing generation pipeline. Does this sound like a good idea?
Hey, thanks for proposing to contribute. Disclaimer: it's August, so a lot of the team members are off on vacation and I am myself handling various (too many, perhaps) projects, so dev bandwidth for adding this is rather low right now. Brainstorming the DX: as a user, I think I'd like to simply say:
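(The draft-model flag below is hypothetical, just to convey the shape of the UX; `--model-id` is the existing launcher flag, and the model names are only examples.)

```
text-generation-launcher --model-id meta-llama/Llama-2-70b-hf --draft-model-id meta-llama/Llama-2-7b-hf
```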
I think it's relatively simple and correctly conveys the intent. If we need more complex control, we'd have to break from the usual pure CLI flags and use some kind of config instead. Implementation-wise: as usual, the smaller the PR, the better.
I think keeping the exact same routes (`/generate`, `/generate_stream`) would be best, so the feature stays transparent to clients; see the client sketch below.
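For illustration, here is what a client call looks like today, assuming a TGI instance listening on `localhost:8080`; if the draft/verify step lives entirely server-side, none of this would need to change:

```python
import requests

# Client-side view of TGI: the /generate route and payload stay the same
# whether or not the server speculates with a draft model under the hood.
# Assumes a TGI instance is listening on localhost:8080.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is speculative decoding?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```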
This is great, but I think there might be more performance gains to be had elsewhere before getting to speculative decoding. In many cases I have seen, for LLMs larger than 3B the CPU is bottlenecking text generation (huggingface/transformers#24524). Introducing more layers without resolving the underlying bottleneck might not be fruitful.
@calvintwr we're not using `transformers` here.
@calvintwr, yes, this CPU bottleneck is the reason we often re-write the modelling code in TGI. Speculative decoding is our main priority for the next release.
I don't have the expertise to comment, but just in case: if this https://twitter.com/tianle_cai/status/1701272996328120408 is good enough to be a viable alternative to the "separate draft model" approach to speculative decoding, then perhaps it could inform some design decisions for the eventual form this API takes (even just the naming of parameters). Edit: actually, after more than a two-second skim of that Twitter thread, I think this is "baked into" the model/network itself? So maybe use of this technique can be opaque to TGI? (I probably shouldn't be commenting here; just adding newbie noise.)
Feature request
There is a new and interesting paper from Google Research that promises 2-3x speedups of LLM inference by running two models in parallel. The core idea is to use a faster, lower-quality model that approximates the target model to sample multiple tokens, and then to check those samples with the target model.
E.g. sample quickly from LLaMA 7B, then use LLaMA 70B to check the samples.
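As a rough illustration of the paper's idea (not TGI code), here is a minimal greedy sketch, assuming hypothetical `draft_model` / `target_model` callables that return logits:

```python
import torch

def speculative_step(target_model, draft_model, input_ids, k=4):
    """One speculative decoding step (greedy variant, for illustration only).

    `target_model` and `draft_model` are hypothetical callables mapping a
    (1, seq_len) tensor of token ids to logits of shape (1, seq_len, vocab);
    they stand in for e.g. LLaMA 70B and LLaMA 7B.
    """
    prompt_len = input_ids.shape[1]

    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    drafted = draft_ids[:, prompt_len:]  # the k proposed tokens

    # 2. The large target model scores the prompt plus all k drafted tokens
    #    in a single forward pass: this batching is where the speedup comes from.
    target_logits = target_model(draft_ids)
    # Target's greedy prediction at each drafted position.
    target_pred = target_logits[:, prompt_len - 1:-1, :].argmax(dim=-1)

    # 3. Keep drafted tokens while they match the target model's choice.
    n_accepted = 0
    for i in range(k):
        if drafted[0, i] != target_pred[0, i]:
            break
        n_accepted += 1

    if n_accepted == k:
        # Every draft token was accepted: take one extra "bonus" token from
        # the target model's prediction at the final position.
        extra = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
    else:
        # First mismatch: substitute the target model's own token there.
        extra = target_pred[:, n_accepted:n_accepted + 1]

    return torch.cat([input_ids, drafted[:, :n_accepted], extra], dim=-1)
```

With greedy decoding every accepted token is exactly what the target model would have produced, so output quality is unchanged; the paper's sampled variant instead preserves the target distribution via an accept/reject rule on the two models' probabilities.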
Motivation
Adding this kind of support would make LLM sampling much faster.
I have considered an alternative implementation: running two copies of TGI plus a new web server implementing speculative decoding in the same Kubernetes pod/VM/server. However, the added overhead of running HTTP between all of these containers would likely erase a significant portion of the inference-speed gains.
Core challenges:
Your contribution
Presuming the maintainers are happy with adding this feature, I would start work on implementing it. This would probably take the form of several PRs, as the change would be significant.