llama : lookahead decoding example #4157
How does Medusa differ from this method?
I believe (correct me if I'm wrong) that it doesn't require extra training of the model.
The blog link actually mentions Medusa specifically and then talks about how their approach is different.
It certainly seems a little faster to me: from 30 t/s to 40 t/s (about a 1.33x speedup) on the LLaMA2-7B-chat example.
Also, lookahead decoding (LADE) seems to be constrained by the FLOPS available on consumer GPUs. I'm not sure how this translates to CPU/RAM requirements, but whether LADE actually delivers a speedup seems to depend on how powerful your hardware is and whether the LADE parameters are tuned for it.
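One rough way to see the hardware dependence: plain greedy decoding scores 1 new position per step, while lookahead decoding scores a whole batch of extra positions per step. Below is a minimal sketch of an approximate per-step position count, assuming the parameters W (window size), N (n-gram size), and G (number of verification candidates) as named in the blog post; the exact counts in the reference implementation differ slightly, so treat this as a back-of-the-envelope model only:

```python
def lookahead_positions_per_step(W: int, N: int, G: int) -> int:
    """Rough count of positions scored per decoding step (vs. 1 for greedy)."""
    lookahead_branch = W * (N - 1)      # (N - 1) levels of W guess tokens
    verification_branch = G * (N - 1)   # G candidate n-grams, N - 1 tokens each
    return lookahead_branch + verification_branch

# e.g. W=5, N=4, G=5 -> 30 positions per step instead of 1; the method only
# pays off if the hardware can process that batch at close to the latency
# of a single-token step
print(lookahead_positions_per_step(5, 4, 5))
```

This is why a GPU with spare FLOPS can absorb the larger batch "for free", while weaker hardware sees the per-step latency grow and the net speedup shrink.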
Example in #4207
Claims to provide a 1.5~2x decoding speedup without a speculative (draft) model.
Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
Twitter thread: https://twitter.com/lmsysorg/status/1727056892671950887
Reference implementation: https://github.com/hao-ai-lab/LookaheadDecoding/tree/main
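For reference, here is a minimal, self-contained sketch of the core idea: a lookahead branch that refines a window of guess tokens Jacobi-style, an n-gram pool harvested from that window, and greedy verification of pooled candidates. It uses a toy deterministic stand-in for the model and plain per-position calls; the real method fuses all branch positions into a single batched forward pass with a custom attention mask, so this illustrates the control flow only, not the implementation in #4207 or the reference repo:

```python
import random
from collections import defaultdict

VOCAB = 32
W, N, G = 4, 3, 4   # window size, n-gram size, max verification candidates

def toy_model(ctx):
    """Stand-in for an LLM's greedy next token: deterministic in the context."""
    random.seed(hash(tuple(ctx[-4:])))
    return random.randrange(VOCAB)

def lookahead_generate(prompt, n_new):
    out = list(prompt)
    # lookahead window: (N - 1) levels x W guess tokens, randomly initialized
    window = [[random.randrange(VOCAB) for _ in range(W)] for _ in range(N - 1)]
    pool = defaultdict(list)  # n-gram pool: first token -> candidate tails

    while len(out) - len(prompt) < n_new:
        cands = pool[out[-1]][:G]  # verification branch for the last token

        # lookahead branch: one Jacobi-style step refines every window column
        # and harvests one N-gram per column into the pool
        new_level = []
        for j in range(W):
            guess = [window[i][j] for i in range(N - 1)]
            new_level.append(toy_model(out + guess))
            pool[window[0][j]].append(guess[1:] + [new_level[j]])

        # greedy verification: keep the longest candidate prefix that matches
        # what greedy decoding would have produced anyway
        best = []
        for tail in cands:
            ctx, acc = list(out), []
            for tok in tail:
                if toy_model(ctx) != tok:
                    break
                acc.append(tok)
                ctx.append(tok)
            best = max(best, acc, key=len)

        out += best + [toy_model(out + best)]  # always gains >= 1 token
        window = window[1:] + [new_level]      # shift levels up, refill bottom

    return out[len(prompt):len(prompt) + n_new]

print(lookahead_generate([1, 2, 3], 16))
```

Because verification only accepts tokens that match the greedy output, the generated sequence is identical to plain greedy decoding; the speedup comes from accepting several pooled tokens in a single step when a candidate n-gram hits.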