
llama : lookahead decoding example #4157

Closed
wsxiaoys opened this issue Nov 21, 2023 · 6 comments
@wsxiaoys
Contributor

wsxiaoys commented Nov 21, 2023

Claims to provide a 1.5x to 2x decoding speedup without a speculative (draft) model.

Blog post: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
Twitter thread: https://twitter.com/lmsysorg/status/1727056892671950887
Reference implementation: https://github.com/hao-ai-lab/LookaheadDecoding/tree/main

@wsxiaoys wsxiaoys added the enhancement New feature or request label Nov 21, 2023
@bobqianic
Contributor

How does Medusa differ from this method?
https://github.com/FasterDecoding/Medusa

@someone13574

How does Medusa differ from this method? https://github.com/FasterDecoding/Medusa

I believe (correct me if I'm wrong) that lookahead decoding, unlike Medusa, doesn't require extra training of the model.

@KerfuffleV2
Collaborator

How does Medusa differ from this method?

The blog link actually mentions Medusa specifically and then talks about how their approach is different.

@SlyEcho
Collaborator

SlyEcho commented Nov 22, 2023

It certainly seems a little faster to me: from 30 t/s to 40 t/s on the LLaMA2-7B-chat example.

@ggerganov ggerganov changed the title from "Lookahead decoding example" to "llama : lookahead decoding example" Nov 23, 2023
@ggerganov ggerganov self-assigned this Nov 23, 2023
@ggerganov ggerganov moved this to In Progress in ggml : roadmap Nov 23, 2023
@shermansiu

Also, lookahead decoding (LADE) seems to be constrained by the number of FLOPS available on consumer GPUs. I'm not sure how this will translate to CPU/RAM requirements, but whether LADE improves performance seems to depend on how powerful your hardware is and whether the LADE parameters are tuned for it.

huggingface/transformers#27649 (comment)

@ggerganov
Owner

Example in #4207

@ggerganov ggerganov moved this from In Progress to Done in ggml : roadmap Nov 26, 2023