Add support for prompt-lookup speculative decoding #2469
Comments
@cadedaniel is in charge of adding overall support for speculative decoding here: #2188. I would imagine that after this PR, ngram support should be very straightforward.
@simon-mo Thanks for letting me know!
Like always, it's a bit more complicated than I initially anticipated, but I am glad to see it's in the works. I'll close this issue even if it's not there yet, as the community already knows about it and is well on its way to achieving it: speculative decoding w/ vLLM. 🎉
Thanks for bringing this up @wasertech! We have an internal prototype for exactly this and it shows good results, but it's blocked on #2188 at the moment.
Looking forward to testing it on my hardware. I am training atm, but I will give your branch a try later, @cadedaniel. Thanks for your amazing contribution 🚀!
You know what, let's keep this issue open so that people who are wondering too know what's up. I (or someone with auth) can close it once #2188 (and the PR that uses it to introduce ngram speculation) is merged.
Closing as duplicate, see #1802
So transformers has introduced support for speculative decoding of ngrams.
huggingface/transformers#27979
It's as simple as passing prompt_lookup_num_tokens=10 to model.generate in newer versions of transformers.
Why would this be useful?
In many cases it can speed up inference by up to 3x!
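For anyone wondering where the speedup comes from, here is a rough sketch of the lookup step as I understand it. This is not the transformers or vLLM implementation; the function name find_candidate_tokens and its parameters max_ngram_size and num_pred_tokens are made up for illustration. The idea is to take the last few tokens generated so far, look for the same n-gram earlier in the sequence (prompt included), and propose the tokens that followed it as draft tokens. The model can then verify those drafts in a single forward pass instead of generating them one at a time.

```python
from typing import List


def find_candidate_tokens(
    token_ids: List[int],
    max_ngram_size: int = 3,
    num_pred_tokens: int = 10,
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the sequence.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    n_tokens = len(token_ids)
    # Prefer longer n-grams: a longer match is more likely to continue the same way.
    for ngram_size in range(min(max_ngram_size, n_tokens - 1), 0, -1):
        suffix = token_ids[n_tokens - ngram_size:]
        # Scan from the most recent earlier position backwards,
        # skipping the position where the suffix itself starts.
        for start in range(n_tokens - ngram_size - 1, -1, -1):
            if token_ids[start:start + ngram_size] == suffix:
                end = start + ngram_size
                # Copy the tokens that followed the earlier occurrence.
                return token_ids[end:end + num_pred_tokens]
    # No earlier occurrence found: fall back to normal decoding.
    return []


# Example: the sequence ends in [1, 2, 3], which also appears at the start,
# so the tokens that followed it there become the draft.
draft = find_candidate_tokens([1, 2, 3, 4, 5, 6, 1, 2, 3], num_pred_tokens=2)
print(draft)
```

This tends to pay off when the output repeats chunks of the input (summarization, code editing, question answering over a document), since that is where an n-gram match is likely to continue the same way.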
I have not looked into it yet, but I think it wouldn't be too complicated to add a parameter to vLLM so that we can use speculative decoding w/ vLLM. At the very least, the speedup could make the trouble worthwhile.
Let me know what you think.