Feature: Speculative sampling / Assisted Generation #169
An obvious feature to me, but also not one that is simple to implement: is speculative sampling on the roadmap?
The idea would be to use a second, tiny model whose draft tokens are validated by the main model, e.g. greedily.
For more information:
https://huggingface.co/blog/assisted-generation
Example models for speculative sampling:
https://huggingface.co/bigcode/tiny_starcoder_py
Related frameworks:
huggingface/text-generation-inference#1169
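For illustration, a minimal sketch of the proposed scheme with greedy validation: a tiny draft model proposes a few tokens, and the main model checks all of them in a single forward pass. The model pair and the `num_draft_tokens` parameter are example choices for this sketch, not an existing API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example pair: both models share the StarCoder tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bigcode/tiny_starcoder_py")
draft = AutoModelForCausalLM.from_pretrained("bigcode/tiny_starcoder_py")
target = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")

@torch.no_grad()
def speculative_step(input_ids, num_draft_tokens=4):
    prompt_len = input_ids.shape[1]

    # 1. The tiny draft model proposes a few tokens greedily (cheap).
    draft_out = draft.generate(input_ids, max_new_tokens=num_draft_tokens,
                               do_sample=False)
    proposed = draft_out[:, prompt_len:]

    # 2. The main model scores prompt + proposal in ONE forward pass.
    logits = target(draft_out).logits
    # logits[:, i] predicts token i+1, so these are the target's greedy
    # choices at each proposed position.
    preds = logits[:, prompt_len - 1:-1].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and target agree, then
    #    append the target's own token at the first mismatch (or the end),
    #    so every step yields at least one token of target-greedy output.
    agree = (preds == proposed)[0].long()
    n_accept = int(agree.cumprod(dim=0).sum())
    next_tok = logits[:, prompt_len - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, proposed[:, :n_accept], next_tok], dim=-1)
```

Looping `speculative_step` reproduces the main model's greedy output exactly; the speed-up comes from the target model verifying several tokens per forward pass whenever the draft guesses correctly.

Comments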
Yes, it's on the roadmap.
Hi @michaelfeil, speculative decoding is on our roadmap. Are you looking at a draft model, self-speculative decoding, or something else? Also, we will soon put out a rough set of the requests we've received along with a rough roadmap, so stay tuned!
Will speculative decoding support the Python runtime?
Hi @ywran, support will be added to the C++ runtime first. We are also taking the question of a Python runtime seriously and are evaluating the best approach to offering a Python binding of our C++ runtime. We do not have a concrete timeline for now, but we are going to keep everyone updated as we make progress. Thanks,
Primarily looking into speeding up e.g. starcoder (15B) with "old style" assisted generation, using e.g. https://huggingface.co/bigcode/tiny_starcoder_py as the draft model. If you have additional ideas, I am open to discussing them.
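For reference, the Hugging Face transformers library already exposes this "old style" assisted generation through the `assistant_model` argument of `generate`. A sketch using the model pair mentioned above (prompt and generation settings are arbitrary examples):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder", device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained("bigcode/tiny_starcoder_py").to(model.device)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# The tiny model drafts tokens; the 15B model only verifies them.
out = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```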
Looks like vLLM is close to having this feature implemented: vllm-project/vllm#1679. Any news for TensorRT?
We are making the required improvements to the MHA kernels right now and are looking at a few different techniques for speculative decoding. Keep an eye out over the next few weeks for more details.
Hey @ncomly-nvidia, can you please tell us when we will be able to test assisted generation on TensorRT?
Hey @Dev-hestabit. Our goal is to have a functional preview of speculative decoding in the next release (<1 month). We'll be sure to include it in the discussion when it is added to main, and in the release notes once it is officially included.
@ncomly-nvidia What models are planned to have support for speculative decoding? |
Hi @shannonphu, we are starting with Llama variants. What models are you interested in?
@ncomly-nvidia I am interested in encoder-decoder models like T5/FLAN-T5. I am not sure if it's possible to do speculative decoding on enc-dec though :)
@ncomly-nvidia I have been going through the TensorRT backend commits, and four days ago there was an update for speculative decoding deployment. Can we try speculative decoding with the tensorrtllm backend? Is there any document that can help us, as there is no update in that repo's README?
Yep! We're working on an example w/ docs now - there is an implementation you can reference here |
Assisted generation is implemented in transformers: https://huggingface.co/blog/gemma-july-update#assisted-generation. We need TensorRT to add assisted generation.
Please refer to https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html |