Feature: Speculative sampling / Assisted Generation #169

Closed
michaelfeil opened this issue Oct 27, 2023 · 16 comments
Labels: feature request (New feature or request) · triaged (Issue has been triaged by maintainers)

Comments

@michaelfeil

An obvious feature to me, but also not one that is simple to implement: is speculative sampling on the roadmap?

The idea would be to use a second, tiny draft model whose proposed tokens are then validated (e.g. greedily) by the main model; a rough sketch of that loop follows the links below.
For more information:
https://huggingface.co/blog/assisted-generation

Example models for speculative sampling:
https://huggingface.co/bigcode/tiny_starcoder_py

Related frameworks:
huggingface/text-generation-inference#1169
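
For concreteness, here is a minimal sketch of the draft-and-verify loop described above, assuming a Hugging Face transformers environment. The gpt2/distilgpt2 pair is only a stand-in chosen because the two models share a tokenizer, and the loop recomputes full forward passes instead of reusing KV caches, so it illustrates the algorithm rather than the speedup:

```python
# Hypothetical sketch of greedy speculative decoding (draft proposes, target
# verifies). Not TensorRT-LLM's implementation; model names are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")       # "main" model
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # tiny draft model

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 48, k: int = 4) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1) The draft model proposes k tokens greedily.
        draft_ids = ids
        for _ in range(k):
            next_id = draft(draft_ids).logits[:, -1, :].argmax(-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_id], dim=1)
        # 2) The target model scores the whole proposed block in ONE forward
        #    pass: greedy picks for the k drafted positions plus a bonus token.
        logits = target(draft_ids).logits
        target_pred = logits[:, ids.shape[1] - 1 :, :].argmax(-1)  # (1, k+1)
        proposed = draft_ids[:, ids.shape[1] :]                    # (1, k)
        # 3) Accept the longest prefix where draft and target agree, then
        #    append the target's own token at the first mismatch (or the bonus
        #    token if all k matched), so every iteration makes progress.
        agree = (target_pred[:, :k] == proposed)[0].long()
        n_accept = int(agree.cumprod(0).sum())
        ids = torch.cat(
            [ids, proposed[:, :n_accept], target_pred[:, n_accept : n_accept + 1]],
            dim=1,
        )
    return tokenizer.decode(ids[0][: prompt_len + max_new_tokens])

print(speculative_generate("The largest city in France is"))
```

Because validation is greedy, the output is identical to what the target model would produce on its own; the draft model only changes how many target forward passes are needed.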

@anjalibshah

Yes, it's on the roadmap.

@ncomly-nvidia (Collaborator)

Hi @michaelfeil, speculative decoding is on our roadmap. Are you looking for a draft model or self-speculative decoding? Or something else?

Also, we will soon put out a rough set of requests we've received along with a rough roadmap, so stay tuned!

@ywran

ywran commented Oct 29, 2023

Will speculative decoding be supported in the Python runtime?

@jdemouth-nvidia (Collaborator)

Hi @ywran ,

The support will be added to the C++ runtime first. We are also taking the question of a Python runtime seriously and are evaluating the best approach to offer a Python binding of our C++ runtime. We do not have a concrete timeline for now, but we will keep everyone updated as we make progress.

Thanks,
Julien

@ncomly-nvidia added the triaged (Issue has been triaged by maintainers) label on Nov 6, 2023
@michaelfeil (Author)

> Hi @michaelfeil, speculative decoding is on our roadmap. Are you looking for a draft model or self-speculative decoding? Or something else?
>
> Also, we will soon put out a rough set of requests we've received along with a rough roadmap, so stay tuned!

Primarily looking into speeding up, e.g., StarCoder (15B) with "old-style" assisted generation using a small draft model such as https://huggingface.co/bigcode/tiny_starcoder_py (see the example below). If you have additional ideas, I am open to discussing them.
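
For reference, transformers exposes this "old-style" assisted generation through the assistant_model argument of generate(). A sketch with the model IDs linked in this thread; note that bigcode/starcoder is a gated 15B checkpoint, so the snippet is illustrative and assumes you have access and enough GPU memory (device_map="auto" additionally requires accelerate):

```python
# Assisted generation on the transformers side, i.e. the behavior this issue
# asks TensorRT-LLM to match. Any target/draft pair sharing a tokenizer works.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "bigcode/starcoder"         # 15B main model to speed up (gated)
draft_id = "bigcode/tiny_starcoder_py"  # tiny draft model from this thread

tokenizer = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_id).to(model.device)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# assistant_model switches generate() into assisted mode: the draft proposes
# candidate tokens and the main model validates them in bulk.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```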

@jFkd1

jFkd1 commented Nov 16, 2023

Looks like vLLM is close to having this feature implemented: vllm-project/vllm#1679. Any news for TensorRT-LLM?

@ncomly-nvidia (Collaborator)

ncomly-nvidia commented Nov 22, 2023

We are making the required improvements to the MHA kernels right now & are looking at a few different techniques for speculative decoding. Keep an eye out over the next few weeks for more details.

@Dev-hestabit

Hey @ncomly-nvidia, can you tell us when we will be able to test assisted generation on TensorRT-LLM?

@ncomly-nvidia (Collaborator)

Hey @Dev-hestabit. Our goal is to have a functional preview of speculative decoding in the next release (<1 month). We'll be sure to mention it in the discussions when it is added to main, and in the release notes once it is officially included.

@shannonphu

@ncomly-nvidia What models are planned to have support for speculative decoding?

@ncomly-nvidia (Collaborator)

Hi @shannonphu, we are starting with Llama variants. What models are you interested in?

@shannonphu

shannonphu commented Dec 11, 2023

@ncomly-nvidia I am interested in encoder-decoder type models like T5/FLAN-T5. I am not sure if it's possible to do speculative decoding on enc-dec models, though :)

@MrD005

MrD005 commented Dec 12, 2023

@ncomly-nvidia I have been going through the TensorRT-LLM backend commits, and four days ago there was an update for speculative decoding deployment:
triton-inference-server/tensorrtllm_backend@1309995

Can we try speculative decoding with the tensorrtllm_backend?

Is there any documentation that can help us? There is no update in the README for that repo.

@ncomly-nvidia (Collaborator)

Yep!

We're working on an example with docs now; there is an implementation you can reference here.

@Alireza3242

Assisted generation is implemented in transformers:

https://huggingface.co/blog/gemma-july-update#assisted-generation

We need TensorRT-LLM to add assisted generation.

@nv-guomingz (Collaborator)

Please refer to https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html, and feel free to reopen this ticket if needed.
