Feature: Speculative sampling / Assisted Generation #169
Comments
Yes, it's on the roadmap.
Hi @michaelfeil, Speculative Decoding is on our roadmap. Are you looking for a draft model or at self-speculative decoding? Or something else? Also, we will soon put out a rough set of requests we've received along with a rough roadmap, so stay tuned!
Will speculative decoding support the Python runtime?
Hi @ywran, The support will be added to the C++ runtime first. We are also taking the question of a Python runtime seriously and are evaluating the best approach to offer a Python binding of our C++ runtime. We do not have a concrete timeline for now, but we will keep everyone updated as we make progress. Thanks,
Primarily looking into speeding up e.g. starcoder (15B) with "old style" assisted generation using a small draft model, e.g. https://huggingface.co/bigcode/tiny_starcoder_py. If you have additional ideas, I am open to discussing them.
Looks like vllm is close to having this feature implemented: vllm-project/vllm#1679. Any news for TensorRT? |
We are making the required improvements to the MHA kernels right now & are looking at a few different techniques for speculative decoding. Keep an eye out over the next few weeks for more details.
Hey @ncomly-nvidia, can you please tell us when we will be able to test assisted generation on TensorRT?
Hey @Dev-hestabit. Our goal is to have a functional preview of speculative decoding in the next release (<1 month). We'll be sure to mention it in the discussion when it is added to main, and in the release notes once it is officially included.
@ncomly-nvidia What models are planned to have support for speculative decoding? |
Hi @shannonphu we are starting with Llama variants. What models are you interested in? |
@ncomly-nvidia I am interested in encoder-decoder type models like T5/FLAN-T5. I am not sure if it's possible to do speculative decoding on enc-dec though :)
@ncomly-nvidia I have been going through the TensorRT backend commits, and 4 days ago there was an update for speculative decoding deployment. Can we try speculative decoding with the TensorRT-LLM backend? Is there any document that can help us, as there is no update in the README for that repo?
Yep! We're working on an example w/ docs now - there is an implementation you can reference here |
Assisted generation is implemented in transformers: https://huggingface.co/blog/gemma-july-update#assisted-generation. We need TensorRT to add assisted generation.
An obvious feature to me, but also not one that is simple to implement: is speculative sampling on the roadmap?
The idea would be to use a second tiny model, e.g. for greedy validation of the main model's outputs (see the sketch after the links below).
For more information:
https://huggingface.co/blog/assisted-generation
Example models for Speculative sampling:
https://huggingface.co/bigcode/tiny_starcoder_py
Related frameworks:
huggingface/text-generation-inference#1169