
llama : add example for speculative sampling #2030

Closed
ggerganov opened this issue Jun 28, 2023 · 12 comments

Comments

ggerganov (Owner) commented Jun 28, 2023

Speculative sampling is explained here: https://arxiv.org/abs/2302.01318

In simpler terms:

To start, the "draft" model can be generated with the train-text-from-scratch example, using the same vocab as LLaMA. Later, we can try to utilize better models.

We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one by one. This may not yet be the case, but it will be once we close ggerganov/ggml#293
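For reference, here is a minimal, self-contained sketch of the draft-then-verify loop, with the accept/reject rule as described in arXiv:2302.01318. The `Model`, `sample`, and `speculative_step` names are stand-ins for illustration, not llama.cpp API; in a real implementation the verification loop would be a single batched eval of the main model, which is exactly where the speedup is expected to come from.

```cpp
// Toy sketch of speculative sampling: draft with a cheap model, verify with the main one.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

using Dist  = std::vector<float>;                            // probabilities over a tiny vocab
using Model = std::function<Dist(const std::vector<int>&)>;  // context -> next-token probabilities

static std::mt19937 rng{42};

static int sample(const Dist &p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

// One round: draft n_draft tokens with the cheap model, then accept/reject
// them against the main ("target") model's distributions.
static void speculative_step(const Model &target, const Model &draft,
                             std::vector<int> &ctx, int n_draft) {
    std::vector<int>  drafted;
    std::vector<Dist> q;               // draft-model distribution at each drafted position

    std::vector<int> tmp = ctx;
    for (int i = 0; i < n_draft; ++i) {
        Dist qi = draft(tmp);
        int  t  = sample(qi);
        q.push_back(qi);
        drafted.push_back(t);
        tmp.push_back(t);
    }

    // In a real implementation these n_draft target evaluations would be one batched eval.
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    std::vector<int> verify = ctx;
    for (int i = 0; i < n_draft; ++i) {
        Dist p = target(verify);
        int  t = drafted[i];
        if (unif(rng) < std::min(1.0f, p[t] / std::max(q[i][t], 1e-9f))) {
            ctx.push_back(t);          // accepted: keep the drafted token
            verify.push_back(t);
        } else {
            // rejected: resample from max(p - q, 0), renormalized, and stop
            Dist r(p.size(), 0.0f);
            float sum = 0.0f;
            for (size_t v = 0; v < p.size(); ++v) {
                r[v] = std::max(0.0f, p[v] - q[i][v]);
                sum += r[v];
            }
            if (sum > 0.0f) { for (float &x : r) x /= sum; } else { r = p; }
            ctx.push_back(sample(r));
            return;
        }
    }
    // every draft token accepted: the last target pass yields one more token for free
    ctx.push_back(sample(target(verify)));
}

int main() {
    // Trivial stand-in "models": fixed distributions over a 4-token vocabulary.
    Model target = [](const std::vector<int>&) { return Dist{0.1f, 0.2f, 0.3f, 0.4f}; };
    Model draft  = [](const std::vector<int>&) { return Dist{0.2f, 0.2f, 0.3f, 0.3f}; };

    std::vector<int> ctx = {0};
    for (int i = 0; i < 8; ++i) {
        speculative_step(target, draft, ctx, 4);
    }
    for (int t : ctx) printf("%d ", t);
    printf("\n");
}
```

The point of the accept/reject rule is that the accepted tokens are distributed as if they had been sampled from the main model alone, so output quality is unchanged; only latency is affected.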

ggerganov added the performance, generation quality, and research 🔬 labels on Jun 28, 2023
SlyEcho (Collaborator) commented Jun 29, 2023

Would it make sense to do something like a beam search with the fast model and then evaluate the result with the larger model?

ggerganov (Owner, Author) commented

Yes, this might be even more efficient, since it could increase the "success" rate of the drafted sequence.

evanmiller (Contributor) commented

Note that speculative sampling increases overall compute. The algorithm in the linked paper executes the "main" model in parallel for the speculative sequence:

[Figure: algorithm listing from the paper]

If local compute resources are saturated, then speculative sampling won't decrease prediction latency; the algorithm requires pre-existing parallelism of some kind (either farming out the parallel evaluation, or perhaps a multi-node pipeline architecture). Based on my understanding of llama.cpp's architecture, it doesn't seem like a great fit, but maybe there's a modification that could make it work?

DKormann commented

It increases overall computation, but it also increases parallelisation of inference on the main model, so it can still be faster.
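As a rough illustration of that trade-off: under the simplifying assumption that each draft token is accepted independently with probability `alpha`, a draft length of `gamma` yields about `(1 - alpha^(gamma+1)) / (1 - alpha)` tokens per main-model pass. The sketch below uses a hypothetical draft/target cost ratio `c` and ignores the extra cost of the batched verification pass itself, which is exactly the caveat raised above when compute is already saturated.

```cpp
// Back-of-the-envelope estimate of speculative sampling speedup,
// assuming a constant per-token acceptance rate (a simplification).
#include <cmath>
#include <cstdio>

int main() {
    const double c = 0.1; // hypothetical cost of one draft pass relative to one target pass

    for (double alpha : {0.5, 0.7, 0.9}) {
        for (int gamma : {2, 4, 8}) {
            const double tokens  = (1.0 - std::pow(alpha, gamma + 1)) / (1.0 - alpha);
            const double cost    = 1.0 + c * gamma;   // relative to one target pass
            const double speedup = tokens / cost;     // vs. one token per target pass
            printf("alpha=%.1f gamma=%d -> %.2f tokens/pass, est. speedup %.2fx\n",
                   alpha, gamma, tokens, speedup);
        }
    }
}
```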

ggerganov (Owner, Author) commented

Staged speculative decoding

https://arxiv.org/abs/2308.04623

charliexchen commented Aug 27, 2023

Hey hey -- I'm one of the authors of https://arxiv.org/abs/2302.01318. It's good to see the open source community pick up on this! I'm sadly not in a position to contribute directly, but since this is already on your radar:

  1. You have way fewer FLOPs on a CPU, but at the same time DDR4/DDR5 RAM is also much slower, so it balances out to an extent. Compute resources will get saturated more quickly compared to most accelerators, but there's enough headroom on higher-end CPUs for this to still work. To figure out exactly when this happens, you can just use llama.cpp's batching functionality and time how far you can push the batch size before things start slowing down (see the rough timing sketch after this list).
  2. The smallest llamas still give some decent speedups, but you want to maximise the size difference between the models (without making the drafter too terrible) to get the most out of this. You can see that in the Comparison SSp / RSp chart in https://github.com/dust-tt/llama-ssp (this is running on a GPU, but assuming model time scales proportionally, it's still instructive).
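A crude way to probe point 1 without touching model code is to time one large matrix multiply at increasing batch sizes: as long as streaming the weights from RAM dominates, extra batch columns are nearly free, which is the headroom speculative sampling exploits. This is only a stand-in (no llama.cpp API involved); the real measurement would time llama.cpp's own batched eval, and whether the per-token time actually flattens depends on the build and hardware.

```cpp
// Stand-in experiment: apply one big weight matrix to a growing batch of input
// vectors. Each weight is loaded once per (i, j) and reused across the batch,
// so on a memory-bound machine the per-token cost should drop as the batch
// grows, until compute saturates.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2048;                                  // hypothetical layer size
    std::vector<float> W((size_t) n * n, 0.5f);          // one weight matrix (~16 MB)

    for (int batch : {1, 2, 4, 8, 16}) {
        std::vector<float> X((size_t) batch * n, 1.0f);  // batch of input vectors
        std::vector<float> Y((size_t) batch * n, 0.0f);

        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                const float w = W[(size_t) i * n + j];
                for (int b = 0; b < batch; ++b) {
                    Y[(size_t) b * n + i] += w * X[(size_t) b * n + j];
                }
            }
        }
        const auto t1 = std::chrono::steady_clock::now();

        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("batch %2d: %8.2f ms total, %8.2f ms per token\n", batch, ms, ms / batch);
    }
}
```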

ggerganov (Owner, Author) commented Aug 27, 2023

@charliexchen Thanks for stopping by. We are getting close to having everything needed to implement this. Hopefully we will have a prototype soon.

  1. Yes, this matches my understanding:

| Model type | ms/token | Speed improvement |
| --- | --- | --- |
| SSp 30B/7B | 180 ms | 1.8x |

If we can replicate the speed improvement factor on Apple Silicon + Metal, it would be a game changer.

In your experience, if you are generating highly structured text (e.g. source code in some programming language), does it allow you to increase the size difference with the drafter significantly without losing the speed effect? I imagine this would be the case to some extent, since there would be many "easy-to-predict" tokens in such cases.

charliexchen commented

In our paper we got much higher speedups for the HumanEval code generation task compared to XSUM using the same model pairing, so acceptance rate is indeed rather task specific. If you have an "easier" task in some sense, then shrinking the drafter is absolutely on the table.

evanmiller (Contributor) commented

@charliexchen Did you consider using the same model as a draft model? I mean, after layer K < N, immediately sample the output to form a draft token.

charliexchen commented Aug 27, 2023

This seems related to CALM (which is mentioned in one of the other threads). It should work, but you need to explicitly train/finetune the model to handle that.

The nice thing about spec sampling is that you don't have to touch the target model at all.
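For what it's worth, here is a toy sketch of the early-exit idea from this exchange (all types and functions are illustrative stand-ins, not llama.cpp API): run only the first `n_exit` of the target model's layers and apply the shared output head to draft, then run the full stack to verify. As noted above, the intermediate hidden states are not trained to feed the output head, so this would likely need CALM-style finetuning before the drafts are any good.

```cpp
// Toy sketch: use a truncated run of the same model as its own drafter.
#include <functional>
#include <vector>

using Hidden = std::vector<float>;
using Layer  = std::function<Hidden(const Hidden&)>;
using Dist   = std::vector<float>;
using Head   = std::function<Dist(const Hidden&)>;

// Run only the first n_exit layers and project with the shared output head (draft).
static Dist draft_logits(const std::vector<Layer> &layers, const Head &head,
                         Hidden h, size_t n_exit) {
    for (size_t i = 0; i < n_exit && i < layers.size(); ++i) h = layers[i](h);
    return head(h); // same head as the full model; only the depth differs
}

// Run the full stack: this is the "target" distribution used for verification.
static Dist target_logits(const std::vector<Layer> &layers, const Head &head, Hidden h) {
    for (const auto &layer : layers) h = layer(h);
    return head(h);
}

int main() {
    // Two trivial stand-in layers and a head over a 3-token vocabulary.
    std::vector<Layer> layers = {
        [](const Hidden &h) { Hidden o = h; for (float &x : o) x += 1.0f; return o; },
        [](const Hidden &h) { Hidden o = h; for (float &x : o) x *= 2.0f; return o; },
    };
    Head head = [](const Hidden &h) { return Dist{h[0], h[1], h[2]}; };

    Hidden h = {0.1f, 0.2f, 0.3f};
    Dist q = draft_logits(layers, head, h, 1);   // early exit after layer 1 -> draft
    Dist p = target_logits(layers, head, h);     // full depth -> verification
    (void) q; (void) p;
}
```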

ggerganov (Owner, Author) commented

I'll try to do a PoC of speculative sampling today - will post a branch when I get something running

ggerganov (Owner, Author) commented

Closed via #2926
