
llama : add example for speculative sampling #2030

Closed
ggerganov opened this issue Jun 28, 2023 · 12 comments

Comments

ggerganov (Owner) commented Jun 28, 2023

Speculative sampling is explained here: https://arxiv.org/abs/2302.01318

In simpler terms:

To start, the "draft" model can be generated with the train-text-from-scratch example, using the same vocab as LLaMA. Later, we can try to utilize better models.

We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one by one. This may not yet be the case, but it will be once we close ggerganov/ggml#293
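For reference, here is a minimal, self-contained sketch of the draft-then-verify loop, with the accept/reject rule as described in arXiv:2302.01318. The `Model`, `sample`, and `speculative_step` names are stand-ins for illustration, not llama.cpp API; in a real implementation the verification loop would be a single batched eval of the main model, which is exactly where the speedup is expected to come from.

```cpp
// Toy sketch of speculative sampling: draft with a cheap model, verify with the main one.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

using Dist  = std::vector<float>;                            // probabilities over a tiny vocab
using Model = std::function<Dist(const std::vector<int>&)>;  // context -> next-token probabilities

static std::mt19937 rng{42};

static int sample(const Dist &p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

// One round: draft n_draft tokens with the cheap model, then accept/reject
// them against the main ("target") model's distributions.
static void speculative_step(const Model &target, const Model &draft,
                             std::vector<int> &ctx, int n_draft) {
    std::vector<int>  drafted;
    std::vector<Dist> q;               // draft-model distribution at each drafted position

    std::vector<int> tmp = ctx;
    for (int i = 0; i < n_draft; ++i) {
        Dist qi = draft(tmp);
        int  t  = sample(qi);
        q.push_back(qi);
        drafted.push_back(t);
        tmp.push_back(t);
    }

    // In a real implementation these n_draft target evaluations would be one batched eval.
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    std::vector<int> verify = ctx;
    for (int i = 0; i < n_draft; ++i) {
        Dist p = target(verify);
        int  t = drafted[i];
        if (unif(rng) < std::min(1.0f, p[t] / std::max(q[i][t], 1e-9f))) {
            ctx.push_back(t);          // accepted: keep the drafted token
            verify.push_back(t);
        } else {
            // rejected: resample from max(p - q, 0), renormalized, and stop
            Dist r(p.size(), 0.0f);
            float sum = 0.0f;
            for (size_t v = 0; v < p.size(); ++v) {
                r[v] = std::max(0.0f, p[v] - q[i][v]);
                sum += r[v];
            }
            if (sum > 0.0f) { for (float &x : r) x /= sum; } else { r = p; }
            ctx.push_back(sample(r));
            return;
        }
    }
    // every draft token accepted: the last target pass yields one more token for free
    ctx.push_back(sample(target(verify)));
}

int main() {
    // Trivial stand-in "models": fixed distributions over a 4-token vocabulary.
    Model target = [](const std::vector<int>&) { return Dist{0.1f, 0.2f, 0.3f, 0.4f}; };
    Model draft  = [](const std::vector<int>&) { return Dist{0.2f, 0.2f, 0.3f, 0.3f}; };

    std::vector<int> ctx = {0};
    for (int i = 0; i < 8; ++i) {
        speculative_step(target, draft, ctx, 4);
    }
    for (int t : ctx) printf("%d ", t);
    printf("\n");
}
```

The point of the accept/reject rule is that the accepted tokens are distributed as if they had been sampled from the main model alone, so output quality is unchanged; only latency is affected.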

ggerganov added the performance, generation quality, and research 🔬 labels on Jun 28, 2023
SlyEcho (Collaborator) commented Jun 29, 2023

Would it make sense to do something like a beam search with the fast model and then evaluate the result with the larger model?

ggerganov (Owner, Author) commented

Yes, this might be even more efficient, since it could increase the "success" rate of the drafted sequence.

evanmiller (Contributor) commented

Note that speculative sampling increases overall compute. The algorithm in the linked paper executes the "main" model in parallel for the speculative sequence:

[Figure: algorithm listing from the paper]

If local compute resources are saturated, then speculative sampling won't decrease prediction latency; the algorithm requires pre-existing parallelism of some kind (either farming out the parallel evaluation, or perhaps a multi-node pipeline architecture). Based on my understanding of llama.cpp's architecture, it doesn't seem like a great fit, but maybe there's a modification that could make it work?

DKormann commented

It increases overall computation, but it also increases parallelisation of inference on the main model, so it can still be faster.
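As a rough illustration of that trade-off: under the simplifying assumption that each draft token is accepted independently with probability `alpha`, a draft length of `gamma` yields about `(1 - alpha^(gamma+1)) / (1 - alpha)` tokens per main-model pass. The sketch below uses a hypothetical draft/target cost ratio `c` and ignores the extra cost of the batched verification pass itself, which is exactly the caveat raised above when compute is already saturated.

```cpp
// Back-of-the-envelope estimate of speculative sampling speedup,
// assuming a constant per-token acceptance rate (a simplification).
#include <cmath>
#include <cstdio>

int main() {
    const double c = 0.1; // hypothetical cost of one draft pass relative to one target pass

    for (double alpha : {0.5, 0.7, 0.9}) {
        for (int gamma : {2, 4, 8}) {
            const double tokens  = (1.0 - std::pow(alpha, gamma + 1)) / (1.0 - alpha);
            const double cost    = 1.0 + c * gamma;   // relative to one target pass
            const double speedup = tokens / cost;     // vs. one token per target pass
            printf("alpha=%.1f gamma=%d -> %.2f tokens/pass, est. speedup %.2fx\n",
                   alpha, gamma, tokens, speedup);
        }
    }
}
```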

ggerganov (Owner, Author) commented

Staged speculative decoding

https://arxiv.org/abs/2308.04623

charliexchen commented Aug 27, 2023

Hey hey -- I'm one of the authors of https://arxiv.org/abs/2302.01318. It's good to see the open source community pick up on this! I'm sadly not in a position to contribute directly, but since this is already on your radar:

  1. You have way fewer FLOPs on a CPU, but at the same time DDR4/DDR5 RAM is also much slower, so it balances out to an extent. Compute resources will get saturated more quickly compared to most accelerators, but there's enough headroom on higher-end CPUs for this to still work. To figure out exactly when this happens, you can just use llama.cpp's batching functionality and time how far you can push the batch size before things start slowing down (see the rough timing sketch after this list).
  2. The smallest llamas still give some decent speedups, but you want to maximise the size difference between the models (without making the drafter too terrible) to get the most out of this. You can see that in the Comparison SSp / RSp chart in https://github.com/dust-tt/llama-ssp (this is running on a GPU, but assuming model time scales proportionally, it's still instructive).
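A crude way to probe point 1 without touching model code is to time one large matrix multiply at increasing batch sizes: as long as streaming the weights from RAM dominates, extra batch columns are nearly free, which is the headroom speculative sampling exploits. This is only a stand-in (no llama.cpp API involved); the real measurement would time llama.cpp's own batched eval, and whether the per-token time actually flattens depends on the build and hardware.

```cpp
// Stand-in experiment: apply one big weight matrix to a growing batch of input
// vectors. Each weight is loaded once per (i, j) and reused across the batch,
// so on a memory-bound machine the per-token cost should drop as the batch
// grows, until compute saturates.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2048;                                  // hypothetical layer size
    std::vector<float> W((size_t) n * n, 0.5f);          // one weight matrix (~16 MB)

    for (int batch : {1, 2, 4, 8, 16}) {
        std::vector<float> X((size_t) batch * n, 1.0f);  // batch of input vectors
        std::vector<float> Y((size_t) batch * n, 0.0f);

        const auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < n; ++j) {
                const float w = W[(size_t) i * n + j];
                for (int b = 0; b < batch; ++b) {
                    Y[(size_t) b * n + i] += w * X[(size_t) b * n + j];
                }
            }
        }
        const auto t1 = std::chrono::steady_clock::now();

        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        printf("batch %2d: %8.2f ms total, %8.2f ms per token\n", batch, ms, ms / batch);
    }
}
```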

ggerganov (Owner, Author) commented Aug 27, 2023

@charliexchen Thanks for stopping by. We are getting close to having everything needed to implement this. Hopefully we will have a prototype soon.

  1. Yes, this matches my understanding:

| Model type | ms/token | Speed improvement |
| --- | --- | --- |
| SSp 30B/7B | 180 ms | 1.8x |

If we can replicate the speed improvement factor on Apple Silicon + Metal, it would be a game changer.

In your experience, if you are generating highly structured text (e.g. source code in some programming language), does it allow you to increase the size difference with the drafter significantly without losing the speed effect? I imagine this would be the case to some extent, since there would be many "easy-to-predict" tokens in such cases.

charliexchen commented

In our paper we got much higher speedups for the HumanEval code generation task compared to XSUM using the same model pairing, so acceptance rate is indeed rather task specific. If you have an "easier" task in some sense, then shrinking the drafter is absolutely on the table.

evanmiller (Contributor) commented

@charliexchen Did you consider using the same model as a draft model? I mean, after layer K < N, immediately sample the output to form a draft token.

charliexchen commented Aug 27, 2023

This seems related to CALM (which is mentioned in one of the other threads). It should work, but you need to explicitly train/finetune the model to handle that.

The nice thing about spec sampling is that you don't have to touch the target model at all.
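For what it's worth, here is a toy sketch of the early-exit idea from this exchange (all types and functions are illustrative stand-ins, not llama.cpp API): run only the first `n_exit` of the target model's layers and apply the shared output head to draft, then run the full stack to verify. As noted above, the intermediate hidden states are not trained to feed the output head, so this would likely need CALM-style finetuning before the drafts are any good.

```cpp
// Toy sketch: use a truncated run of the same model as its own drafter.
#include <functional>
#include <vector>

using Hidden = std::vector<float>;
using Layer  = std::function<Hidden(const Hidden&)>;
using Dist   = std::vector<float>;
using Head   = std::function<Dist(const Hidden&)>;

// Run only the first n_exit layers and project with the shared output head (draft).
static Dist draft_logits(const std::vector<Layer> &layers, const Head &head,
                         Hidden h, size_t n_exit) {
    for (size_t i = 0; i < n_exit && i < layers.size(); ++i) h = layers[i](h);
    return head(h); // same head as the full model; only the depth differs
}

// Run the full stack: this is the "target" distribution used for verification.
static Dist target_logits(const std::vector<Layer> &layers, const Head &head, Hidden h) {
    for (const auto &layer : layers) h = layer(h);
    return head(h);
}

int main() {
    // Two trivial stand-in layers and a head over a 3-token vocabulary.
    std::vector<Layer> layers = {
        [](const Hidden &h) { Hidden o = h; for (float &x : o) x += 1.0f; return o; },
        [](const Hidden &h) { Hidden o = h; for (float &x : o) x *= 2.0f; return o; },
    };
    Head head = [](const Hidden &h) { return Dist{h[0], h[1], h[2]}; };

    Hidden h = {0.1f, 0.2f, 0.3f};
    Dist q = draft_logits(layers, head, h, 1);   // early exit after layer 1 -> draft
    Dist p = target_logits(layers, head, h);     // full depth -> verification
    (void) q; (void) p;
}
```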

ggerganov (Owner, Author) commented

I'll try to do a PoC of speculative sampling today - will post a branch when I get something running

ggerganov (Owner, Author) commented

Closed via #2926
