llama : add example for speculative sampling #2030
Comments
Would it make sense to do something like a beam search with the fast model and then evaluate the result with the larger model?
Yes, this might be even more efficient, since it could increase the "success" rate of the drafted sequence
Note that speculative sampling increases overall compute. The algorithm in the linked paper executes the "main" model in parallel over the speculative sequence: if local compute resources are already saturated, speculative sampling won't decrease prediction latency; the algorithm requires pre-existing parallelism of some kind (either farming out the parallel evaluation or perhaps a multi-node pipeline architecture). Based on my understanding of llama.cpp's architecture, it doesn't seem like a great fit, but maybe there's a modification that can be made to work?
It increases overall computation, but it also increases parallelism in the main model's inference, so it can still be faster. A rough cost model is sketched below.
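As an aside (not part of the thread): a back-of-envelope latency model for this trade-off, assuming each drafted token is accepted independently with probability `alpha`, so the expected number of tokens produced per draft-and-verify iteration is `(1 - alpha^(K+1)) / (1 - alpha)`, as derived in the paper. The cost constants below are made-up placeholders, not measurements.

```cpp
// Rough latency model for speculative sampling (not llama.cpp code).
// Assumption: drafted tokens are accepted independently with probability alpha.
#include <cmath>
#include <cstdio>

int main() {
    const double alpha        = 0.8;  // per-token acceptance rate (task dependent)
    const int    K            = 4;    // number of drafted tokens per iteration
    const double c_draft      = 1.0;  // cost of one draft-model forward pass
    const double c_main       = 10.0; // cost of one main-model pass over a single token
    const double c_main_batch = 12.0; // cost of one main-model pass over K+1 tokens
                                      // (assumes batched eval is only slightly more
                                      //  expensive than a single-token pass)

    // Expected tokens generated per draft-and-verify iteration.
    const double tokens_per_iter = (1.0 - std::pow(alpha, K + 1)) / (1.0 - alpha);

    // Cost per iteration: K draft passes + one batched verification pass.
    const double cost_per_iter = K * c_draft + c_main_batch;

    // Baseline: one main-model pass per generated token.
    const double speedup = tokens_per_iter * c_main / cost_per_iter;

    std::printf("expected tokens/iter: %.2f\n", tokens_per_iter);
    std::printf("estimated speedup:    %.2fx\n", speedup);
    return 0;
}
```

Under these toy numbers the extra draft passes are more than paid for by producing several accepted tokens per batched main-model call; with a low acceptance rate or expensive batching, the speedup disappears, which matches the caveat above.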
Staged speculative decoding |
Hey hey -- I'm one of the authors of https://arxiv.org/abs/2302.01318. It's good to see the open source community pick up on this! I'm sadly not in a position to contribute directly, but since this is already on your radar:
@charliexchen Thanks for stopping by. We are getting close to having everything needed to implement this. Hopefully we will have a prototype soon.
If we can replicate the speed improvement factor on Apple Silicon + Metal, it would be a game changer. In your experience, if you are generating highly structured text (e.g. source code in some programming language), does it allow you to increase the size difference with the drafter significantly without losing the speed effect? I imagine this would be the case to some extent, since there would be many "easy-to-predict" tokens in such cases.
In our paper we got much higher speedups for the HumanEval code generation task compared to XSUM using the same model pairing, so the acceptance rate is indeed rather task-specific. If you have an "easier" task in some sense, then shrinking the drafter is absolutely on the table.
@charliexchen Did you consider using the same model as a draft model? I mean, after layer K < N, immediately sample the output to form a draft token.
This seems related to CALM (which is mentioned in one of the other threads). It should work, but you need to explicitly train/finetune the model to handle that. The nice thing about spec sampling is that you don't have to touch the target model at all.
I'll try to do a PoC of speculative sampling today - will post a branch when I get something running
Closed via #2926
Speculative sampling is explained here: https://arxiv.org/abs/2302.01318
In simpler terms here:
To start, the "draft" model can be generated using the train-text-from-scratch example with the same vocab as LLaMA. Later, we can try to utilize better models.
We also assume that batching multiple tokens with the "main" model is significantly faster compared to processing the tokens one-by-one. This may not yet be the case, but it will be when we close ggerganov/ggml#293
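Not part of the original issue text: a toy, self-contained sketch of the draft-and-verify loop described in the paper (arXiv:2302.01318). The `draft_probs()`/`main_probs()` functions, the 8-token vocabulary, and all constants are made-up stand-ins rather than the llama.cpp API; in the real example the main model would score all K drafted positions in a single batched forward pass, which is exactly what the ggml#293 work enables.

```cpp
// Toy sketch of speculative sampling's draft-and-verify loop (accept/reject rule
// from arXiv:2302.01318). NOT the llama.cpp API: draft_probs() and main_probs()
// are made-up stand-ins over a tiny vocab so the control flow is easy to follow.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

using Dist = std::vector<double>;   // probability distribution over the vocab

static const int VOCAB = 8;
static std::mt19937 rng(42);

// Stand-in for the small "draft" model: slightly prefers one token per position.
static Dist draft_probs(const std::vector<int>& ctx) {
    Dist p(VOCAB, 1.0 / VOCAB);
    p[ctx.size() % VOCAB] += 0.5;
    for (auto& v : p) v /= 1.5;     // renormalize (base mass 1.0 + bonus 0.5)
    return p;
}

// Stand-in for the large "main" model: prefers a different token, so some
// drafted tokens get rejected.
static Dist main_probs(const std::vector<int>& ctx) {
    Dist p(VOCAB, 1.0 / VOCAB);
    p[(ctx.size() + 1) % VOCAB] += 0.5;
    for (auto& v : p) v /= 1.5;
    return p;
}

static int sample(const Dist& p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

int main() {
    std::vector<int> ctx = {0};     // "prompt" / context so far
    const int K = 4;                // tokens drafted per iteration
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    for (int iter = 0; iter < 8; ++iter) {
        // 1) Draft K tokens with the small model, one at a time.
        std::vector<int>  drafted;
        std::vector<Dist> q;        // draft distribution at each drafted position
        std::vector<int>  tmp = ctx;
        for (int i = 0; i < K; ++i) {
            q.push_back(draft_probs(tmp));
            drafted.push_back(sample(q.back()));
            tmp.push_back(drafted.back());
        }

        // 2) Score positions 0..K with the main model. Done sequentially here;
        //    in llama.cpp this is the part that should be a single batched eval.
        std::vector<Dist> p;
        tmp = ctx;
        for (int i = 0; i <= K; ++i) {
            p.push_back(main_probs(tmp));
            if (i < K) tmp.push_back(drafted[i]);
        }

        // 3) Accept each drafted token x with probability min(1, p(x)/q(x));
        //    on the first rejection, resample from max(p - q, 0), renormalized.
        int accepted = 0;
        for (; accepted < K; ++accepted) {
            const int x = drafted[accepted];
            if (unif(rng) < std::min(1.0, p[accepted][x] / q[accepted][x])) {
                ctx.push_back(x);
            } else {
                Dist r(VOCAB);
                double s = 0.0;
                for (int t = 0; t < VOCAB; ++t) {
                    r[t] = std::max(0.0, p[accepted][t] - q[accepted][t]);
                    s += r[t];
                }
                for (auto& v : r) v /= s;
                ctx.push_back(sample(r));
                break;
            }
        }

        // 4) If every drafted token was accepted, take one extra token from the
        //    main model's distribution at position K (already computed above).
        if (accepted == K) ctx.push_back(sample(p[K]));

        std::printf("iter %d: accepted %d/%d drafted tokens, context length %zu\n",
                    iter, accepted, K, ctx.size());
    }
    return 0;
}
```

The accept/reject rule is what makes the output distribution match the main model exactly, regardless of how good or bad the drafter is; the drafter only affects how many tokens survive per batched verification pass.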