[Prompt Processing] Is there a way to speed up prompt processing for Metal? (M1/M2) #2428
Comments
I've noticed this myself. I looked at Karpathy's llama2 implementation and saw that llama.cpp is not using the usual SentencePiece lookup from HF, so tokenization is much slower; longer prompts are almost unusable. However, I seem to have a bad link to ggml-vocab.bin, so maybe that is my problem. Not sure if this discussion is still relevant.
This is most likely not an issue with the tokenizer.
Have you tried CLBlast? It seems OpenCL support on M1/M2 is hit or miss, but on supported GPU platforms it helps a lot with prompt processing speed.
Is it possible to use CLBlast just for prompt processing?
Currently I don't think it's possible (if you use CLBlast, you have to use OpenCL or CPU for inference). However, if you get great results with CLBlast prompt processing + Metal inference, someone will likely put in the PR for it 😁
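For anyone who wants to try both paths, here is a minimal build sketch, assuming the Makefile options from around this time (the model path and prompt are placeholders; check the README for your checkout):

```sh
# Build with CLBlast (OpenCL) — assumes CLBlast is installed on the system
make clean && make LLAMA_CLBLAST=1

# Or build with Metal for Apple-GPU inference
make clean && make LLAMA_METAL=1

# Example run with the Metal build, offloading to the Apple GPU
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -p "Hello"
```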
@Jchang4, are you using a batch size of 1?
I think batch size 8 is the default for llama-cpp-python. Let me try with 1.
Edit: is this normal?
Yeah. It's a bug. But it means Metal (GPU) prompt evaluation is being used. If you want to see your tokens per second, just add "-n 1" (limit the number of generated tokens to 1). The eval time will show you your "ms per token" / "tokens per second" for comparison against CPU.
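A minimal sketch of that measurement (model and prompt file paths are placeholders):

```sh
# Process a long prompt but generate only a single token (-n 1), so the
# printed "prompt eval time" reflects Metal prompt processing almost
# exclusively. -ngl 1 offloads to the Apple GPU; drop it to compare
# against a CPU-only run.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -f long-prompt.txt -n 1
```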
Huge improvement, thanks @colinc!
Just wanted to circle back to this. The recent commit (bf83bff) by @lshzh-ww completely invalidates the advice to use a batch size of 1. I'm still running performance tuning to understand the full impact; however, it's a significant increase. Prompt length does seem to impact overall TPS.
The default setting for the batch size (512) should be fine.
@michaeljelly While I agree with @lshzh-ww that the 512 default is fine, we've found that for our specific use case and setup, a batch size of 224 has a performance increase of ~4.5% over a batch size of 512. But this may be super specific to an M[x] Ultra (800 GB/s memory bandwidth), so you may have to run your own performance tuning. The llama-bench utility that was recently added is extremely helpful.

I ran a quick test, with prompt lengths varying from ~350 to ~1750 and batch sizes of 224, 256, and 512, so you could see the tokens per second of a 70B Llama 2 model @ q6_K quantization on an M2 Ultra. You can see that 224 and 256 beat out 512 marginally and, for the majority of the time, 224 slightly outperforms 256. Hope that helps.

Edit: I should have put a legend on that chart, for clarity. The x-axis is the number of prompt tokens, the y-axis is the tokens per second for prompt processing, and the chart colors represent the batch size: 224, 256, or 512.
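A hedged sketch of that kind of sweep with llama-bench (model path and the exact value lists are placeholders; flag names and comma-separated list syntax may differ by build, so check `./llama-bench --help`):

```sh
# Sweep prompt length and batch size, skipping generation (-n 0) so the
# results reflect prompt processing only. -ngl 1 offloads to the Apple GPU.
./llama-bench -m ./models/70B/ggml-model-q6_K.bin \
  -p 350,700,1050,1400,1750 \
  -n 0 \
  -b 224,256,512 \
  -ngl 1
```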
Nice, super helpful! Currently the options I'm using are these: `--threads 8 --ctx-size 2048 --n-gpu-layers 1 -b 224`. I assume there's no other magic configuration I'm missing that'll speed things up; it's already super cool! I'm using server.cpp, and planning on implementing some kind of multi-prompt caching into the server at some point if I have the time. If there's any other way to optimize prompt processing for Metal, or in general, let me know! Thanks so much to @lshzh-ww for your work on speeding it up!
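For reference, a sketch of that setup as a server invocation, assuming the flags quoted above (model path, host, and port are placeholders):

```sh
# Serve a Metal-offloaded model with the settings discussed in this thread.
./server -m ./models/70B/ggml-model-q6_K.bin \
  --threads 8 \
  --ctx-size 2048 \
  --n-gpu-layers 1 \
  -b 224 \
  --host 127.0.0.1 --port 8080
```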
Yep, that's basically it. But, again, for you and any future visitors to this thread, this information may not apply to your setup and may be out of date on Monday. 😅 I'm incredibly appreciative to everyone who has made, and continues to make, amazing strides on llama.cpp and related projects. @lshzh-ww's work, for example, resulted in a 4x-10x speed increase for the 70B model. Amazing.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Expected Behavior
Prompt eval time is around twice as long as eval time (12 tokens/sec vs 22 tokens/sec). Is there a way to make them both the same speed?
Current Behavior
Prompt eval time takes twice as long as eval time.