
[Prompt Processing] Is there a way to speed up prompt processing for Metal? (M1/M2) #2428

Closed · Jchang4 opened this issue Jul 27, 2023 · 16 comments

Jchang4 commented Jul 27, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Prompt eval takes around twice as long per token as eval (12 tokens/sec vs. 22 tokens/sec). Is there a way to make them both the same speed?

Current Behavior

Prompt eval time takes twice as long as eval time.
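
For reference, a minimal sketch of one way to compare the two rates with the llama.cpp CLI (model path and prompt file below are placeholders); the timing summary printed at exit reports the prompt eval and eval speeds on separate lines:

```sh
# Sketch: run with Metal offload and a long prompt, then compare the
# "prompt eval time" and "eval time" lines in the timing summary.
./main \
  -m ./models/7B/ggml-model-q4_0.bin \
  -ngl 1 \
  -p "$(cat long_prompt.txt)" \
  -n 128
```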

ProjectAtlantis-dev commented Jul 29, 2023

I've noticed this myself. I looked at Karpathy's llama2 implementation and noticed that llama.cpp is not using the usual SentencePiece lookup from HF, so tokenization is much slower. Longer prompts are almost unusable.

However, I seem to have a bad link to ggml-vocab.bin, so maybe that is my problem.

Not sure if this discussion is still relevant: #252

daboe01 (Contributor) commented Jul 29, 2023

This is most likely not an issue with the tokenizer.
I think it is because a GEMM kernel that works directly on quantized data, which would be needed for that, is currently missing.

@netrunnereve (Collaborator)

Have you tried CLBlast? It seems OpenCL support on M1/M2 is hit or miss but on supported GPU platforms it helps a lot with prompt processing speed.
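
If you want to try it, the CLBlast build looks roughly like this (a sketch; the exact options may differ by version, so check the README of your checkout):

```sh
# Build llama.cpp with CLBlast support (sketch; options may vary by version).
make clean
make LLAMA_CLBLAST=1

# or with CMake:
# cmake -B build -DLLAMA_CLBLAST=ON
# cmake --build build --config Release
```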

Jchang4 (Author) commented Jul 29, 2023

> Have you tried CLBlast? It seems OpenCL support on M1/M2 is hit or miss but on supported GPU platforms it helps a lot with prompt processing speed.

Is it possible to use CLBlast just for prompt processing?

@netrunnereve (Collaborator)

Currently I don't think it's possible (if you use CLBlast you have to use OpenCL or CPU for inference). However, if you get great results with CLBlast processing + Metal inference, someone will likely put in a PR for it 😁

colinc commented Aug 2, 2023

@Jchang4 are you using a batch size of 1?

Jchang4 (Author) commented Aug 2, 2023

> @Jchang4 are you using a batch size of 1?

I think batch size 8 is the default for llama-cpp-python. Let me try with 1.

edit:

prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)

Is this normal?

colinc commented Aug 2, 2023

Yeah, it's a bug, but it means it's using Metal (GPU) prompt evaluation. If you want to see your tokens per second, just add "-n 1" (limit the number of generated tokens to 1). The eval time will then show you your "ms per token" / "tokens per second" for comparison against CPU.
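
For example (model path and prompt file are placeholders):

```sh
# Generate only one token so the run is dominated by prompt processing,
# then compare the reported per-token timings against a CPU-only run.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -n 1 -p "$(cat long_prompt.txt)"
```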

Jchang4 (Author) commented Aug 2, 2023

Huge improvement, thanks @colinc!

colinc commented Aug 17, 2023

Just wanted to circle back to this. The recent commit (bf83bff) by @lshzh-ww completely invalidates the advice to use a batch size of 1.

I'm still running performance tuning to understand the full impact; however, it's a significant increase. Prompt length does seem to impact overall TPS.

@michaeljelly

Hey @colinc, curious what impact you're expecting here, and what your intuition would be on what the new best batch size might be? @lshzh-ww, you may have some opinions and insight here too, having just done the work!

@lshzh-ww (Contributor)

The default setting of n_batch=512 should be good, or you can adjust it to any value that is divisible by 32.
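
For example, to compare a few multiples of 32 (model path and prompt file are placeholders):

```sh
# Compare prompt-processing speed across a few batch sizes, all divisible by 32.
MODEL=./models/7B/ggml-model-q4_0.bin   # placeholder path
for B in 32 64 256 512; do
  echo "n_batch = $B"
  ./main -m "$MODEL" -ngl 1 -b "$B" -n 1 -p "$(cat long_prompt.txt)"
done
```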

colinc commented Aug 19, 2023

@michaeljelly While I agree with @lshzh-ww that the 512 default is fine, we've found that for our specific use case and setup, a batch size of 224 has a performance increase of ~4.5% over a batch size of 512. But this may be super specific to:

  • M[x] Ultra - 800 GB/s memory bandwidth
  • Highly variable prompt lengths of 750-4000 tokens

so you may have to run your own performance tuning.

The llama-bench utility that was recently added is extremely helpful.
The PerformanceTuning.ipynb notebook in the llama-cpp-python project is also a great starting point (you'll likely want to modify that to support variable prompt sizes, and ignore the rest of the parameters in the example).
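
For example, a batch-size sweep with llama-bench might look roughly like this (flag names are from memory and may differ between versions; check ./llama-bench --help):

```sh
# Benchmark prompt processing for several prompt lengths and batch sizes.
# -p = prompt tokens, -n = generated tokens, -b = batch size (comma-separated lists).
./llama-bench \
  -m ./models/70b/ggml-model-q6_K.bin \
  -ngl 1 \
  -p 512,1024,2048 \
  -n 0 \
  -b 224,256,512
```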

I ran a quick test with prompt lengths varying from ~350 to ~1750 and batch sizes of 224, 256, and 512, so you can see the tokens per second of a 70B Llama 2 model @ q6_K quantization on an M2 Ultra:

[chart: prompt-processing tokens per second vs. number of prompt tokens, for batch sizes 224, 256, and 512]

So you can see that 224 & 256 beat out 512 marginally and, for the majority of the time, 224 slightly outperforms 256.

Hope that helps.

edit: I should have put a legend on that chart. For clarity: the x-axis is the number of prompt tokens; the y-axis is tokens per second for prompt processing; the chart colors represent the batch size: 224, 256, or 512.

@michaeljelly

Nice, super helpful! Currently the options I'm using are these: --threads 8 --ctx-size 2048 --n-gpu-layers 1 -b 224
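
Spelled out as a full server.cpp launch, that would be roughly the following (model path is a placeholder; flag spellings may vary between builds):

```sh
# Same settings as above, passed to the llama.cpp server binary.
./server -m ./models/7B/ggml-model-q4_K_M.bin \
  --threads 8 --ctx-size 2048 --n-gpu-layers 1 -b 224 \
  --host 127.0.0.1 --port 8080
```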

I assume there's no other magic configuration I'm missing that'll speed things up; it's already super cool! I'm using server.cpp, and planning on implementing some kind of multi-prompt caching into the server at some point if I have the time.

If there's any other way to optimize the prompt processing for Metal/in general, let me know! Thanks so much to @lshzh-ww for your work on speeding it up!

colinc commented Aug 19, 2023

Yep, that's basically it. But, again, for you and any future visitors to this thread, this information may not apply to your setup and may be out of date on Monday. 😅 I'm incredibly appreciative of everyone who has made, and continues to make, amazing strides on this and related projects. @lshzh-ww's work, for example, resulted in a 4x-10x speed increase for the 70B model. Amazing.

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024