
[Prompt Processing] Is there a way to speed up prompt processing for Metal? (M1/M2) #2428

Closed · Jchang4 opened this issue Jul 27, 2023 · 16 comments

Jchang4 commented Jul 27, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Prompt eval takes around twice as long per token as eval (12 tokens/sec vs. 22 tokens/sec). Is there a way to make them both the same speed?

Current Behavior

Prompt eval time takes twice as long as eval time.
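
For reference, a minimal sketch of one way to compare the two rates with the llama.cpp CLI (model path and prompt file below are placeholders); the timing summary printed at exit reports the prompt eval and eval speeds on separate lines:

```sh
# Sketch: run with Metal offload and a long prompt, then compare the
# "prompt eval time" and "eval time" lines in the timing summary.
./main \
  -m ./models/7B/ggml-model-q4_0.bin \
  -ngl 1 \
  -p "$(cat long_prompt.txt)" \
  -n 128
```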

ProjectAtlantis-dev commented Jul 29, 2023

I've noticed this myself. I looked at Karpathy's llama2 implementation and noticed that llama.cpp is not using the usual SentencePiece lookup from HF, so tokenization is much slower. Longer prompts are almost unusable.

However, I seem to have a bad link to ggml-vocab.bin, so maybe that is my problem.

Not sure if this discussion is still relevant: #252

daboe01 (Contributor) commented Jul 29, 2023

This is most likely not an issue with the tokenizer.
I think it is because a GEMM kernel that works directly on quantized data, which would be needed for that, is currently missing.

@netrunnereve (Collaborator)

Have you tried CLBlast? It seems OpenCL support on M1/M2 is hit or miss but on supported GPU platforms it helps a lot with prompt processing speed.
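
If you want to try it, the CLBlast build looks roughly like this (a sketch; the exact options may differ by version, so check the README of your checkout):

```sh
# Build llama.cpp with CLBlast support (sketch; options may vary by version).
make clean
make LLAMA_CLBLAST=1

# or with CMake:
# cmake -B build -DLLAMA_CLBLAST=ON
# cmake --build build --config Release
```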

Jchang4 (Author) commented Jul 29, 2023

> Have you tried CLBlast? It seems OpenCL support on M1/M2 is hit or miss but on supported GPU platforms it helps a lot with prompt processing speed.

Is it possible to use CLBlast just for prompt processing?

@netrunnereve (Collaborator)

Currently I don't think it's possible (if you use CLBlast you have to use OpenCL or CPU for inference). However, if you get great results with CLBlast processing + Metal inference, someone will likely put in a PR for it 😁

colinc commented Aug 2, 2023

@Jchang4 are you using a batch size of 1?

Jchang4 (Author) commented Aug 2, 2023

> @Jchang4 are you using a batch size of 1?

I think batch size 8 is the default for llama-cpp-python. Let me try with 1.

edit:

prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)

Is this normal?

colinc commented Aug 2, 2023

Yeah, it's a bug, but it means it's using Metal (GPU) prompt evaluation. If you want to see your tokens per second, just add "-n 1" (limit the number of generated tokens to 1). The eval time will then show you your "ms per token" / "tokens per second" for comparison against CPU.
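
For example (model path and prompt file are placeholders):

```sh
# Generate only one token so the run is dominated by prompt processing,
# then compare the reported per-token timings against a CPU-only run.
./main -m ./models/7B/ggml-model-q4_0.bin -ngl 1 -n 1 -p "$(cat long_prompt.txt)"
```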

Jchang4 (Author) commented Aug 2, 2023

Huge improvement, thanks @colinc!

colinc commented Aug 17, 2023

Just wanted to circle back to this. The recent commit (bf83bff) by @lshzh-ww completely invalidates the advice to use a batch size of 1.

I'm still running performance tuning to understand the full impact; however, it's a significant increase. Prompt length does seem to impact overall TPS.

@michaeljelly

Hey @colinc, curious what impact you're expecting here, and what your intuition would be on what the new best batch size might be? @lshzh-ww, you may have some opinions and insight here too, having just done the work!

@lshzh-ww (Contributor)

The default setting of n_batch=512 should be good, or you can adjust it to any value that is divisible by 32.
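
For example, to compare a few multiples of 32 (model path and prompt file are placeholders):

```sh
# Compare prompt-processing speed across a few batch sizes, all divisible by 32.
MODEL=./models/7B/ggml-model-q4_0.bin   # placeholder path
for B in 32 64 256 512; do
  echo "n_batch = $B"
  ./main -m "$MODEL" -ngl 1 -b "$B" -n 1 -p "$(cat long_prompt.txt)"
done
```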

colinc commented Aug 19, 2023

@michaeljelly While I agree with @lshzh-ww that the 512 default is fine, we've found that for our specific use case and setup, a batch size of 224 has a performance increase of ~4.5% over a batch size of 512. But this may be super specific to:

  • M[x] Ultra - 800 GB/s memory bandwidth
  • Highly variable prompt lengths of 750-4000 tokens

so you may have to run your own performance tuning.

The llama-bench utility that was recently added is extremely helpful.
The PerformanceTuning.ipynb notebook in the llama-cpp-python project is also a great starting point (you'll likely want to modify that to support variable prompt sizes, and ignore the rest of the parameters in the example).
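
For example, a batch-size sweep with llama-bench might look roughly like this (flag names are from memory and may differ between versions; check ./llama-bench --help):

```sh
# Benchmark prompt processing for several prompt lengths and batch sizes.
# -p = prompt tokens, -n = generated tokens, -b = batch size (comma-separated lists).
./llama-bench \
  -m ./models/70b/ggml-model-q6_K.bin \
  -ngl 1 \
  -p 512,1024,2048 \
  -n 0 \
  -b 224,256,512
```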

I ran a quick test with prompt lengths varying from ~350 to ~1750 and batch sizes of 224, 256, and 512, so you can see the tokens per second of a 70B Llama 2 model @ q6_K quantization on an M2 Ultra:

[chart: prompt-processing tokens per second vs. number of prompt tokens, for batch sizes 224, 256, and 512]

So you can see that 224 & 256 beat out 512 marginally and, for the majority of the time, 224 slightly outperforms 256.

Hope that helps.

edit: I should have put a legend on that chart. For clarity: the x-axis is the number of prompt tokens; the y-axis is tokens per second for prompt processing; the chart colors represent the batch size: 224, 256, or 512.

@michaeljelly

Nice, super helpful! Currently the options I'm using are these: --threads 8 --ctx-size 2048 --n-gpu-layers 1 -b 224
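
Spelled out as a full server.cpp launch, that would be roughly the following (model path is a placeholder; flag spellings may vary between builds):

```sh
# Same settings as above, passed to the llama.cpp server binary.
./server -m ./models/7B/ggml-model-q4_K_M.bin \
  --threads 8 --ctx-size 2048 --n-gpu-layers 1 -b 224 \
  --host 127.0.0.1 --port 8080
```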

I assume there's no other magic configuration I'm missing that'll speed things up; it's already super cool! I'm using server.cpp, and planning on implementing some kind of multi-prompt caching into the server at some point if I have the time.

If there's any other way to optimize the prompt processing for Metal/in general, let me know! Thanks so much to @lshzh-ww for your work on speeding it up!

colinc commented Aug 19, 2023

Yep, that's basically it. But, again, for you and any future visitors to this thread, this information may not apply to your setup and may be out of date on Monday. 😅 I'm incredibly appreciative of everyone who has made, and continues to make, amazing strides on this and related projects. @lshzh-ww's work, for example, resulted in a 4x-10x speed increase for the 70B model. Amazing.

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024