
[User] Inference time GPU and CPU #1727

Closed
realcarlos opened this issue Jun 7, 2023 · 2 comments

Comments

@realcarlos

LLAMA_METAL=1 make -j && ./main -m ./models/guanaco-7B.ggmlv3.q4_0.bin -p "I love fish" --ignore-eos -n 1024 -ngl 1

llama_print_timings: load time = 7918.69 ms
llama_print_timings: sample time = 1013.54 ms / 1024 runs ( 0.99 ms per token)
llama_print_timings: prompt eval time = 14705.49 ms / 775 tokens ( 18.97 ms per token)
llama_print_timings: eval time = 46435.82 ms / 1020 runs ( 45.53 ms per token)
llama_print_timings: total time = 69981.58 ms

My question is: the eval time seems to be about the same as on the CPU. Is that normal?

MacBook Pro M1, 32 GB
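As a quick sanity check on the log above, the per-token figures llama.cpp reports are just the total times divided by the token counts. A minimal sketch reproducing them from the numbers posted in this issue:

```python
# Back-of-the-envelope throughput from the llama_print_timings output above.
# The raw numbers are copied from the log in this issue.
prompt_eval_ms, prompt_tokens = 14705.49, 775
eval_ms, eval_runs = 46435.82, 1020

prompt_ms_per_tok = prompt_eval_ms / prompt_tokens   # prompt processing
gen_ms_per_tok = eval_ms / eval_runs                 # token generation

print(f"prompt eval: {prompt_ms_per_tok:.2f} ms/token")
print(f"generation:  {gen_ms_per_tok:.2f} ms/token "
      f"({1000.0 / gen_ms_per_tok:.1f} tokens/s)")
```

This recovers the 18.97 ms and 45.53 ms per-token values printed in the log, i.e. roughly 22 tokens/s of generation.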

@ggerganov
Member

Yup, on M1 Pro I also get a similar time for 8-thread CPU compared to GPU: ~45 ms/tok.
My explanation is that the CPU and the GPU each get about 100 GB/s of the M1 Pro's total 200 GB/s memory bandwidth, so parity is expected on this machine.
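The bandwidth argument can be sanity-checked with a rough lower bound: generating one token streams essentially all of the model weights through memory once, so latency is at least model size divided by available bandwidth. The sizes here are assumptions, not measured values: a 7B q4_0 GGML model is roughly 3.8 GB, and ~100 GB/s is the bandwidth attributed above to either the CPU or the GPU alone.

```python
# Bandwidth-bound lower bound on per-token generation latency.
# Both figures below are assumptions for illustration:
model_bytes = 3.8e9   # approx. size of a 7B q4_0 GGML model
bandwidth = 100e9     # bytes/s reachable by CPU or GPU alone on M1 Pro

ms_per_token = model_bytes / bandwidth * 1000
print(f"bandwidth-bound lower bound: {ms_per_token:.0f} ms/token")
```

This gives ~38 ms/token, in the same ballpark as the ~45 ms/token measured for both backends, which is consistent with both the CPU and the GPU being memory-bandwidth limited rather than compute limited.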

@realcarlos
Author

> Yup, on M1 Pro I also get similar time for 8 thread CPU compared to GPU - ~45 ms / tok. My explanation is that the CPU and GPU share 100 GB/s bandwidth each from the total 200 GB/s of M1 Pro so parity is expected for this machine

Got it, sir!
