Why Is vLLM Faster than llama.cpp on CPU? A Surprising Prompt Throughput Comparison (64 Cores, No GPU) #13849
-
It's a great observation. Did you try the -fa (flash attention) option of llama.cpp? FYI @ggerganov
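For reference, a minimal way to try it, reusing the llama-server invocation from the setup below (flag syntax may vary slightly by llama.cpp version):

# Same server run as in the original post, with flash attention enabled via -fa
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 -t 64 --ctx-size 10240 -fa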
-
How does the token generation speed compare?
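One way to get a clean generation-speed number on the llama.cpp side is llama-bench; a sketch, assuming the same model file and thread count as the original post:

# Reports prompt processing (pp) and token generation (tg) throughput separately
./build/bin/llama-bench -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf -t 64 -p 512 -n 128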
-
Really strange to me... What is your CPU? Can you benchmark with a BF16 model?
In my case, a Zen 4 laptop with 8 cores and 2 memory channels for a total of 64 GB, I get: (only for -t 8 ;) )
For vLLM, I am not sure that performance is reachable with FP32 compute on Zen 4, but maybe a Zen 5 can do it. For FP16 I have:
It may have been built without AVX512 / AVX512_BF16; on Zen 4/5, BF16 should be much faster than FP16. What does the server report? (A rough sketch of what I mean follows below.)
I know we can gain roughly 10-20% with a true BLIS matmul, and maybe a bit more on high-core-count CPUs with NUMA or L3 placement, but that matters mostly for dual-socket systems. And last:
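A sketch of the rebuild and BF16 conversion I have in mind; the CMake option names assume a recent llama.cpp/ggml tree (older versions used LLAMA_* instead of GGML_*), and the paths are just placeholders:

# Rebuild with explicit AVX-512 / BF16 support enabled
cmake -B build -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_BF16=ON
cmake --build build --config Release -j

# Convert the HF checkpoint to a BF16 GGUF and benchmark it against the F16 one
python convert_hf_to_gguf.py ../Llama-3.1-8B-Instruct --outtype bf16 --outfile ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-BF16.gguf
./build/bin/llama-bench -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-BF16.gguf -t 8 -p 512 -n 128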
-
Common setup: 64 CPU cores, no GPU, Llama-3.1-8B-Instruct in FP16, 10240-token context.
How to run:
llama.cpp:
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 -t 64 --ctx-size 10240
vLLM:
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype float16 --max-model-len 10240 --enable-chunked-prefill --host 0.0.0.0 --port 8050 --max-num-batched-tokens 256
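Both servers expose an OpenAI-compatible API, so one way to send the same single prompt to each for comparison is plain curl (the prompt text below is only a placeholder):

# llama.cpp server on port 8081
curl http://localhost:8081/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize the history of computing."}], "max_tokens": 128}'

# vLLM server on port 8050
curl http://localhost:8050/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize the history of computing."}], "max_tokens": 128}'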
Observations:
I’ve attached screenshots below showing the prompt throughput from both llama.cpp and vLLM for reference:

llama.cpp prompt throughput [screenshot]
vLLM prompt throughput [screenshot]

Discussion Points & Questions:
I’m a bit surprised by the results I’m seeing and would love some input from the community:
I always thought that llama.cpp, being designed for efficient CPU inference, would have the edge here, but vLLM is more than twice as fast in my tests.
I know vLLM is famous for its batching, but in this case, I’m only sending one prompt at a time.
I’m using 64 threads and otherwise default settings, so maybe I’m missing some optimizations (a few candidate flags are sketched after this list).
Any insights, suggestions, or even wild guesses would be really appreciated!
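For concreteness, here is a sketch of the llama-server options that are often suggested for CPU prompt throughput; flag availability depends on the llama.cpp version, and the values are only starting points to experiment with, not a known-good configuration:

# Flash attention, explicit batch/micro-batch sizes for prompt processing, and NUMA-aware placement
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 --ctx-size 10240 -t 64 -tb 64 -fa --numa distribute -b 2048 -ub 2048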
Thanks in advance!