Why Is vLLM Faster than llama.cpp on CPU? A Surprising Prompt Throughput Comparison (64 Cores, No GPU) #13849
-
It's a great observation. Did you try the -fa (flash attention) option of llama.cpp? FYI @ggerganov
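For reference, a minimal way to try it, reusing the llama-server invocation from the setup below (flag syntax may vary slightly by llama.cpp version):

# Same server run as in the original post, with flash attention enabled via -fa
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 -t 64 --ctx-size 10240 -fa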
-
How does the token generation speed compare?
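One way to get a clean generation-speed number on the llama.cpp side is llama-bench; a sketch, assuming the same model file and thread count as the original post:

# Reports prompt processing (pp) and token generation (tg) throughput separately
./build/bin/llama-bench -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf -t 64 -p 512 -n 128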
-
Really strange to me... What is your CPU? Can you benchmark with a BF16 model?
In my case, a Zen 4 laptop with 8 cores and 2 memory channels for a total of 64 GB, I get: (only for -t 8 ;) )
For vLLM, I am not sure that performance is reachable with FP32 compute on Zen 4, but maybe a Zen 5 can do it. For FP16 I have:
It may have been built without AVX512 / AVX512_BF16; on Zen 4/5, BF16 should be much faster than FP16. What does the server report? (A rough sketch of what I mean follows below.)
I know we can gain roughly 10-20% with a true BLIS matmul, and maybe a bit more on high-core-count CPUs with NUMA or L3 placement, but that matters mostly for dual-socket systems. And last:
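A sketch of the rebuild and BF16 conversion I have in mind; the CMake option names assume a recent llama.cpp/ggml tree (older versions used LLAMA_* instead of GGML_*), and the paths are just placeholders:

# Rebuild with explicit AVX-512 / BF16 support enabled
cmake -B build -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_BF16=ON
cmake --build build --config Release -j

# Convert the HF checkpoint to a BF16 GGUF and benchmark it against the F16 one
python convert_hf_to_gguf.py ../Llama-3.1-8B-Instruct --outtype bf16 --outfile ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-BF16.gguf
./build/bin/llama-bench -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-BF16.gguf -t 8 -p 512 -n 128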
-
Common setup: 64 CPU cores, no GPU, Llama-3.1-8B-Instruct in FP16, 10240-token context.
How to run:
llama.cpp:
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 -t 64 --ctx-size 10240
vLLM:
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype float16 --max-model-len 10240 --enable-chunked-prefill --host 0.0.0.0 --port 8050 --max-num-batched-tokens 256
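Both servers expose an OpenAI-compatible API, so one way to send the same single prompt to each for comparison is plain curl (the prompt text below is only a placeholder):

# llama.cpp server on port 8081
curl http://localhost:8081/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize the history of computing."}], "max_tokens": 128}'

# vLLM server on port 8050
curl http://localhost:8050/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Summarize the history of computing."}], "max_tokens": 128}'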
Observations:
I’ve attached screenshots below showing the prompt throughput from both llama.cpp and vLLM for reference:

llama.cpp prompt throughput [screenshot]
vLLM prompt throughput [screenshot]

Discussion Points & Questions:
I’m a bit surprised by the results I’m seeing and would love some input from the community:
I always thought that llama.cpp, being designed for efficient CPU inference, would have the edge here, but vLLM is more than twice as fast in my tests.
I know vLLM is famous for its batching, but in this case, I’m only sending one prompt at a time.
I’m using 64 threads and otherwise default settings, so maybe I’m missing some optimizations (a few candidate flags are sketched after this list).
Any insights, suggestions, or even wild guesses would be really appreciated!
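For concreteness, here is a sketch of the llama-server options that are often suggested for CPU prompt throughput; flag availability depends on the llama.cpp version, and the values are only starting points to experiment with, not a known-good configuration:

# Flash attention, explicit batch/micro-batch sizes for prompt processing, and NUMA-aware placement
./build/bin/llama-server -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-F16.gguf --port 8081 --ctx-size 10240 -t 64 -tb 64 -fa --numa distribute -b 2048 -ub 2048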
Thanks in advance!