Performance benchmarks: Nekko vs vLLM vs Ollama #97
vidas
left a comment
Looking at the results (generated on my machine), there are a few obvious issues:
- CPU/RAM measurements don't work.
- Ollama TTFT is zero (which can't be true).
- There is a very strong prompt-caching effect (at least for Ollama and Nekko), as the first request is much faster.
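A TTFT of exactly zero usually suggests that the first-token timestamp was never taken, rather than that the server answered instantly. As a minimal sketch, assuming the client consumes a streamed response as an iterable of text chunks (`measure_stream` and `fake_stream` are illustrative names, not part of this PR), the timing could return `None` instead of `0.0` when no token was seen:

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float, int]:
    """Return (ttft_ms, total_ms, non_empty_chunk_count) for a streamed response."""
    start = clock()
    first_token_at = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first_token_at is None:
            first_token_at = clock()  # timestamp of the first real token
        count += 1
    end = clock()
    # None means "no token observed", which is different from a 0 ms TTFT.
    ttft_ms = None if first_token_at is None else (first_token_at - start) * 1000.0
    return ttft_ms, (end - start) * 1000.0, count


def fake_stream(delays_s: Iterable[float]) -> Iterator[str]:
    """Hypothetical stand-in for a server stream: one token after each delay."""
    for delay in delays_s:
        time.sleep(delay)
        yield "tok"


if __name__ == "__main__":
    print(measure_stream(fake_stream([0.02, 0.01, 0.01])))
```

With this shape, a summary cell can show `N/A` for an unmeasured TTFT instead of a misleading `0.0`.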
Summary:

| System Info | Value |
| --- | --- |
| CPU Cores | 16 |
| Total RAM (MB) | 15206 |
| OS Version | Linux 6.12.10-arch1-1 |
| Architecture | x86_64 |

| Metric | Scenario | Nekko | Ollama | Vllm |
| --- | --- | --- | --- | --- |
| Throughput (Tokens/sec) | High concurrency | 6.955930550172383 | 30.228510601085667 | 16.598440030770078 |
| Throughput (Tokens/sec) | Long prompt output | 11.422343648611303 | 26.13431460304411 | 28.17629839704017 |
| Throughput (Tokens/sec) | Medium prompt output | 10.98778052645897 | 25.855812022353348 | 24.225328046311127 |
| Throughput (Tokens/sec) | Short prompt output | 11.529355170327143 | 27.33676320612099 | 25.404445179090807 |
| Latency (Time to First Token - ms) | High concurrency | 456.61954302340746 | 0.0 | 161.29064094275236 |
| Latency (Time to First Token - ms) | Long prompt output | 98.13516697613522 | 0.0 | 59.875387989450246 |
| Latency (Time to First Token - ms) | Medium prompt output | 100.54681697511114 | 0.0 | 85.24371474049985 |
| Latency (Time to First Token - ms) | Short prompt output | 215.6578809954226 | 0.0 | 72.28620845125988 |
| Latency (Time to Complete Response - ms) | High concurrency | 1437.6221740385517 | 33.081352012231946 | 1385.6723859207705 |
| Latency (Time to Complete Response - ms) | Long prompt output | 1901.5681465389207 | 40.64613702939823 | 3235.065012704581 |
| Latency (Time to Complete Response - ms) | Medium prompt output | 2769.0693304757588 | 39.86884403275326 | 1788.4064214886166 |
| Latency (Time to Complete Response - ms) | Short prompt output | 1884.335668233689 | 3755.463357985718 | 945.0819717312697 |
| CPU Usage (%) | High concurrency | N/A | N/A | N/A |
| CPU Usage (%) | Long prompt output | N/A | N/A | N/A |
| CPU Usage (%) | Medium prompt output | N/A | N/A | N/A |
| CPU Usage (%) | Short prompt output | N/A | N/A | N/A |
| RAM Usage (GB) | High concurrency | N/A | N/A | N/A |
| RAM Usage (GB) | Long prompt output | N/A | N/A | N/A |
| RAM Usage (GB) | Medium prompt output | N/A | N/A | N/A |
| RAM Usage (GB) | Short prompt output | N/A | N/A | N/A |
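Cells like the zero TTFT and the `N/A` resource rows in the summary above could be flagged automatically before results are shared. A rough sketch, assuming the summary is available as CSV text with the same column names (the helper `find_suspect_cells` is hypothetical, and the sample holds only three rows copied from the results above):

```python
import csv
import io

# Three rows copied from the summary above; the real file has more.
SAMPLE = """Metric,Scenario,Nekko,Ollama,Vllm
Latency (Time to First Token - ms),High concurrency,456.61954302340746,0.0,161.29064094275236
Latency (Time to Complete Response - ms),High concurrency,1437.6221740385517,33.081352012231946,1385.6723859207705
CPU Usage (%),High concurrency,N/A,N/A,N/A
"""

ENGINES = ("Nekko", "Ollama", "Vllm")


def find_suspect_cells(csv_text):
    """Return (metric, scenario, engine, reason) for missing or non-positive cells."""
    suspects = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for engine in ENGINES:
            raw = row[engine]
            if raw == "N/A":
                suspects.append((row["Metric"], row["Scenario"], engine, "missing"))
            elif float(raw) <= 0.0:
                # Every metric in this summary should be strictly positive.
                suspects.append((row["Metric"], row["Scenario"], engine, "non-positive"))
    return suspects


if __name__ == "__main__":
    for cell in find_suspect_cells(SAMPLE):
        print(cell)
```

This only catches missing and non-positive values; implausibly small latencies would need a threshold chosen per metric.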
Potential hint to ollama problem:

Please attach full benchmarking results on your machine for comparison, as this may be architecture/config related.
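To make the prompt-caching effect mentioned above visible when comparing machines, one option is to report the first request separately from the rest instead of averaging them together. A minimal sketch, assuming per-request latencies are available as a list in request order (`split_cold_warm` is a hypothetical helper, and the sample numbers in the usage line are made up):

```python
import statistics


def split_cold_warm(latencies_ms):
    """Return (first_request_ms, median_of_remaining_ms or None)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    first = latencies_ms[0]
    rest = latencies_ms[1:]
    # With a single sample there is nothing to compare the first request to.
    rest_median = statistics.median(rest) if rest else None
    return first, rest_median


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    print(split_cold_warm([100.0, 40.0, 60.0, 50.0]))
```

A large gap between the two numbers would point at caching or model-load effects rather than steady-state performance.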
@@ -0,0 +1,19 @@
receivers:
Why do we need otel during benchmarking?
We need it to display the Nekko logs while the benchmarks run.
Hey @vidas, I missed a file name. Output with the updated version:
vidas
left a comment
There are still some issues with Ollama: the latency (both TTFT and time to complete response) is unrealistic. Otherwise pretty nice.
benchmarks/docker-compose.yml
Outdated
command: [
  # If you have more than 10 cores, it may request too much RAM. Uncomment it
  # "--max-batch-size", "1",
  # "--max-total-tokens", "8192",
tgi_api | 2025-02-19T15:25:19.154557Z INFO llamacpp: backends/llamacpp/src/backend.rs:216: llama_init_from_model: n_ctx = 2048
By default we get n_ctx = 2048, but the Nekko config contains "n_ctx": 8192.
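A context-size mismatch like this makes the engines run under different conditions, so it may be worth checking before a benchmark run. A small sketch, assuming the backend log line has the format quoted above and the Nekko config is JSON with a top-level `n_ctx` key (the config shape and both helper names are assumptions):

```python
import json
import re

LOG_LINE = (
    "2025-02-19T15:25:19.154557Z INFO llamacpp: backends/llamacpp/src/backend.rs:216: "
    "llama_init_from_model: n_ctx = 2048"
)
CONFIG_JSON = '{"n_ctx": 8192}'  # assumed minimal shape of the Nekko config


def n_ctx_from_log(line):
    """Extract the runtime context size from a log line, or None if absent."""
    match = re.search(r"n_ctx\s*=\s*(\d+)", line)
    return int(match.group(1)) if match else None


def context_sizes(log_line, config_text):
    """Return (runtime_n_ctx, configured_n_ctx); either may be None if not found."""
    configured = json.loads(config_text).get("n_ctx")
    return n_ctx_from_log(log_line), configured


if __name__ == "__main__":
    runtime, configured = context_sizes(LOG_LINE, CONFIG_JSON)
    if runtime != configured:
        print(f"n_ctx mismatch: runtime={runtime}, config={configured}")
```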
@vidas thanks for the review. I will check the Ollama numbers (TTFT and time to complete response); I believe the problem is related to some custom field names.
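For reference, the final object of an Ollama `/api/generate` response reports its timings in nanoseconds under the fields `total_duration`, `load_duration`, `prompt_eval_duration`, `eval_count` and `eval_duration`. Whether these are the field names in question here is an assumption, and using load plus prompt evaluation time as a stand-in for TTFT is only an approximation. A sketch of the unit conversion (the sample values are made up):

```python
NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000


def ollama_timings(final_chunk):
    """Convert Ollama's nanosecond timing fields into ms and tokens/sec."""
    load_ns = final_chunk.get("load_duration", 0)
    prompt_ns = final_chunk.get("prompt_eval_duration", 0)
    return {
        # Approximation: time spent before generation of the first token starts.
        "approx_ttft_ms": (load_ns + prompt_ns) / NS_PER_MS,
        "total_ms": final_chunk["total_duration"] / NS_PER_MS,
        "tokens_per_sec": final_chunk["eval_count"] / final_chunk["eval_duration"] * NS_PER_S,
    }


if __name__ == "__main__":
    # Illustrative values only, not measured results.
    sample = {
        "total_duration": 4_000_000_000,
        "load_duration": 50_000_000,
        "prompt_eval_duration": 150_000_000,
        "eval_count": 100,
        "eval_duration": 3_800_000_000,
    }
    print(ollama_timings(sample))
```

Measuring TTFT on the client side from the stream is still preferable for a like-for-like comparison, since the other engines do not report these fields.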
…-test Performance benchmarks: Nekko vs vllM vs oLLAMA
Summary:
This PR adds a new performance testing framework for LLMs, including configurations and dependencies to benchmark different LLM APIs.
Changes Made:
- .gitignore: Excludes performance test artifacts (benchmarks/results/*, benchmarks/models, benchmarks/build).
- Dependencies: Integrates ollama, vllm and llmperf for benchmarking.
How to Test:
- Navigate to the benchmarks/ directory.
- Run make.
- Check the results/ directory.
Additional Notes: