
Performance benchmarks: Nekko vs vLLM vs Ollama #97

Merged
AntonStasheuski merged 7 commits into main from issue-92-vllm-vs-nekko-perf-test on Feb 21, 2025

Conversation

@AntonStasheuski (Contributor) commented Feb 11, 2025

Summary:
This PR adds a new performance testing framework for LLMs, including configurations and dependencies to benchmark different LLM APIs.

Changes Made:

  • .gitignore:
    Excludes performance test artifacts (benchmarks/results/*, benchmarks/models, benchmarks/build).

  • Dependencies:
    Integrates ollama, vllm and llmperf for benchmarking.

How to Test:

  1. Navigate to the benchmarks/ directory.
  2. Run tests using make.
  3. Review results in the results/ directory.
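The results are CSV summaries like the ones posted later in this thread. A minimal sketch for slicing them programmatically (the column layout is taken from those summaries; the sample rows reuse rounded values from the first posted run):

```python
import csv
import io

# Column layout as in the summary CSVs posted in this PR; the sample
# rows are rounded values from the first posted run.
SAMPLE = """\
Metric,Scenario,Nekko,Ollama,Vllm
Throughput (Tokens/sec),High concurrency,6.96,30.23,16.60
Throughput (Tokens/sec),Short prompt output,11.53,27.34,25.40
"""

def throughput_by_scenario(csv_text, engine="Nekko"):
    """Map scenario name -> throughput (tokens/sec) for one engine."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Metric"].startswith("Throughput"):
            out[row["Scenario"]] = float(row[engine])
    return out
```
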

Additional Notes:

  • Supports high concurrency and multiple scenarios with retry logic.
  • Captures both performance and system metrics for each run.
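The retry logic mentioned above could look roughly like this (a sketch with assumed attempt counts and delays, not the PR's actual implementation):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn, retrying failed calls with exponential backoff.

    attempts and base_delay are illustrative defaults, not values
    from this PR's benchmark config.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```
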

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from 07ac258 to 95529d2 on February 11, 2025 14:32
AntonStasheuski force-pushed the branch 2 times, most recently from 8acefd2 to a9922d1 on February 14, 2025 21:02
AntonStasheuski force-pushed the branch from a9922d1 to 9e832a0 on February 14, 2025 21:03
AntonStasheuski changed the title from "Performance tests: Nekko vs vLLM" to "Performance tests: Nekko vs vLLM vs Ollama" on Feb 14, 2025
AntonStasheuski changed the title from "Performance tests: Nekko vs vLLM vs Ollama" to "Performance benchmarks: Nekko vs vLLM vs Ollama" on Feb 14, 2025
AntonStasheuski added the "enhancement" (New feature or request) label on Feb 14, 2025
This was linked to issues on Feb 14, 2025
@vidas (Member) left a comment

When looking at the results (generated on my machine), there are a few obvious issues:

  • CPU/RAM measurements don't work
  • ollama TTFT is zero (can't be true)
  • there is a very strong prompt-caching effect (at least for ollama and nekko), as the first request is much slower than the rest.

Summary:

| System Info | Value |
| --- | --- |
| CPU Cores | 16 |
| Total RAM (MB) | 15206 |
| OS Version | Linux 6.12.10-arch1-1 |
| Architecture | x86_64 |

| Metric | Scenario | Nekko | Ollama | vLLM |
| --- | --- | --- | --- | --- |
| Throughput (Tokens/sec) | High concurrency | 6.96 | 30.23 | 16.60 |
| Throughput (Tokens/sec) | Long prompt output | 11.42 | 26.13 | 28.18 |
| Throughput (Tokens/sec) | Medium prompt output | 10.99 | 25.86 | 24.23 |
| Throughput (Tokens/sec) | Short prompt output | 11.53 | 27.34 | 25.40 |
| Time to First Token (ms) | High concurrency | 456.62 | 0.0 | 161.29 |
| Time to First Token (ms) | Long prompt output | 98.14 | 0.0 | 59.88 |
| Time to First Token (ms) | Medium prompt output | 100.55 | 0.0 | 85.24 |
| Time to First Token (ms) | Short prompt output | 215.66 | 0.0 | 72.29 |
| Time to Complete Response (ms) | High concurrency | 1437.62 | 33.08 | 1385.67 |
| Time to Complete Response (ms) | Long prompt output | 1901.57 | 40.65 | 3235.07 |
| Time to Complete Response (ms) | Medium prompt output | 2769.07 | 39.87 | 1788.41 |
| Time to Complete Response (ms) | Short prompt output | 1884.34 | 3755.46 | 945.08 |
| CPU Usage (%) | all scenarios | N/A | N/A | N/A |
| RAM Usage (GB) | all scenarios | N/A | N/A | N/A |

Potential hint to the ollama problem:

[screenshot attached in the PR]

Please attach full benchmarking results on your machine for comparison, as this may be architecture/config related.

@@ -0,0 +1,19 @@
receivers:
Member:

Why do we need otel during benchmarking?

Contributor Author:
We need this to display nekko logs when we run the benchmarks.
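For context, a collector config of roughly this shape receives OTLP logs from the container and prints them to stdout (a sketch, not the file added in this PR; the endpoint and exporter choice are assumptions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [debug]
```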

@AntonStasheuski (Contributor Author) commented:

Hey @vidas, I missed a file name; here is the output with the updated version:

| System Info | Value |
| --- | --- |
| CPU Cores | 8 |
| Total RAM (MB) | 31955 |
| OS Version | Linux 6.8.0-52-generic |
| Architecture | x86_64 |

| Metric | Scenario | Nekko | Ollama | vLLM |
| --- | --- | --- | --- | --- |
| Throughput (Tokens/sec) | High concurrency | 14.06 | 57.82 | 42.91 |
| Throughput (Tokens/sec) | Long prompt output | 13.76 | 77.83 | 58.56 |
| Throughput (Tokens/sec) | Medium prompt output | 13.71 | 82.39 | 61.81 |
| Throughput (Tokens/sec) | Short prompt output | 12.64 | 61.96 | 59.10 |
| Time to First Token (ms) | High concurrency | 94.22 | 0.0 | 46.67 |
| Time to First Token (ms) | Long prompt output | 77.07 | 0.0 | 30.43 |
| Time to First Token (ms) | Medium prompt output | 77.76 | 0.0 | 28.09 |
| Time to First Token (ms) | Short prompt output | 237.42 | 0.0 | 36.42 |
| Time to Complete Response (ms) | High concurrency | 1493.59 | 17.30 | 536.06 |
| Time to Complete Response (ms) | Long prompt output | 1879.95 | 12.94 | 1389.51 |
| Time to Complete Response (ms) | Medium prompt output | 2455.74 | 12.14 | 897.00 |
| Time to Complete Response (ms) | Short prompt output | 1452.32 | 394.03 | 398.35 |
| CPU Usage (%) | High concurrency | 48.34 | 0.69 | 15.12 |
| CPU Usage (%) | Long prompt output | 50.40 | 0.67 | 50.60 |
| CPU Usage (%) | Medium prompt output | 50.56 | 2.73 | 50.65 |
| CPU Usage (%) | Short prompt output | 51.25 | 2.69 | 48.04 |
| RAM Usage (GB) | High concurrency | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Long prompt output | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Medium prompt output | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Short prompt output | 0.44 | 2.64 | 9.41 |

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from 801838a to b169191 on February 18, 2025 18:47
AntonStasheuski force-pushed the branch from b169191 to 5435b83 on February 18, 2025 19:25
@vidas (Member) left a comment

Still some issues with ollama: latency (both TTFT and time to complete response) is unrealistically low. Otherwise pretty nice.

command: [
# If you have more than 10 cores, it may request too much RAM. Uncomment it
# "--max-batch-size", "1",
# "--max-total-tokens", "8192",
Member:
This is never needed.

Contributor Author:
tgi_api | 2025-02-19T15:25:19.154557Z INFO llamacpp: backends/llamacpp/src/backend.rs:216: llama_init_from_model: n_ctx = 2048

By default n_ctx is 2048, but the nekko config contains "n_ctx": 8192.


@AntonStasheuski (Contributor Author) commented:

@vidas thanks for the review. I will check the ollama latencies (TTFT and time to complete response); I believe the issue is related to some custom field names.
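One way to rule out a field-name mix-up is to measure TTFT client-side while consuming the stream, instead of trusting server-reported fields. A sketch (the function and parameter names are mine, not the benchmark harness's):

```python
import time

def stream_latencies(chunks, clock=time.perf_counter):
    """Consume a stream of response chunks and return
    (ttft_ms, total_ms), both measured on the client.

    If TTFT still comes out near 0.0 measured this way, the server
    really is responding instantly; otherwise the harness was
    reading the wrong response field.
    """
    start = clock()
    ttft = None
    for chunk in chunks:
        if ttft is None and chunk:  # first non-empty chunk marks TTFT
            ttft = (clock() - start) * 1000.0
    total = (clock() - start) * 1000.0
    return ttft, total
```

Injecting the clock keeps the timing logic deterministic and unit-testable.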

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from d08f0cc to de53f24 on February 19, 2025 16:59
AntonStasheuski merged commit 44c1c2b into main on Feb 21, 2025
1 check passed
akxcv pushed a commit that referenced this pull request Feb 24, 2025
…-test

Performance benchmarks: Nekko vs vLLM vs Ollama

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues:

  • Nekko vs Ollama
  • Nekko vs vLLM
  • Nekko vs TGI

2 participants