
Performance benchmarks: Nekko vs vLLM vs Ollama #97

Merged
AntonStasheuski merged 7 commits into main from issue-92-vllm-vs-nekko-perf-test on Feb 21, 2025

Conversation

@AntonStasheuski (Contributor) commented Feb 11, 2025

Summary:
This PR adds a new performance testing framework for LLMs, including configurations and dependencies to benchmark different LLM APIs.

Changes Made:

  • .gitignore:
    Excludes performance test artifacts (benchmarks/results/*, benchmarks/models, benchmarks/build).

  • Dependencies:
    Integrates ollama, vllm and llmperf for benchmarking.

How to Test:

  1. Navigate to the benchmarks/ directory.
  2. Run tests using make.
  3. Review results in the results/ directory.
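The results are CSV summaries like the ones posted later in this thread. A minimal sketch for slicing them programmatically (the column layout is taken from those summaries; the sample rows reuse rounded values from the first posted run):

```python
import csv
import io

# Column layout as in the summary CSVs posted in this PR; the sample
# rows are rounded values from the first posted run.
SAMPLE = """\
Metric,Scenario,Nekko,Ollama,Vllm
Throughput (Tokens/sec),High concurrency,6.96,30.23,16.60
Throughput (Tokens/sec),Short prompt output,11.53,27.34,25.40
"""

def throughput_by_scenario(csv_text, engine="Nekko"):
    """Map scenario name -> throughput (tokens/sec) for one engine."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Metric"].startswith("Throughput"):
            out[row["Scenario"]] = float(row[engine])
    return out
```
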

Additional Notes:

  • Supports high concurrency and multiple scenarios with retry logic.
  • Captures both performance and system metrics for each run.
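The retry logic mentioned above could look roughly like this (a sketch with assumed attempt counts and delays, not the PR's actual implementation):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn, retrying failed calls with exponential backoff.

    attempts and base_delay are illustrative defaults, not values
    from this PR's benchmark config.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```
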

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from 07ac258 to 95529d2 on February 11, 2025 14:32
AntonStasheuski force-pushed the branch 2 times, most recently from 8acefd2 to a9922d1 on February 14, 2025 21:02
AntonStasheuski force-pushed the branch from a9922d1 to 9e832a0 on February 14, 2025 21:03
AntonStasheuski changed the title from "Performance tests: Nekko vs vLLM" to "Performance tests: Nekko vs vLLM vs Ollama" on Feb 14, 2025
AntonStasheuski changed the title from "Performance tests: Nekko vs vLLM vs Ollama" to "Performance benchmarks: Nekko vs vLLM vs Ollama" on Feb 14, 2025
AntonStasheuski added the "enhancement" (New feature or request) label on Feb 14, 2025
This was linked to issues on Feb 14, 2025
@vidas (Member) left a comment

When looking at the results (generated on my machine), there are a few obvious issues:

  • CPU/RAM measurements don't work
  • ollama TTFT is zero (can't be true)
  • there is a very strong prompt-caching effect (at least for ollama and nekko), as the first request is much slower than the rest.

Summary:

| System Info | Value |
| --- | --- |
| CPU Cores | 16 |
| Total RAM (MB) | 15206 |
| OS Version | Linux 6.12.10-arch1-1 |
| Architecture | x86_64 |

| Metric | Scenario | Nekko | Ollama | vLLM |
| --- | --- | --- | --- | --- |
| Throughput (Tokens/sec) | High concurrency | 6.96 | 30.23 | 16.60 |
| Throughput (Tokens/sec) | Long prompt output | 11.42 | 26.13 | 28.18 |
| Throughput (Tokens/sec) | Medium prompt output | 10.99 | 25.86 | 24.23 |
| Throughput (Tokens/sec) | Short prompt output | 11.53 | 27.34 | 25.40 |
| Time to First Token (ms) | High concurrency | 456.62 | 0.0 | 161.29 |
| Time to First Token (ms) | Long prompt output | 98.14 | 0.0 | 59.88 |
| Time to First Token (ms) | Medium prompt output | 100.55 | 0.0 | 85.24 |
| Time to First Token (ms) | Short prompt output | 215.66 | 0.0 | 72.29 |
| Time to Complete Response (ms) | High concurrency | 1437.62 | 33.08 | 1385.67 |
| Time to Complete Response (ms) | Long prompt output | 1901.57 | 40.65 | 3235.07 |
| Time to Complete Response (ms) | Medium prompt output | 2769.07 | 39.87 | 1788.41 |
| Time to Complete Response (ms) | Short prompt output | 1884.34 | 3755.46 | 945.08 |
| CPU Usage (%) | all scenarios | N/A | N/A | N/A |
| RAM Usage (GB) | all scenarios | N/A | N/A | N/A |

Potential hint to the ollama problem:

[screenshot attached in the PR]

Please attach full benchmarking results on your machine for comparison, as this may be architecture/config related.

@@ -0,0 +1,19 @@
receivers:
Member:

Why do we need otel during benchmarking?

Contributor Author:
We need this to display nekko logs when we run the benchmarks.
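For context, a collector config of roughly this shape receives OTLP logs from the container and prints them to stdout (a sketch, not the file added in this PR; the endpoint and exporter choice are assumptions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [debug]
```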

@AntonStasheuski (Contributor Author) commented:

Hey @vidas, I missed a file name; here is the output with the updated version:

| System Info | Value |
| --- | --- |
| CPU Cores | 8 |
| Total RAM (MB) | 31955 |
| OS Version | Linux 6.8.0-52-generic |
| Architecture | x86_64 |

| Metric | Scenario | Nekko | Ollama | vLLM |
| --- | --- | --- | --- | --- |
| Throughput (Tokens/sec) | High concurrency | 14.06 | 57.82 | 42.91 |
| Throughput (Tokens/sec) | Long prompt output | 13.76 | 77.83 | 58.56 |
| Throughput (Tokens/sec) | Medium prompt output | 13.71 | 82.39 | 61.81 |
| Throughput (Tokens/sec) | Short prompt output | 12.64 | 61.96 | 59.10 |
| Time to First Token (ms) | High concurrency | 94.22 | 0.0 | 46.67 |
| Time to First Token (ms) | Long prompt output | 77.07 | 0.0 | 30.43 |
| Time to First Token (ms) | Medium prompt output | 77.76 | 0.0 | 28.09 |
| Time to First Token (ms) | Short prompt output | 237.42 | 0.0 | 36.42 |
| Time to Complete Response (ms) | High concurrency | 1493.59 | 17.30 | 536.06 |
| Time to Complete Response (ms) | Long prompt output | 1879.95 | 12.94 | 1389.51 |
| Time to Complete Response (ms) | Medium prompt output | 2455.74 | 12.14 | 897.00 |
| Time to Complete Response (ms) | Short prompt output | 1452.32 | 394.03 | 398.35 |
| CPU Usage (%) | High concurrency | 48.34 | 0.69 | 15.12 |
| CPU Usage (%) | Long prompt output | 50.40 | 0.67 | 50.60 |
| CPU Usage (%) | Medium prompt output | 50.56 | 2.73 | 50.65 |
| CPU Usage (%) | Short prompt output | 51.25 | 2.69 | 48.04 |
| RAM Usage (GB) | High concurrency | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Long prompt output | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Medium prompt output | 0.45 | 2.64 | 9.41 |
| RAM Usage (GB) | Short prompt output | 0.44 | 2.64 | 9.41 |

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from 801838a to b169191 on February 18, 2025 18:47
AntonStasheuski force-pushed the branch from b169191 to 5435b83 on February 18, 2025 19:25
@vidas (Member) left a comment

Still some issues with ollama: latency (both TTFT and time to complete response) is unrealistically low. Otherwise pretty nice.

command: [
# If you have more than 10 cores, it may request too much RAM. Uncomment it
# "--max-batch-size", "1",
# "--max-total-tokens", "8192",
Member:
This is never needed.

Contributor Author:
tgi_api | 2025-02-19T15:25:19.154557Z INFO llamacpp: backends/llamacpp/src/backend.rs:216: llama_init_from_model: n_ctx = 2048

By default n_ctx is 2048, but the nekko config contains "n_ctx": 8192.


@AntonStasheuski (Contributor Author) commented:

@vidas thanks for the review. I will check the ollama latencies (TTFT and time to complete response); I believe the issue is related to some custom field names.
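One way to rule out a field-name mix-up is to measure TTFT client-side while consuming the stream, instead of trusting server-reported fields. A sketch (the function and parameter names are mine, not the benchmark harness's):

```python
import time

def stream_latencies(chunks, clock=time.perf_counter):
    """Consume a stream of response chunks and return
    (ttft_ms, total_ms), both measured on the client.

    If TTFT still comes out near 0.0 measured this way, the server
    really is responding instantly; otherwise the harness was
    reading the wrong response field.
    """
    start = clock()
    ttft = None
    for chunk in chunks:
        if ttft is None and chunk:  # first non-empty chunk marks TTFT
            ttft = (clock() - start) * 1000.0
    total = (clock() - start) * 1000.0
    return ttft, total
```

Injecting the clock keeps the timing logic deterministic and unit-testable.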

AntonStasheuski force-pushed the issue-92-vllm-vs-nekko-perf-test branch from d08f0cc to de53f24 on February 19, 2025 16:59
AntonStasheuski merged commit 44c1c2b into main on Feb 21, 2025
1 check passed
akxcv pushed a commit that referenced this pull request Feb 24, 2025
…-test

Performance benchmarks: Nekko vs vLLM vs Ollama

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues:

  • Nekko vs Ollama
  • Nekko vs vLLM
  • Nekko vs TGI

2 participants