
tests : enhance llama-bench with separate timings (pp/gen t/s), added n_threads_batch #14219


Open: wants to merge 3 commits into master

Conversation


@thad0ctor thad0ctor commented Jun 16, 2025

  • Added gen t/s and pp t/s outputs to llama-bench

  • Added n-threads-batch args to llama-bench

    Minor improvements to llama-bench

    New Features

    1. Separate Prompt/Generation Timing: Provides detailed performance metrics by separately measuring prompt processing and token generation (a minimal sketch of the idea follows below).
    2. n_threads_batch: Adds n_threads_batch to the available arguments.
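
To make the first feature concrete, here is a minimal, self-contained sketch of the separate-timing idea. It is not the actual llama-bench code; process_prompt(), generate_tokens(), and time_ns() are illustrative stand-ins for the real decode loops and timing helpers:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // Stand-ins for the real work; in llama-bench these would be the prompt
    // decode and the token generation loops.
    static void process_prompt(int /*n_prompt*/) { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }
    static void generate_tokens(int /*n_gen*/)   { std::this_thread::sleep_for(std::chrono::milliseconds(40)); }

    static uint64_t time_ns() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    int main() {
        const int n_prompt = 128;
        const int n_gen    = 128;

        // Time the prompt pass on its own ...
        uint64_t t0 = time_ns();
        process_prompt(n_prompt);
        const uint64_t t_prompt_ns = time_ns() - t0;

        // ... and the generation pass on its own.
        t0 = time_ns();
        generate_tokens(n_gen);
        const uint64_t t_gen_ns = time_ns() - t0;

        // Each phase gets its own tokens/second figure (the proposed pp t/s and tg t/s columns).
        std::printf("pp t/s: %.2f\n", 1e9 * n_prompt / (double) t_prompt_ns);
        std::printf("tg t/s: %.2f\n", 1e9 * n_gen    / (double) t_gen_ns);
        return 0;
    }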

Example output:

bash -c './bin/llama-bench -m ../models/test-model.gguf -p 128 -n 128 -t 2,4 --n-threads-batch 2,4'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | th_batch |            test |                  t/s |               pp t/s |               tg t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | --------------: | -------------------: | -------------------: | -------------------: |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        2 |           pp128 |   88578.06 ± 3581.02 |   88582.08 ± 3580.75 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        2 |           tg128 |      3168.12 ± 11.97 |                   N/A |      3168.13 ± 11.96 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        4 |           pp128 |    90262.38 ± 507.55 |    90266.08 ± 507.92 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        4 |           tg128 |      3050.96 ± 52.88 |                   N/A |      3050.97 ± 52.89 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        2 |           pp128 |    90142.95 ± 685.34 |    90146.65 ± 685.78 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        2 |           tg128 |      3075.67 ± 37.75 |                   N/A |      3075.68 ± 37.75 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        4 |           pp128 |   89512.40 ± 1155.35 |   89515.78 ± 1155.44 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        4 |           tg128 |      3025.98 ± 45.42 |                   N/A |      3025.99 ± 45.42 |

added gen t/s and pp t/s outputs, n-theads-batch to llama-bench
@ericcurtin ericcurtin requested a review from Copilot June 16, 2025 21:29

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds separate timing measurements for prompt processing and token generation in llama-bench and introduces a new command‑line argument (n_threads_batch) for batch thread specification.

  • Added a new parameter “n_threads_batch” across configuration, parsing, test execution, and output formatting.
  • Integrated separate metrics for prompt and generation timing (samples_prompt_ns, samples_gen_ns) and updated the markdown and SQL printers to display the new metrics.
Comments suppressed due to low confidence (2)

tools/llama-bench/llama-bench.cpp:1617

  • [nitpick] Mapping the field 'n_threads_batch' to the alias 'th_batch' might be unclear; consider either using a more descriptive alias or adding an inline comment to explain the abbreviation.
if (field == "n_threads_batch") {

tools/llama-bench/llama-bench.cpp:2070

  • [nitpick] Although separate timing measurements for prompt and generation are implemented, adding inline comments to explain the timing logic can improve clarity for future maintainers.
uint64_t t_start = get_time_ns() - t_start;

Comment on lines +1249 to +1260
        if (samples_prompt_ns.empty() || n_prompt == 0) return {};
        std::vector<double> ts;
        std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_prompt / t; });
        return ts;
    }

    std::vector<double> get_gen_ts() const {
        if (samples_gen_ns.empty() || n_gen == 0) return {};
        std::vector<double> ts;
        std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_gen / t; });

Copilot AI Jun 16, 2025


The get_prompt_ts() and get_gen_ts() functions contain very similar code. Consider extracting a common helper function to reduce duplication.

Suggested change

    -        if (samples_prompt_ns.empty() || n_prompt == 0) return {};
    -        std::vector<double> ts;
    -        std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
    -                       [this](uint64_t t) { return 1e9 * n_prompt / t; });
    -        return ts;
    -    }
    -    std::vector<double> get_gen_ts() const {
    -        if (samples_gen_ns.empty() || n_gen == 0) return {};
    -        std::vector<double> ts;
    -        std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
    -                       [this](uint64_t t) { return 1e9 * n_gen / t; });
    +        return get_ts_helper(samples_prompt_ns, n_prompt);
    +    }
    +    std::vector<double> get_gen_ts() const {
    +        return get_ts_helper(samples_gen_ns, n_gen);
    +    }
    +private:
    +    std::vector<double> get_ts_helper(const std::vector<uint64_t>& samples, int n) const {
    +        if (samples.empty() || n == 0) return {};
    +        std::vector<double> ts;
    +        std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
    +                       [n](uint64_t t) { return 1e9 * n / t; });
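
Read as a whole, the suggested refactor amounts to roughly the following compilable sketch. The method and field names mirror the snippet above; the surrounding timing_result struct is an illustrative assumption, not the PR's actual class:

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    struct timing_result {
        int n_prompt = 0;
        int n_gen    = 0;
        std::vector<uint64_t> samples_prompt_ns;
        std::vector<uint64_t> samples_gen_ns;

        std::vector<double> get_prompt_ts() const { return get_ts_helper(samples_prompt_ns, n_prompt); }
        std::vector<double> get_gen_ts()    const { return get_ts_helper(samples_gen_ns,    n_gen); }

      private:
        // Shared conversion: one nanosecond sample per repetition -> tokens per second.
        static std::vector<double> get_ts_helper(const std::vector<uint64_t> & samples, int n) {
            if (samples.empty() || n == 0) {
                return {};
            }
            std::vector<double> ts;
            std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
                           [n](uint64_t t) { return 1e9 * n / t; });
            return ts;
        }
    };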


Collaborator


Optional, possibly for a follow-on PR

@slaren
Member

slaren commented Jun 17, 2025

llama-bench does not have a n_threads_batch parameter intentionally. Having a single parameter makes it clear the number of threads that was used for the test, and removes an additional column that does not have any useful information. I also don't think it is desirable to have separate columns for pp and tg results, since it makes the output harder to read and too wide to fit in the terminal or in a github comment.

@thad0ctor
Author

> llama-bench does not have a n_threads_batch parameter intentionally. Having a single parameter makes it clear the number of threads that was used for the test, and removes an additional column that does not have any useful information. I also don't think it is desirable to have separate columns for pp and tg results, since it makes the output harder to read and too wide to fit in the terminal or in a github comment.

n-threads-batch:

I think we have two different schools of thought. I view llama-bench as a tool to get detailed information to fine-tune the performance of a model for a certain system, model, multi-model server, workflow, etc. As such, more parameters that can give the user insight into their performance are a value-add. If you are worried about this causing confusion, I can update the code to only show the column when the parameter was passed.

pp/gen t/s:

Similarly, this is an effective data point and (if such a thing exists) an industry standard when it comes to reviewing benchmark performance of new models, quants, etc. It is an incredibly valuable data point for assessing models, workflows, etc.

If you are that worried, I can add functionality to hide the extra token/s columns behind another parameter. I view these as elementary data points, though, and standard information provided by all backends when interfacing with them.

@slaren
Member

slaren commented Jun 17, 2025

I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed the best with generation with --threads, and the number of threads that performed the best with prompt processing with --threads-batch. Having different parameters in llama-bench does not add any information, and if anything makes it harder to test pp and tg at the same time with different numbers of threads, since the number of combinations increases dramatically. I don't see any point to separating the results of pp and tg. If you want to process the results of llama-bench with another application, you should use a different output format such as json.

@thad0ctor
Author

thad0ctor commented Jun 18, 2025

> I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed the best with generation with --threads, and the number of threads that performed the best with prompt processing with --threads-batch. Having different parameters in llama-bench does not add any information, and if anything makes it harder to test pp and tg at the same time with different numbers of threads, since the number of combinations increases dramatically. I don't see any point to separating the results of pp and tg. If you want to process the results of llama-bench with another application, you should use a different output format such as json.

--threads-batch is not a llama-bench parameter though

./llama-bench --model /Models/lmstudio-community/Mistral-Small-3.1-24B-Instruct-2503-GGUF/Mistral-Small-3.1-24B-Instruct-2503-Q8_0.gguf --n-prompt 1000 --n-gen 0 --batch-size 1 --n-gpu-layers 99 --tensor-split 3,3,3 --flash-attn 1 --threads 4-24+4 --output md **--threads-batch**
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /home/rgilbreth/Desktop/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /home/rgilbreth/Desktop/llama.cpp/build/bin/libggml-cpu-icelake.so
**error: invalid parameter for argument: --threads-batch**

The use case may not appeal to you, but adding this allows users to benchmark a certain parameter; this is a benchmarking tool, after all. If someone wants to explore the interrelation between threads-batch and other settings, have at it: these are parameters that work with llama-cli or the server, so why not be able to test them? The earlier output I showed does show minor differences in performance when combining various threads/threads-batch combinations.

Regarding your last comment about pp/tg, how can one process the data differently if the default output format (json, md, etc.) doesn't provide the level of fidelity for a user to even estimate pp/tg t/s with a script? See the default output below; you can only infer pp t/s when n_gen is 0 (a short sketch of the arithmetic follows the JSON):

[
  {
    "build_commit": "fb85a288",
    "build_number": 5662,
    "cpu_info": "AMD Ryzen Threadripper PRO 7965WX 24-Cores",
    "gpu_info": "NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090",
    "backends": "CUDA",
    "model_filename": "/Qwen3-32B-GGUF/Qwen3-32B-Q6_K.gguf",
    "model_type": "qwen3 32B Q6_K",
    "model_size": 26877331456,
    "model_n_params": 32762123264,
    "n_batch": 1,
    "n_ubatch": 512,
    "n_threads": 4,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": true,
    "tensor_split": "3.00",
    "tensor_buft_overrides": "none",
    "defrag_thold": -1.000000,
    "use_mmap": true,
    "embeddings": false,
    "no_op_offload": 0,
    "n_prompt": 1000,
    "n_gen": 0,
    "n_depth": 0,
    "test_time": "2025-06-18T00:41:59Z",
    "avg_ns": 22166922142,
    "stddev_ns": 169917563,
    "avg_ts": 45.114362,
    "stddev_ts": 0.342371,
    "samples_ns": [ 22469591638, 22117340674, 22080551147, 22077929320, 22089197932 ],
    "samples_ts": [ 44.5046, 45.2134, 45.2887, 45.2941, 45.271 ]
  }
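
For reference, the arithmetic a script would have to apply to this output is simply t/s = 1e9 * n_tokens / ns, and it only yields pp t/s here because n_gen is 0, i.e. the whole run is prompt processing. A tiny sketch, using the values copied from the JSON above:

    #include <cstdint>
    #include <cstdio>

    int main() {
        const int      n_prompt = 1000;           // "n_prompt" from the JSON above
        const uint64_t avg_ns   = 22166922142ULL; // "avg_ns" from the JSON above

        // With n_gen = 0, all of avg_ns is prompt processing, so this is pp t/s.
        std::printf("pp t/s ~ %.2f\n", 1e9 * n_prompt / (double) avg_ns); // ~45.11, in line with the reported "avg_ts"
        return 0;
    }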

You may not see the utility in this functionality, but if you look at enough models online you see users routinely measuring benchmark performance in gen and pp t/s. Averaging these into one figure limits one's ability to refine settings, model selection, etc. for a certain workflow/use case.
