
tests : enhance llama-bench with separate timings (pp/gen t/s), added n_threads_batch #14219


Open: wants to merge 3 commits into master

Conversation


@thad0ctor thad0ctor commented Jun 16, 2025

  • Added gen t/s and pp t/s outputs to llama-bench

  • Added n-threads-batch args to llama-bench

    Minor improvements to llama-bench

    New Features

    1. Separate Prompt/Generation Timing: Provides detailed performance metrics by separately measuring prompt processing and token generation (a minimal sketch of the idea follows below).
    2. n_threads_batch: Adds n_threads_batch to the available arguments.
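
To make the first feature concrete, here is a minimal, self-contained sketch of the separate-timing idea. It is not the actual llama-bench code; process_prompt(), generate_tokens(), and time_ns() are illustrative stand-ins for the real decode loops and timing helpers:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // Stand-ins for the real work; in llama-bench these would be the prompt
    // decode and the token generation loops.
    static void process_prompt(int /*n_prompt*/) { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }
    static void generate_tokens(int /*n_gen*/)   { std::this_thread::sleep_for(std::chrono::milliseconds(40)); }

    static uint64_t time_ns() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    int main() {
        const int n_prompt = 128;
        const int n_gen    = 128;

        // Time the prompt pass on its own ...
        uint64_t t0 = time_ns();
        process_prompt(n_prompt);
        const uint64_t t_prompt_ns = time_ns() - t0;

        // ... and the generation pass on its own.
        t0 = time_ns();
        generate_tokens(n_gen);
        const uint64_t t_gen_ns = time_ns() - t0;

        // Each phase gets its own tokens/second figure (the proposed pp t/s and tg t/s columns).
        std::printf("pp t/s: %.2f\n", 1e9 * n_prompt / (double) t_prompt_ns);
        std::printf("tg t/s: %.2f\n", 1e9 * n_gen    / (double) t_gen_ns);
        return 0;
    }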

Example output:

bash -c './bin/llama-bench -m ../models/test-model.gguf -p 128 -n 128 -t 2,4 --n-threads-batch 2,4'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
| model                          |       size |     params | backend    | ngl | threads | th_batch |            test |                  t/s |               pp t/s |               tg t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | --------------: | -------------------: | -------------------: | -------------------: |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        2 |           pp128 |   88578.06 ± 3581.02 |   88582.08 ± 3580.75 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        2 |           tg128 |      3168.12 ± 11.97 |                   N/A |      3168.13 ± 11.96 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        4 |           pp128 |    90262.38 ± 507.55 |    90266.08 ± 507.92 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       2 |        4 |           tg128 |      3050.96 ± 52.88 |                   N/A |      3050.97 ± 52.89 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        2 |           pp128 |    90142.95 ± 685.34 |    90146.65 ± 685.78 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        2 |           tg128 |      3075.67 ± 37.75 |                   N/A |      3075.68 ± 37.75 |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        4 |           pp128 |   89512.40 ± 1155.35 |   89515.78 ± 1155.44 |                   N/A |
| llama ?B Q4_0                  |  17.50 MiB |    24.41 M | CUDA       |  99 |       4 |        4 |           tg128 |      3025.98 ± 45.42 |                   N/A |      3025.99 ± 45.42 |

added gen t/s and pp t/s outputs, n-theads-batch to llama-bench
@ericcurtin ericcurtin requested a review from Copilot June 16, 2025 21:29

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds separate timing measurements for prompt processing and token generation in llama-bench and introduces a new command‑line argument (n_threads_batch) for batch thread specification.

  • Added a new parameter “n_threads_batch” across configuration, parsing, test execution, and output formatting.
  • Integrated separate metrics for prompt and generation timing (samples_prompt_ns, samples_gen_ns) and updated the markdown and SQL printers to display the new metrics.
Comments suppressed due to low confidence (2)

tools/llama-bench/llama-bench.cpp:1617

  • [nitpick] Mapping the field 'n_threads_batch' to the alias 'th_batch' might be unclear; consider either using a more descriptive alias or adding an inline comment to explain the abbreviation.
if (field == "n_threads_batch") {

tools/llama-bench/llama-bench.cpp:2070

  • [nitpick] Although separate timing measurements for prompt and generation are implemented, adding inline comments to explain the timing logic can improve clarity for future maintainers.
uint64_t t_start = get_time_ns() - t_start;

Comment on lines +1249 to +1260
        if (samples_prompt_ns.empty() || n_prompt == 0) return {};
        std::vector<double> ts;
        std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_prompt / t; });
        return ts;
    }

    std::vector<double> get_gen_ts() const {
        if (samples_gen_ns.empty() || n_gen == 0) return {};
        std::vector<double> ts;
        std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_gen / t; });

Copilot AI Jun 16, 2025


The get_prompt_ts() and get_gen_ts() functions contain very similar code. Consider extracting a common helper function to reduce duplication.

Suggested change

    -        if (samples_prompt_ns.empty() || n_prompt == 0) return {};
    -        std::vector<double> ts;
    -        std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
    -                       [this](uint64_t t) { return 1e9 * n_prompt / t; });
    -        return ts;
    -    }
    -    std::vector<double> get_gen_ts() const {
    -        if (samples_gen_ns.empty() || n_gen == 0) return {};
    -        std::vector<double> ts;
    -        std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
    -                       [this](uint64_t t) { return 1e9 * n_gen / t; });
    +        return get_ts_helper(samples_prompt_ns, n_prompt);
    +    }
    +    std::vector<double> get_gen_ts() const {
    +        return get_ts_helper(samples_gen_ns, n_gen);
    +    }
    +private:
    +    std::vector<double> get_ts_helper(const std::vector<uint64_t>& samples, int n) const {
    +        if (samples.empty() || n == 0) return {};
    +        std::vector<double> ts;
    +        std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
    +                       [n](uint64_t t) { return 1e9 * n / t; });
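
Read as a whole, the suggested refactor amounts to roughly the following compilable sketch. The method and field names mirror the snippet above; the surrounding timing_result struct is an illustrative assumption, not the PR's actual class:

    #include <algorithm>
    #include <cstdint>
    #include <iterator>
    #include <vector>

    struct timing_result {
        int n_prompt = 0;
        int n_gen    = 0;
        std::vector<uint64_t> samples_prompt_ns;
        std::vector<uint64_t> samples_gen_ns;

        std::vector<double> get_prompt_ts() const { return get_ts_helper(samples_prompt_ns, n_prompt); }
        std::vector<double> get_gen_ts()    const { return get_ts_helper(samples_gen_ns,    n_gen); }

      private:
        // Shared conversion: one nanosecond sample per repetition -> tokens per second.
        static std::vector<double> get_ts_helper(const std::vector<uint64_t> & samples, int n) {
            if (samples.empty() || n == 0) {
                return {};
            }
            std::vector<double> ts;
            std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
                           [n](uint64_t t) { return 1e9 * n / t; });
            return ts;
        }
    };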


Collaborator


Optional, possibly for a follow-on PR

@slaren
Member

slaren commented Jun 17, 2025

llama-bench does not have a n_threads_batch parameter intentionally. Having a single parameter makes it clear the number of threads that was used for the test, and removes an additional column that does not have any useful information. I also don't think it is desirable to have separate columns for pp and tg results, since it makes the output harder to read and too wide to fit in the terminal or in a github comment.

@thad0ctor
Author

> llama-bench does not have a n_threads_batch parameter intentionally. Having a single parameter makes it clear the number of threads that was used for the test, and removes an additional column that does not have any useful information. I also don't think it is desirable to have separate columns for pp and tg results, since it makes the output harder to read and too wide to fit in the terminal or in a github comment.

n-threads-batch:

I think we have two different schools of thought. I view llama-bench as a tool to get detailed information to fine-tune the performance of a model for a certain system, model, multi-model server, workflow, etc. As such, more parameters that can give the user insight into their performance are a value-add. If you are worried about this causing confusion, I can update the code to only show the column when the parameter was passed.

pp/gen t/s:

Similarly, this is an effective data point and (if such a thing exists) an industry standard when it comes to reviewing benchmark performance of new models, quants, etc. It is an incredibly valuable data point for assessing models, workflows, etc.

If you are that worried, I can add functionality to hide the extra token/s columns behind another parameter. I view these as elementary data points, though, and standard information provided by all backends when interfacing with them.

@slaren
Member

slaren commented Jun 17, 2025

I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed the best with generation with --threads, and the number of threads that performed the best with prompt processing with --threads-batch. Having different parameters in llama-bench does not add any information, and if anything makes it harder to test pp and tg at the same time with different numbers of threads, since the number of combinations increases dramatically. I don't see any point to separating the results of pp and tg. If you want to process the results of llama-bench with another application, you should use a different output format such as json.

@thad0ctor
Author

thad0ctor commented Jun 18, 2025

> I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed the best with generation with --threads, and the number of threads that performed the best with prompt processing with --threads-batch. Having different parameters in llama-bench does not add any information, and if anything makes it harder to test pp and tg at the same time with different numbers of threads, since the number of combinations increases dramatically. I don't see any point to separating the results of pp and tg. If you want to process the results of llama-bench with another application, you should use a different output format such as json.

--threads-batch is not a llama-bench parameter though

./llama-bench --model /Models/lmstudio-community/Mistral-Small-3.1-24B-Instruct-2503-GGUF/Mistral-Small-3.1-24B-Instruct-2503-Q8_0.gguf --n-prompt 1000 --n-gen 0 --batch-size 1 --n-gpu-layers 99 --tensor-split 3,3,3 --flash-attn 1 --threads 4-24+4 --output md **--threads-batch**
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
load_backend: loaded CUDA backend from /home/rgilbreth/Desktop/llama.cpp/build/bin/libggml-cuda.so
load_backend: loaded CPU backend from /home/rgilbreth/Desktop/llama.cpp/build/bin/libggml-cpu-icelake.so
**error: invalid parameter for argument: --threads-batch**

The use case may not appeal to you, but adding this allows users to benchmark a certain parameter; this is a benchmarking tool, after all. If someone wants to explore the interrelation between threads-batch and other settings, have at it: these are parameters that work with llama-cli or the server, so why not be able to test them? The earlier output I showed does show minor differences in performance when combining various threads/threads-batch combinations.

Regarding your last comment about pp/tg, how can one process the data differently if the default output format (json, md, etc.) doesn't provide the level of fidelity for a user to even estimate pp/tg t/s with a script? See the default output below; you can only infer pp t/s when n_gen is 0 (a short sketch of the arithmetic follows the JSON):

[
  {
    "build_commit": "fb85a288",
    "build_number": 5662,
    "cpu_info": "AMD Ryzen Threadripper PRO 7965WX 24-Cores",
    "gpu_info": "NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090, NVIDIA GeForce RTX 5090",
    "backends": "CUDA",
    "model_filename": "/Qwen3-32B-GGUF/Qwen3-32B-Q6_K.gguf",
    "model_type": "qwen3 32B Q6_K",
    "model_size": 26877331456,
    "model_n_params": 32762123264,
    "n_batch": 1,
    "n_ubatch": 512,
    "n_threads": 4,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": true,
    "tensor_split": "3.00",
    "tensor_buft_overrides": "none",
    "defrag_thold": -1.000000,
    "use_mmap": true,
    "embeddings": false,
    "no_op_offload": 0,
    "n_prompt": 1000,
    "n_gen": 0,
    "n_depth": 0,
    "test_time": "2025-06-18T00:41:59Z",
    "avg_ns": 22166922142,
    "stddev_ns": 169917563,
    "avg_ts": 45.114362,
    "stddev_ts": 0.342371,
    "samples_ns": [ 22469591638, 22117340674, 22080551147, 22077929320, 22089197932 ],
    "samples_ts": [ 44.5046, 45.2134, 45.2887, 45.2941, 45.271 ]
  }
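
For reference, the arithmetic a script would have to apply to this output is simply t/s = 1e9 * n_tokens / ns, and it only yields pp t/s here because n_gen is 0, i.e. the whole run is prompt processing. A tiny sketch, using the values copied from the JSON above:

    #include <cstdint>
    #include <cstdio>

    int main() {
        const int      n_prompt = 1000;           // "n_prompt" from the JSON above
        const uint64_t avg_ns   = 22166922142ULL; // "avg_ns" from the JSON above

        // With n_gen = 0, all of avg_ns is prompt processing, so this is pp t/s.
        std::printf("pp t/s ~ %.2f\n", 1e9 * n_prompt / (double) avg_ns); // ~45.11, in line with the reported "avg_ts"
        return 0;
    }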

You may not see the utility in this functionality, but if you look at enough models online you see users routinely measuring benchmark performance in gen and pp t/s. Averaging these into one figure limits one's ability to refine settings, model selection, etc. for a certain workflow/use case.
