
Conversation

@ggerganov (Member) commented Oct 31, 2025

Benchmark automation for multiple models as done in #16578.

  • The script downloads the models from HF
  • The script should be started from a build folder
  • Use --quick for a faster pass with shorter benches
  • Add extra models to be benched in models-extra.txt in the current folder (see the sketch below)
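
A minimal sketch of what models-extra.txt might look like, assuming a one-HF-repo-per-line format matching the repo names in the sample output below; the script itself (scripts/bench-models.sh) defines the authoritative format:

```bash
# Hypothetical example: create models-extra.txt in the build folder.
# Assumes one Hugging Face repo name per line; check scripts/bench-models.sh
# for the exact format it expects.
cat > models-extra.txt <<'EOF'
ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF
ggml-org/gemma-3-4b-it-qat-GGUF
EOF
```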

Sample usage and output:

bash ../scripts/bench-models.sh --quick
Output

ggml-org/gpt-oss-20b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.211 | 2428.42 | 0.241 | 132.94 | 0.452 | 1204.73 |
| 512 | 32 | 2 | 1088 | 0.374 | 2738.91 | 0.364 | 175.77 | 0.738 | 1474.29 |
| 512 | 32 | 4 | 2176 | 0.707 | 2897.74 | 0.583 | 219.38 | 1.290 | 1686.54 |
| 4096 | 32 | 1 | 4128 | 1.502 | 2727.05 | 0.260 | 122.92 | 1.762 | 2342.38 |
| 4096 | 32 | 2 | 8256 | 2.981 | 2748.23 | 0.397 | 161.15 | 3.378 | 2444.06 |
| 4096 | 32 | 4 | 16512 | 5.944 | 2756.51 | 0.647 | 197.71 | 6.591 | 2505.18 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2797.52 ± 5.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 134.10 ± 0.21 |

build: 6342c21c0 (6901)
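
For reference, a rough sketch of the direct invocations behind numbers like the ones above, reconstructed from the parameters printed in the headers; the flag spellings and the local model path are assumptions, so consult --help on your build and the script itself for the exact commands:

```bash
# Sketch only: approximate manual equivalents of the two benchmarks above.
# MODEL is a hypothetical local path; bench-models.sh downloads models from HF.
MODEL=./gpt-oss-20b.gguf

# llama-batched-bench: PP=512,4096, TG=32, batch sizes 1,2,4 (as in the table),
# with n_kv_max=20480, n_batch=2048, n_ubatch=2048, flash_attn on, 16 threads
./bin/llama-batched-bench -m "$MODEL" -c 20480 -b 2048 -ub 2048 \
    -npp 512,4096 -ntg 32 -npl 1,2,4 -fa 1 -t 16

# llama-bench: the pp2048 and tg32 rows with the same ubatch/fa/mmap settings
./bin/llama-bench -m "$MODEL" -p 2048 -n 32 -ub 2048 -fa 1 -mmp 0 -t 16
```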

ggml-org/gpt-oss-120b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.414 | 1236.56 | 0.357 | 89.62 | 0.771 | 705.46 |
| 512 | 32 | 2 | 1088 | 0.666 | 1536.69 | 0.527 | 121.48 | 1.193 | 911.83 |
| 512 | 32 | 4 | 2176 | 1.172 | 1747.03 | 0.826 | 154.87 | 1.999 | 1088.67 |
| 4096 | 32 | 1 | 4128 | 2.483 | 1649.67 | 0.385 | 83.09 | 2.868 | 1439.31 |
| 4096 | 32 | 2 | 8256 | 4.949 | 1655.45 | 0.572 | 111.79 | 5.521 | 1495.38 |
| 4096 | 32 | 4 | 16512 | 9.872 | 1659.65 | 0.916 | 139.78 | 10.788 | 1530.64 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 1688.86 ± 2.91 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 90.15 ± 0.24 |

build: 6342c21c0 (6901)

ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.235 | 2181.59 | 0.393 | 81.45 | 0.628 | 866.85 |
| 512 | 32 | 2 | 1088 | 0.396 | 2584.34 | 0.550 | 116.37 | 0.946 | 1149.87 |
| 512 | 32 | 4 | 2176 | 0.729 | 2810.49 | 0.815 | 157.04 | 1.544 | 1409.54 |
| 4096 | 32 | 1 | 4128 | 1.825 | 2244.76 | 0.440 | 72.79 | 2.264 | 1823.07 |
| 4096 | 32 | 2 | 8256 | 3.633 | 2254.65 | 0.633 | 101.18 | 4.266 | 1935.34 |
| 4096 | 32 | 4 | 16512 | 7.255 | 2258.30 | 0.970 | 131.96 | 8.225 | 2007.53 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2527.20 ± 4.55 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 82.17 ± 0.07 |

build: 6342c21c0 (6901)

ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.337 | 1518.90 | 0.405 | 79.04 | 0.742 | 733.22 |
| 512 | 32 | 2 | 1088 | 0.639 | 1603.62 | 0.469 | 136.37 | 1.108 | 982.06 |
| 512 | 32 | 4 | 2176 | 1.248 | 1640.90 | 0.570 | 224.52 | 1.818 | 1196.79 |
| 4096 | 32 | 1 | 4128 | 2.684 | 1526.12 | 0.428 | 74.77 | 3.112 | 1326.52 |
| 4096 | 32 | 2 | 8256 | 5.354 | 1530.08 | 0.514 | 124.48 | 5.868 | 1406.93 |
| 4096 | 32 | 4 | 16512 | 10.694 | 1532.06 | 0.655 | 195.49 | 11.349 | 1454.95 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 1584.53 ± 1.50 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 78.82 ± 0.10 |

build: 6342c21c0 (6901)

ggml-org/gemma-3-4b-it-qat-GGUF

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.185 | 2761.91 | 0.236 | 135.31 | 0.422 | 1289.51 |
| 512 | 32 | 2 | 1088 | 0.343 | 2983.54 | 0.303 | 211.46 | 0.646 | 1684.54 |
| 512 | 32 | 4 | 2176 | 0.661 | 3098.04 | 0.397 | 322.22 | 1.058 | 2056.11 |
| 4096 | 32 | 1 | 4128 | 1.396 | 2933.16 | 0.249 | 128.41 | 1.646 | 2508.43 |
| 4096 | 32 | 2 | 8256 | 2.766 | 2962.07 | 0.320 | 199.73 | 3.086 | 2675.25 |
| 4096 | 32 | 4 | 16512 | 5.517 | 2969.71 | 0.445 | 287.38 | 5.962 | 2769.33 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2941.75 ± 14.57 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 135.36 ± 0.20 |

build: 6342c21c0 (6901)

@github-actions bot added the script (Script related) and examples labels on Oct 31, 2025
@ggerganov merged commit 7fd205a into master on Nov 1, 2025 (70 of 72 checks passed)
@ggerganov deleted the gg/script-bench-models branch on November 1, 2025