
Conversation

@ggerganov (Member) commented Oct 31, 2025

Benchmark automation for multiple models as done in #16578.

  • The script downloads the models from HF
  • The script should be started from a build folder
  • Use --quick for a faster pass with shorter benches
  • Add extra models to be benched in models-extra.txt in the current folder (see the sketch below)
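
A minimal sketch of what models-extra.txt might look like, assuming a one-HF-repo-per-line format matching the repo names in the sample output below; the script itself (scripts/bench-models.sh) defines the authoritative format:

```bash
# Hypothetical example: create models-extra.txt in the build folder.
# Assumes one Hugging Face repo name per line; check scripts/bench-models.sh
# for the exact format it expects.
cat > models-extra.txt <<'EOF'
ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF
ggml-org/gemma-3-4b-it-qat-GGUF
EOF
```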

Sample usage and output:

bash ../scripts/bench-models.sh --quick
Output

ggml-org/gpt-oss-20b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-20b-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.211 | 2428.42 | 0.241 | 132.94 | 0.452 | 1204.73 |
| 512 | 32 | 2 | 1088 | 0.374 | 2738.91 | 0.364 | 175.77 | 0.738 | 1474.29 |
| 512 | 32 | 4 | 2176 | 0.707 | 2897.74 | 0.583 | 219.38 | 1.290 | 1686.54 |
| 4096 | 32 | 1 | 4128 | 1.502 | 2727.05 | 0.260 | 122.92 | 1.762 | 2342.38 |
| 4096 | 32 | 2 | 8256 | 2.981 | 2748.23 | 0.397 | 161.15 | 3.378 | 2444.06 |
| 4096 | 32 | 4 | 16512 | 5.944 | 2756.51 | 0.647 | 197.71 | 6.591 | 2505.18 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2797.52 ± 5.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 134.10 ± 0.21 |

build: 6342c21c0 (6901)
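
For reference, a rough sketch of the direct invocations behind numbers like the ones above, reconstructed from the parameters printed in the headers; the flag spellings and the local model path are assumptions, so consult --help on your build and the script itself for the exact commands:

```bash
# Sketch only: approximate manual equivalents of the two benchmarks above.
# MODEL is a hypothetical local path; bench-models.sh downloads models from HF.
MODEL=./gpt-oss-20b.gguf

# llama-batched-bench: PP=512,4096, TG=32, batch sizes 1,2,4 (as in the table),
# with n_kv_max=20480, n_batch=2048, n_ubatch=2048, flash_attn on, 16 threads
./bin/llama-batched-bench -m "$MODEL" -c 20480 -b 2048 -ub 2048 \
    -npp 512,4096 -ntg 32 -npl 1,2,4 -fa 1 -t 16

# llama-bench: the pp2048 and tg32 rows with the same ubatch/fa/mmap settings
./bin/llama-bench -m "$MODEL" -p 2048 -n 32 -ub 2048 -fa 1 -mmp 0 -t 16
```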

ggml-org/gpt-oss-120b-GGUF

Model: https://huggingface.co/ggml-org/gpt-oss-120b-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.414 | 1236.56 | 0.357 | 89.62 | 0.771 | 705.46 |
| 512 | 32 | 2 | 1088 | 0.666 | 1536.69 | 0.527 | 121.48 | 1.193 | 911.83 |
| 512 | 32 | 4 | 2176 | 1.172 | 1747.03 | 0.826 | 154.87 | 1.999 | 1088.67 |
| 4096 | 32 | 1 | 4128 | 2.483 | 1649.67 | 0.385 | 83.09 | 2.868 | 1439.31 |
| 4096 | 32 | 2 | 8256 | 4.949 | 1655.45 | 0.572 | 111.79 | 5.521 | 1495.38 |
| 4096 | 32 | 4 | 16512 | 9.872 | 1659.65 | 0.916 | 139.78 | 10.788 | 1530.64 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 1688.86 ± 2.91 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 90.15 ± 0.24 |

build: 6342c21c0 (6901)

ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.235 | 2181.59 | 0.393 | 81.45 | 0.628 | 866.85 |
| 512 | 32 | 2 | 1088 | 0.396 | 2584.34 | 0.550 | 116.37 | 0.946 | 1149.87 |
| 512 | 32 | 4 | 2176 | 0.729 | 2810.49 | 0.815 | 157.04 | 1.544 | 1409.54 |
| 4096 | 32 | 1 | 4128 | 1.825 | 2244.76 | 0.440 | 72.79 | 2.264 | 1823.07 |
| 4096 | 32 | 2 | 8256 | 3.633 | 2254.65 | 0.633 | 101.18 | 4.266 | 1935.34 |
| 4096 | 32 | 4 | 16512 | 7.255 | 2258.30 | 0.970 | 131.96 | 8.225 | 2007.53 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2527.20 ± 4.55 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 82.17 ± 0.07 |

build: 6342c21c0 (6901)

ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

Model: https://huggingface.co/ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.337 | 1518.90 | 0.405 | 79.04 | 0.742 | 733.22 |
| 512 | 32 | 2 | 1088 | 0.639 | 1603.62 | 0.469 | 136.37 | 1.108 | 982.06 |
| 512 | 32 | 4 | 2176 | 1.248 | 1640.90 | 0.570 | 224.52 | 1.818 | 1196.79 |
| 4096 | 32 | 1 | 4128 | 2.684 | 1526.12 | 0.428 | 74.77 | 3.112 | 1326.52 |
| 4096 | 32 | 2 | 8256 | 5.354 | 1530.08 | 0.514 | 124.48 | 5.868 | 1406.93 |
| 4096 | 32 | 4 | 16512 | 10.694 | 1532.06 | 0.655 | 195.49 | 11.349 | 1454.95 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 1584.53 ± 1.50 |
| qwen2 7B Q8_0 | 7.54 GiB | 7.62 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 78.82 ± 0.10 |

build: 6342c21c0 (6901)

ggml-org/gemma-3-4b-it-qat-GGUF

Model: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF

  • llama-batched-bench

main: n_kv_max = 20480, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---:|---:|--:|-----:|-------:|---------:|-------:|---------:|----:|------:|
| 512 | 32 | 1 | 544 | 0.185 | 2761.91 | 0.236 | 135.31 | 0.422 | 1289.51 |
| 512 | 32 | 2 | 1088 | 0.343 | 2983.54 | 0.303 | 211.46 | 0.646 | 1684.54 |
| 512 | 32 | 4 | 2176 | 0.661 | 3098.04 | 0.397 | 322.22 | 1.058 | 2056.11 |
| 4096 | 32 | 1 | 4128 | 1.396 | 2933.16 | 0.249 | 128.41 | 1.646 | 2508.43 |
| 4096 | 32 | 2 | 8256 | 2.766 | 2962.07 | 0.320 | 199.73 | 3.086 | 2675.25 |
| 4096 | 32 | 4 | 16512 | 5.517 | 2969.71 | 0.445 | 287.38 | 5.962 | 2769.33 |
  • llama-bench
| model | size | params | backend | threads | n_ubatch | fa | mmap | test | t/s |
|-------|-----:|-------:|---------|--------:|---------:|---:|-----:|------|----:|
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Metal,BLAS | 16 | 2048 | 1 | 0 | pp2048 | 2941.75 ± 14.57 |
| gemma3 4B Q4_0 | 2.35 GiB | 3.88 B | Metal,BLAS | 16 | 2048 | 1 | 0 | tg32 | 135.36 ± 0.20 |

build: 6342c21c0 (6901)

@github-actions bot added the script (Script related) and examples labels on Oct 31, 2025
@ggerganov merged commit 7fd205a into master on Nov 1, 2025 (70 of 72 checks passed)
@ggerganov deleted the gg/script-bench-models branch on November 1, 2025