llama-bench : add test measuring token generation rate at given prompt length #11126
Conversation
The other printers (sql, json, etc) would also need to be updated.
@slaren Can you be more specific?
The test type needs to be exported in these printers as well, since …
I guess another option is to add a "test" column in all printers with the same values as displayed in the default console output. Any specific reason it's not included there?
Yes, that's what I meant when I said that the test type would need to be exported in these printers. There isn't a test column/field at the moment because it is not necessary.
// Format the human-readable test label, e.g. "pp512", "tg128",
// "pp512+tg128" or "tg32@pp128".
switch (test_kind) {
    case TEST_KIND_PP:
        snprintf(buf, sizeof(buf), "pp%d", n_prompt);
        break;
    case TEST_KIND_TG:
        snprintf(buf, sizeof(buf), "tg%d", n_gen);
        break;
    case TEST_KIND_PG:
        snprintf(buf, sizeof(buf), "pp%d+tg%d", n_prompt, n_gen);
        break;
    case TEST_KIND_GP:
        snprintf(buf, sizeof(buf), "tg%d@pp%d", n_gen, n_prompt);
        break;
    default:
        snprintf(buf, sizeof(buf), "unknown");
        break;
}
This formatting should only be applied to the markdown printer. The other printers are intended to be used programmatically, so it should be a simple enum that can be parsed easily, without the token counts. The token counts can be obtained from the n_prompt and n_gen parameters already.
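A minimal sketch of what that suggestion could look like (the function name and the plain string values are illustrative, not actual llama-bench code; only the TEST_KIND_* values come from the snippet above):

// Hypothetical helper returning a plain, easily parsed label for the
// programmatic printers (json, sql, ...), without token counts; those are
// already available through the n_prompt and n_gen fields of each result.
static const char * test_kind_str(int test_kind) {
    switch (test_kind) {
        case TEST_KIND_PP: return "pp";    // prompt processing only
        case TEST_KIND_TG: return "tg";    // token generation only
        case TEST_KIND_PG: return "pp+tg"; // prompt + generation, averaged
        case TEST_KIND_GP: return "tg@pp"; // generation rate at a given prompt length
        default:           return "unknown";
    }
}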
I needed a test that measures the token generation rate after processing a prompt of a given length, so I decided to add a new kind of test to the llama-bench tool.
This PR adds a -gp <pp,tg> option that allows specifying a prompt length and the number of tokens to generate after the prompt has been processed. The new test works almost the same way as the existing -pg test, but the prompt length and prompt processing time are not taken into account when calculating the result; only the token generation rate is reported. Test results are labeled differently to avoid confusion with -pg test results: I used the @ character to emphasize that the result indicates the token generation rate AT the given prompt length.
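For clarity, here is a rough sketch of what gets timed in a -gp run (process_prompt and generate_tokens are hypothetical stand-ins for the real benchmark phases, which drive llama_decode() on an actual model; this is not the actual llama-bench code):

#include <chrono>

// Hypothetical stubs standing in for the real benchmark phases.
static void process_prompt(int /*n_prompt*/) { /* evaluate the prompt as one batch */ }
static void generate_tokens(int /*n_gen*/)   { /* generate tokens one by one */ }

// -pg pp,tg times both phases and reports (pp + tg) / total_time.
// -gp pp,tg times only the generation phase and reports tg / gen_time.
static double bench_gp(int n_prompt, int n_gen) {
    process_prompt(n_prompt); // fills the KV cache; NOT included in the timing

    const auto t0 = std::chrono::steady_clock::now();
    generate_tokens(n_gen);   // only this phase is timed
    const auto t1 = std::chrono::steady_clock::now();

    return n_gen / std::chrono::duration<double>(t1 - t0).count(); // t/s at the given prompt length
}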
Example:
$ ./bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf -p 0 -n 0 -gp 128,32 -gp 256,32 -r 3
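With these options the results table contains rows labeled tg32@pp128 and tg32@pp256, and the reported t/s covers only the 32 generated tokens in each case.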
Hopefully this is more intuitive compared to the averaged prompt processing + token generation rate in the -pg test results.