llama-bench : add test measuring token generation rate at given prompt length #11126
Conversation
The other printers (sql, json, etc) would also need to be updated.
@slaren Can you be more specific?
The test type needs to be exported in these printers as well, since …
I guess another option is to add a "test" column in all printers with the same values as displayed in the default console output. Any specific reason it's not included there?
Yes, that's what I meant when I said that the test type would need to be exported in these printers. There isn't a test column/field at the moment because it is not necessary.
// Format the human-readable test label, e.g. "pp512", "tg128",
// "pp512+tg128" or "tg32@pp128".
switch (test_kind) {
    case TEST_KIND_PP:
        snprintf(buf, sizeof(buf), "pp%d", n_prompt);
        break;
    case TEST_KIND_TG:
        snprintf(buf, sizeof(buf), "tg%d", n_gen);
        break;
    case TEST_KIND_PG:
        snprintf(buf, sizeof(buf), "pp%d+tg%d", n_prompt, n_gen);
        break;
    case TEST_KIND_GP:
        snprintf(buf, sizeof(buf), "tg%d@pp%d", n_gen, n_prompt);
        break;
    default:
        snprintf(buf, sizeof(buf), "unknown");
        break;
}
This formatting should only be applied to the markdown printer. The other printers are intended to be used programmatically, so it should be a simple enum that can be parsed easily, without the token counts. The token counts can be obtained from the n_prompt and n_gen parameters already.
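A minimal sketch of what that suggestion could look like (the function name and the plain string values are illustrative, not actual llama-bench code; only the TEST_KIND_* values come from the snippet above):

// Hypothetical helper returning a plain, easily parsed label for the
// programmatic printers (json, sql, ...), without token counts; those are
// already available through the n_prompt and n_gen fields of each result.
static const char * test_kind_str(int test_kind) {
    switch (test_kind) {
        case TEST_KIND_PP: return "pp";    // prompt processing only
        case TEST_KIND_TG: return "tg";    // token generation only
        case TEST_KIND_PG: return "pp+tg"; // prompt + generation, averaged
        case TEST_KIND_GP: return "tg@pp"; // generation rate at a given prompt length
        default:           return "unknown";
    }
}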
I needed a test that measures the token generation rate after processing a prompt of a given length, so I decided to add a new kind of test to the llama-bench tool.
This PR adds a -gp <pp,tg> option that allows specifying a prompt length and the number of tokens to generate after the prompt has been processed. The new test works almost the same way as the existing -pg test, but the prompt length and prompt processing time are not taken into account when calculating the result; only the token generation rate is reported. Test results are labeled differently to avoid confusion with -pg test results: I used the @ character to emphasize that the result indicates the token generation rate AT the given prompt length.
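For clarity, here is a rough sketch of what gets timed in a -gp run (process_prompt and generate_tokens are hypothetical stand-ins for the real benchmark phases, which drive llama_decode() on an actual model; this is not the actual llama-bench code):

#include <chrono>

// Hypothetical stubs standing in for the real benchmark phases.
static void process_prompt(int /*n_prompt*/) { /* evaluate the prompt as one batch */ }
static void generate_tokens(int /*n_gen*/)   { /* generate tokens one by one */ }

// -pg pp,tg times both phases and reports (pp + tg) / total_time.
// -gp pp,tg times only the generation phase and reports tg / gen_time.
static double bench_gp(int n_prompt, int n_gen) {
    process_prompt(n_prompt); // fills the KV cache; NOT included in the timing

    const auto t0 = std::chrono::steady_clock::now();
    generate_tokens(n_gen);   // only this phase is timed
    const auto t1 = std::chrono::steady_clock::now();

    return n_gen / std::chrono::duration<double>(t1 - t0).count(); // t/s at the given prompt length
}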
Example:
$ ./bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf -p 0 -n 0 -gp 128,32 -gp 256,32 -r 3
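With these options the results table contains rows labeled tg32@pp128 and tg32@pp256, and the reported t/s covers only the 32 generated tokens in each case.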
Hopefully this is more intuitive compared to the averaged prompt processing + token generation rate in the -pg test results.