tests : enhance llama-bench with separate timings (pp/gen t/s), added n_threads_batch #14219
base: master
Conversation
added gen t/s and pp t/s outputs and n-threads-batch to llama-bench
Pull Request Overview
This PR adds separate timing measurements for prompt processing and token generation in llama-bench and introduces a new command‑line argument (n_threads_batch) for batch thread specification.
- Added a new parameter “n_threads_batch” across configuration, parsing, test execution, and output formatting.
- Integrated separate metrics for prompt and generation timing (samples_prompt_ns, samples_gen_ns) and updated the markdown and SQL printers to display the new metrics.
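For context, here is a minimal sketch of the measurement split described above. It assumes the surrounding llama-bench.cpp context (get_time_ns, test_prompt, test_gen are existing helpers in that file; their signatures are paraphrased and the exact call sites may differ from the patch):

    // Sketch only: per repetition, record prompt and generation timings separately,
    // in addition to the combined total that llama-bench already stores.
    uint64_t t_start = get_time_ns();
    test_prompt(ctx, n_prompt, n_batch, n_threads);     // prompt-processing phase
    uint64_t t_prompt = get_time_ns();

    test_gen(ctx, n_gen, n_threads);                    // token-generation phase
    uint64_t t_end = get_time_ns();

    t.samples_prompt_ns.push_back(t_prompt - t_start);  // new in this PR
    t.samples_gen_ns.push_back(t_end - t_prompt);       // new in this PR
    t.samples_ns.push_back(t_end - t_start);            // existing combined metric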
Comments suppressed due to low confidence (2)
tools/llama-bench/llama-bench.cpp:1617
- [nitpick] Mapping the field 'n_threads_batch' to the alias 'th_batch' might be unclear; consider either using a more descriptive alias or adding an inline comment to explain the abbreviation.
if (field == "n_threads_batch") {
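One way to address this nitpick, sketched under the assumption that this branch sits in a field-to-column-header mapping used by the output printers (the comment text is illustrative):

    if (field == "n_threads_batch") {
        // abbreviated column header for the batch-thread count, kept short for the markdown table
        return "th_batch";
    }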
tools/llama-bench/llama-bench.cpp:2070
- [nitpick] Although separate timing measurements for prompt and generation are implemented, adding inline comments to explain the timing logic can improve clarity for future maintainers.
uint64_t t_start = get_time_ns() - t_start;
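An illustration of the kind of inline commentary being requested, using illustrative variable names rather than the patch's own:

    uint64_t t_start = get_time_ns();                // mark the start of this phase
    // ... run the prompt batch or generation loop ...
    uint64_t t_elapsed_ns = get_time_ns() - t_start; // elapsed time for this repetition's phase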
        if (samples_prompt_ns.empty() || n_prompt == 0) return {};
        std::vector<double> ts;
        std::transform(samples_prompt_ns.begin(), samples_prompt_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_prompt / t; });
        return ts;
    }

    std::vector<double> get_gen_ts() const {
        if (samples_gen_ns.empty() || n_gen == 0) return {};
        std::vector<double> ts;
        std::transform(samples_gen_ns.begin(), samples_gen_ns.end(), std::back_inserter(ts),
                       [this](uint64_t t) { return 1e9 * n_gen / t; });
The get_prompt_ts() and get_gen_ts() functions contain very similar code. Consider extracting a common helper function to reduce duplication.
Suggested change (replacing the duplicated bodies with a shared helper):

        return get_ts_helper(samples_prompt_ns, n_prompt);
    }

    std::vector<double> get_gen_ts() const {
        return get_ts_helper(samples_gen_ns, n_gen);
    }

private:
    std::vector<double> get_ts_helper(const std::vector<uint64_t> & samples, int n) const {
        if (samples.empty() || n == 0) return {};
        std::vector<double> ts;
        std::transform(samples.begin(), samples.end(), std::back_inserter(ts),
                       [n](uint64_t t) { return 1e9 * n / t; });
Optional, possibly for a follow-on PR.
n-threads-batch: I think we have two different schools of thought. I view llama-bench as a tool for getting detailed information to fine-tune the performance of a model for a certain system, multi-model server, workflow, etc. As such, additional parameters that can give the user insight into their performance are a value-add. If you are worried about this causing confusion, I can update the code to only show the column when the parameter was passed.

pp/gen t/s: similarly, this is an effective data point and (if such a thing exists) an industry standard for reviewing benchmark performance of new models, quants, etc. It is an incredibly valuable data point for assessing models and workflows. If you are that worried, I can add functionality to hide the extra token/s columns behind another parameter. I view these as elementary data points, though, and standard information provided by all backends when interfacing with them.
I am sorry, but I don't see the point of any of these changes. Users can test different thread numbers, and when running llama.cpp normally they can use the number of threads that performed the best for generation with --threads-batch.
--threads-batch is not a llama-bench parameter though
The use case may not appeal to you, but adding this allows users to benchmark a certain parameter - this is a benchmarking tool after all. If someone wants to explore the interrelation between threads-batch and other settings, have at it; these are parameters that work with llama-cli or llama-server, so why not be able to test them? The earlier output I showed does show minor differences in performance across various threads/threads-batch combinations. Regarding your last comment about pp/tg: how can one process the data differently if the default output formats (json, md, etc.) don't provide enough fidelity for a user to even estimate pp/tg t/s with a script? See the default output format - you can only infer pp t/s if n-gen is 0.
You may not see the utility in this functionality, but if you look at enough models online you see users routinely measuring benchmark performance in gen and pp t/s. Averaging these into one number limits one's ability to refine settings, model selection, etc. for a certain workflow or use case.
added gen t/s and pp t/s outputs to llama-bench
n-threads-batch args to llama-bench
Minor improvements to llama-bench
New Features
Example output: