
Flexible Inference Benchmarker

A modular, extensible LLM inference benchmarking framework that supports multiple benchmarking frameworks and paradigms.

This benchmarking framework operates entirely externally to any serving framework and can easily be extended and modified. It is intended to be fully featured, providing a variety of statistics and profiling modes.

Installation

git clone https://github.com/CentML/flexible-inference-bench.git
cd flexible-inference-bench
pip install .

Usage

After installing with the above instructions, the benchmarker can be invoked with fib benchmark [options].

After benchmarking, the results are saved to output-file.json (or to the path given by --output-file) and can be postprocessed with fib analyse <file>, fib generate-ttft-plot [options], or fib generate-itl-plot [options].
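
For example, a typical end-to-end workflow might look like the sketch below (the model name and file names are placeholders, and a server is assumed to already be running for the chosen backend):

# run the benchmark and save the raw results
fib benchmark --backend vllm --model gpt2 -n 100 --output-file results.json
# summarize the run
fib analyse results.json
# plot time to first token and inter-token latencies
fib generate-ttft-plot --files results.json
fib generate-itl-plot --datapath results.json --request-num 0 --output itl.png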

Parameters for fib benchmark

argument description
--seed Seed for reproducibility.
--backend (-b) Backend to use. Options: tgi, vllm, cserve, cserve-debug, lmdeploy, deepspeed-mii, openai, openai-chat, tensorrt-llm. For tensorrt-llm, temperature is set to 0.01, since NGC containers >= 24.06 do not support 0.0.
--base-url Server or API base url, without endpoint.
--endpoint API endpoint path.
--num-of-req (-n) Total number of requests to send. Use either this or --max-time-for-reqs.
--max-time-for-reqs (--timeout) Time window for sending requests, in seconds. Use either this or --num-of-req.
--request-distribution Distribution of request arrival times. Options: poisson rate, uniform min_val max_val, normal mean std. For example, exponential 5 sends requests whose gaps follow an exponential distribution with an average of 5 seconds between requests (see the example command after this table).
--request-rate (-rps) Sets the request distribution to poisson N, such that approximately N requests are sent per second.
--input-token-distribution Distribution of prompt token lengths, e.g. uniform min_val max_val or normal mean std.
--output-token-distribution Distribution of output token lengths, e.g. uniform min_val max_val or normal mean std.
--workload (-w) One of a few presets that define the input and output token distributions for common use cases.
--prefix-text Text to use as a prefix for all requests. Use either this or --prefix-len.
--prefix-len Length of the prefix to use for all requests. Use either this or --prefix-text. If neither is provided, no prefix is used.
--dataset-name (--dataset) Name of the dataset to benchmark on: one of sharegpt, other, random.
--dataset-path Path to the dataset. If the dataset is sharegpt and no path is provided, it is downloaded and cached automatically. Otherwise, the dataset name defaults to other.
--model (-m) Name of the model.
--tokenizer Name or path of the tokenizer, if not using the default tokenizer.
--disable-tqdm Disable the tqdm progress bar.
--best-of Number of best completions to return.
--use-beam-search Use beam search for completions.
--output-file Output json file to save the results.
--debug Log debug messages.
--verbose Summarize each request.
--disable-ignore-eos Disable the default ignore-EOS behavior, so generation stops at the end-of-sequence token. Note: not a valid argument for TensorRT-LLM.
--disable-stream Send requests with stream: false (for APIs without a streaming option).
--cookies Include cookies in the request.
--config-file Path to configuration file.
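
For instance, a run that pins the traffic pattern and token-length distributions could look like the following sketch (the model name, request count, and distribution parameters are illustrative, and a server for the chosen backend is assumed to be running already):

# illustrative values only; the flags are described in the table above
fib benchmark --backend vllm --model gpt2 -n 200 \
  --request-distribution poisson 2 \
  --input-token-distribution uniform 64 256 \
  --output-token-distribution normal 128 32 \
  --output-file results.json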

In addition to providing these arguments on the command line, you can use --config-file to pre-define the parameters for your use case. Examples are provided in examples/.
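
A minimal sketch of a custom config file follows; the key names and value formats are assumptions rather than a documented schema, so treat the files in examples/ as authoritative:

# Hypothetical config: key names and value formats are assumptions; see examples/ for real configs.
cat > my_config.json <<'EOF'
{
  "backend": "vllm",
  "model": "gpt2",
  "num_of_req": 200,
  "output_file": "results.json"
}
EOF
fib benchmark --config-file my_config.json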

Output

In addition to printing the analysis results (which can be reproduced using fib analyse), the following output artifact is generated:

The output json file contains metadata and a list of request input and output descriptions:

  • backend: Backend used
  • time: Total time
  • outputs:
    • text: Generated text
    • success: Whether the request was successful
    • latency: End-to-end time for the request
    • ttft: Time to first token
    • itl: Inter-token latency
    • prompt_len: Length of the prompt
    • error: Error message
  • inputs: List of [prompt string, input tokens, expected output tokens]
  • tokenizer: Tokenizer name
  • stream: Indicates if we used the stream argument or not
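
As a quick check, these fields can be queried straight from the file, for example with jq (the exact JSON layout is assumed from the field list above, and results.json is a placeholder name):

# mean TTFT across successful requests; layout assumed from the field list above
jq '[.outputs[] | select(.success) | .ttft] | add / length' results.json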

Data Postprocessors

Below is a description of the data postprocessors.

fib analyse <path_to_file>

Prints the following summary for a given run, in the same format as vLLM's benchmarking script.

============ Serving Benchmark Result ============
Successful requests:                     20
Benchmark duration (s):                  19.39
Total input tokens:                      407
Total generated tokens:                  5112
Request throughput (req/s):              1.03
Input token throughput (tok/s):          20.99
Output token throughput (tok/s):         263.66
---------------Time to First Token----------------
Mean TTFT (ms):                          24.66
Median TTFT (ms):                        24.64
P99 TTFT (ms):                           34.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2295.86
Median TPOT (ms):                        2362.54
P99 TPOT (ms):                           2750.76
==================================================

fib generate-itl-plot

Returns a plot of inter-token latencies for a specific request. Takes the following args:

argument description
--datapath Path to the output JSON file produced by the benchmark.
--output Path to save the figure to (any format supported by matplotlib).
--request-num Which request to produce the ITL plot for.
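
For example (the file names and request index are placeholders):

fib generate-itl-plot --datapath results.json --request-num 0 --output itl.png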

fib generate-ttft-plot

Generates a simple CDF plot of time to first token across requests. You can pass a single file or a list of files generated by the benchmark to make a comparison.

argument description
--files File(s) to generate the plot from.
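
For example, to compare two runs (the file names are placeholders):

fib generate-ttft-plot --files vllm-run.json cserve-run.json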

Example

Let's take vllm as the backend for our benchmark. You can install vllm with the command:
pip install vllm

We will use gpt2 as the model:
vllm serve gpt2

And now we can run the benchmark in the CLI:

fib benchmark -n 500 -rps inf -w summary

Alternatively, we can go to the examples folder and run the inference benchmark using a config file:

cd examples
fib benchmark --config-file summary_throughput.json --output-file vllm-benchmark.json
============ Serving Benchmark Result ============
Successful requests:                     497      
Benchmark duration (s):                  5.09     
Total input tokens:                      58605    
Total generated tokens:                  126519   
Request throughput (req/s):              97.66    
Input token throughput (tok/s):          11516.12 
Output token throughput (tok/s):         24861.49 
---------------Time to First Token----------------
Mean TTFT (ms):                          1508.38  
Median TTFT (ms):                        372.63   
P99 TTFT (ms):                           2858.80  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.34     
Median TPOT (ms):                        9.39     
P99 TPOT (ms):                           10.23    
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.35     
Median ITL (ms):                         8.00     
P99 ITL (ms):                            89.88    
==================================================
