diff --git a/benchmarks/README.md b/benchmarks/README.md index d8b98db499..9f277a3627 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -1,41 +1,44 @@ # Introduction -This document outlines the benchmarking process for vllm-ascend, designed to evaluate its performance under various workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.To maintain consistency with the vllm community, we have reused the vllm community [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script. +This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project. + # Overview **Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas800I A2 (see [quick_start](../docs/source/quick_start.md) to learn more supported devices list), with different models(coming soon). - Latency tests - Input length: 32 tokens. - Output length: 128 tokens. - Batch size: fixed (8). - - Models: llama-3.1 8B. + - Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct. - Evaluation metrics: end-to-end latency (mean, median, p99). - Throughput tests - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed). - Output length: the corresponding output length of these 200 prompts. - Batch size: dynamically determined by vllm to achieve maximum throughput. - - Models: llama-3.1 8B . + - Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct. - Evaluation metrics: throughput. - Serving tests - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed). - Output length: the corresponding output length of these 200 prompts. - Batch size: dynamically determined by vllm and the arrival pattern of the requests. - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed). - - Models: llama-3.1 8B. + - Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct. - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99). -**Benchmarking Duration**: about 800senond for single model. +**Benchmarking Duration**: about 800 seconds per model. # Quick Use ## Prerequisites Before running the benchmarks, ensure the following: + - vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices. + - Install necessary dependencies for benchmarks: ``` pip install -r benchmarks/requirements-bench.txt ``` -- Models and datasets are cached locally to accelerate execution. Modify the paths in the JSON files located in benchmarks/tests accordingly. feel free to add your own models and parameters in the JSON to run your customized benchmarks.
+- For performance benchmarking, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights based on the given model without downloading the weights from the internet, which can greatly reduce the benchmark time. Feel free to add your own models and parameters in the JSON files to run your customized benchmarks. ## Run benchmarks The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run command in the vllm-ascend root directory: @@ -44,11 +47,19 @@ bash benchmarks/scripts/run-performance-benchmarks.sh ``` Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following: ``` -|-- latency_llama8B_tp1.json -|-- serving_llama8B_tp1_sharegpt_qps_1.json -|-- serving_llama8B_tp1_sharegpt_qps_16.json -|-- serving_llama8B_tp1_sharegpt_qps_4.json -|-- serving_llama8B_tp1_sharegpt_qps_inf.json -|-- throughput_llama8B_tp1.json +. +|-- serving_qwen2_5_7B_tp1_qps_1.json +|-- serving_qwen2_5_7B_tp1_qps_16.json +|-- serving_qwen2_5_7B_tp1_qps_4.json +|-- serving_qwen2_5_7B_tp1_qps_inf.json +|-- serving_qwen2_5vl_7B_tp1_qps_1.json +|-- serving_qwen2_5vl_7B_tp1_qps_16.json +|-- serving_qwen2_5vl_7B_tp1_qps_4.json `-- serving_qwen2_5vl_7B_tp1_qps_inf.json ``` These files contain detailed benchmarking results for further analysis. + +To view the results more intuitively, you can use the [script](./scripts/convert_json_to_markdown.py) to convert these JSON files to markdown: +```bash +python benchmarks/scripts/convert_json_to_markdown.py +``` diff --git a/benchmarks/scripts/convert_json_to_markdown.py b/benchmarks/scripts/convert_json_to_markdown.py new file mode 100644 index 0000000000..7a1c5d9968 --- /dev/null +++ b/benchmarks/scripts/convert_json_to_markdown.py @@ -0,0 +1,183 @@ +import argparse +import json +import os +from pathlib import Path + +import pandas as pd +from tabulate import tabulate + +CUR_PATH = Path(__file__).parent.resolve() +# latency results and the keys that will be printed into markdown +latency_results = [] +latency_column_mapping = { + "test_name": "Test name", + "avg_latency": "Mean latency (ms)", + "P50": "Median latency (ms)", + "P99": "P99 latency (ms)", +} + +# throughput tests and the keys that will be printed into markdown +throughput_results = [] +throughput_results_column_mapping = { + "test_name": "Test name", + "num_requests": "Num of reqs", + "total_num_tokens": "Total num of tokens", + "elapsed_time": "Elapsed time (s)", + "requests_per_second": "Tput (req/s)", + "tokens_per_second": "Tput (tok/s)", +} + +# serving results and the keys that will be printed into markdown +serving_results = [] +serving_column_mapping = { + "test_name": "Test name", + "request_rate": "Request rate (req/s)", + "request_throughput": "Tput (req/s)", + "output_throughput": "Output Tput (tok/s)", + "median_ttft_ms": "TTFT (ms)", + "median_tpot_ms": "TPOT (ms)", + "median_itl_ms": "ITL (ms)", +} + + +def read_markdown(file): + if os.path.exists(file): + with open(file) as f: + return f.read() + "\n" + else: + return f"{file} not found.\n" + + +def results_to_json(latency, throughput, serving): + return json.dumps({ + 'latency': latency.to_dict(), + 'throughput': throughput.to_dict(), + 'serving': serving.to_dict() + }) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Process the results
of the benchmark tests.") + parser.add_argument( + "--results_folder", + type=str, + default="../results/", + help="The folder where the benchmark results are stored.") + parser.add_argument( + "--output_folder", + type=str, + default="../results/", + help="The folder where the benchmark results are stored.") + parser.add_argument("--markdown_template", + type=str, + default="./perf_result_template.md", + help="The template file for the markdown report.") + parser.add_argument("--tag", + default="main", + help="Tag to be used for release message.") + parser.add_argument("--commit_id", + default="", + help="Commit ID to be used for release message.") + + args = parser.parse_args() + results_folder = (CUR_PATH / args.results_folder).resolve() + output_folder = (CUR_PATH / args.output_folder).resolve() + markdown_template = (CUR_PATH / args.markdown_template).resolve() + + # collect results + for test_file in results_folder.glob("*.json"): + + with open(test_file) as f: + raw_result = json.loads(f.read()) + + if "serving" in str(test_file): + # this result is generated via `benchmark_serving.py` + + # update the test name of this result + raw_result.update({"test_name": test_file.stem}) + + # add the result to raw_result + serving_results.append(raw_result) + continue + + elif "latency" in f.name: + # this result is generated via `benchmark_latency.py` + + # update the test name of this result + raw_result.update({"test_name": test_file.stem}) + + # get different percentiles + for perc in [10, 25, 50, 75, 90, 99]: + # Multiply 1000 to convert the time unit from s to ms + raw_result.update( + {f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]}) + raw_result["avg_latency"] = raw_result["avg_latency"] * 1000 + + # add the result to raw_result + latency_results.append(raw_result) + continue + + elif "throughput" in f.name: + # this result is generated via `benchmark_throughput.py` + + # update the test name of this result + raw_result.update({"test_name": test_file.stem}) + + # add the result to raw_result + throughput_results.append(raw_result) + continue + + print(f"Skipping {test_file}") + serving_results.sort(key=lambda x: (len(x['test_name']), x['test_name'])) + + latency_results = pd.DataFrame.from_dict(latency_results) + serving_results = pd.DataFrame.from_dict(serving_results) + throughput_results = pd.DataFrame.from_dict(throughput_results) + + raw_results_json = results_to_json(latency_results, throughput_results, + serving_results) + + # remapping the key, for visualization purpose + if not latency_results.empty: + latency_results = latency_results[list( + latency_column_mapping.keys())].rename( + columns=latency_column_mapping) + if not serving_results.empty: + serving_results = serving_results[list( + serving_column_mapping.keys())].rename( + columns=serving_column_mapping) + if not throughput_results.empty: + throughput_results = throughput_results[list( + throughput_results_column_mapping.keys())].rename( + columns=throughput_results_column_mapping) + + processed_results_json = results_to_json(latency_results, + throughput_results, + serving_results) + + # get markdown tables + latency_md_table = tabulate(latency_results, + headers='keys', + tablefmt='pipe', + showindex=False) + serving_md_table = tabulate(serving_results, + headers='keys', + tablefmt='pipe', + showindex=False) + throughput_md_table = tabulate(throughput_results, + headers='keys', + tablefmt='pipe', + showindex=False) + + # document the result + print(output_folder) + with open(output_folder / 
"benchmark_results.md", "w") as f: + + results = read_markdown(markdown_template) + results = results.format( + latency_tests_markdown_table=latency_md_table, + throughput_tests_markdown_table=throughput_md_table, + serving_tests_markdown_table=serving_md_table, + benchmarking_results_in_json_string=processed_results_json) + f.write(results) diff --git a/benchmarks/scripts/perf_result_template.md b/benchmarks/scripts/perf_result_template.md new file mode 100644 index 0000000000..b61f82cd9e --- /dev/null +++ b/benchmarks/scripts/perf_result_template.md @@ -0,0 +1,31 @@ +## Online serving tests + +- Input length: randomly sample 200 prompts from [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main)(multi-modal) dataset (with fixed random seed). +- Output length: the corresponding output length of these 200 prompts. +- Batch size: dynamically determined by vllm and the arrival pattern of the requests. +- **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed). +- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct +- Evaluation metrics: throughput, TTFT (median time to the first token ), ITL (median inter-token latency) TPOT(median time per output token). + +{serving_tests_markdown_table} + +## Offline tests +### Latency tests + +- Input length: 32 tokens. +- Output length: 128 tokens. +- Batch size: fixed (8). +- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct +- Evaluation metrics: end-to-end latency. + +{latency_tests_markdown_table} + +### Throughput tests + +- Input length: randomly sample 200 prompts from [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split.json) and [lmarena-ai/vision-arena-bench-v0.1](https://huggingface.co/datasets/lmarena-ai/vision-arena-bench-v0.1/tree/main)(multi-modal) dataset (with fixed random seed). +- Output length: the corresponding output length of these 200 prompts. +- Batch size: dynamically determined by vllm to achieve maximum throughput. +- Models: Qwen/Qwen2.5-7B-Instruct, Qwen/Qwen2.5-VL-7B-Instruct +- Evaluation metrics: throughput. 
+ +{throughput_tests_markdown_table} \ No newline at end of file diff --git a/benchmarks/scripts/run-performance-benchmarks.sh b/benchmarks/scripts/run-performance-benchmarks.sh index b0e3121319..49d87a1412 100644 --- a/benchmarks/scripts/run-performance-benchmarks.sh +++ b/benchmarks/scripts/run-performance-benchmarks.sh @@ -48,9 +48,10 @@ wait_for_server() { # wait for vllm server to start # return 1 if vllm server crashes timeout 1200 bash -c ' - until curl -X POST localhost:8000/v1/completions; do + until curl -s -X POST localhost:8000/v1/completions || curl -s -X POST localhost:8000/v1/chat/completions; do sleep 1 done' && return 0 || return 1 + } get_cur_npu_id() { @@ -241,11 +242,13 @@ run_serving_tests() { cleanup() { rm -rf ./vllm_benchmarks } - get_benchmarks_scripts() { - git clone -b main --depth=1 git@github.com:vllm-project/vllm.git && \ - mv vllm/benchmarks vllm_benchmarks - rm -rf ./vllm + git clone --depth=1 --filter=blob:none --sparse https://github.com/vllm-project/vllm || return 1 + cd vllm || return 1 + git sparse-checkout set benchmarks || return 1 + mv benchmarks ../vllm_benchmarks || return 1 + cd .. || return 1 + rm -rf vllm || return 1 } main() { @@ -287,7 +290,6 @@ main() { END_TIME=$(date +%s) ELAPSED_TIME=$((END_TIME - START_TIME)) echo "Total execution time: $ELAPSED_TIME seconds" - } main "$@" diff --git a/benchmarks/tests/latency-tests.json b/benchmarks/tests/latency-tests.json index a9b951f001..591b63cf34 100644 --- a/benchmarks/tests/latency-tests.json +++ b/benchmarks/tests/latency-tests.json @@ -1,10 +1,22 @@ [ { - "test_name": "latency_llama8B_tp1", + "test_name": "latency_qwen2_5_7B_tp1", "parameters": { - "model": "LLM-Research/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen2.5-7B-Instruct", "tensor_parallel_size": 1, "load_format": "dummy", + "max_model_len": 16384, + "num_iters_warmup": 5, + "num_iters": 15 + } + }, + { + "test_name": "latency_qwen2_5vl_7B_tp1", + "parameters": { + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "tensor_parallel_size": 1, + "load_format": "dummy", + "max_model_len": 16384, "num_iters_warmup": 5, "num_iters": 15 } diff --git a/benchmarks/tests/serving-tests.json b/benchmarks/tests/serving-tests.json index fe200b4ea2..9e54f4db9c 100644 --- a/benchmarks/tests/serving-tests.json +++ b/benchmarks/tests/serving-tests.json @@ -1,6 +1,6 @@ [ { - "test_name": "serving_llama8B_tp1", + "test_name": "serving_qwen2_5vl_7B_tp1", "qps_list": [ 1, 4, @@ -8,7 +8,34 @@ "inf" ], "server_parameters": { - "model": "LLM-Research/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "tensor_parallel_size": 1, + "swap_space": 16, + "disable_log_stats": "", + "disable_log_requests": "", + "trust_remote_code": "", + "max_model_len": 16384 + }, + "client_parameters": { + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "backend": "openai-chat", + "dataset_name": "hf", + "hf_split": "train", + "endpoint": "/v1/chat/completions", + "dataset_path": "lmarena-ai/vision-arena-bench-v0.1", + "num_prompts": 200 + } + }, + { + "test_name": "serving_qwen2_5_7B_tp1", + "qps_list": [ + 1, + 4, + 16, + "inf" + ], + "server_parameters": { + "model": "Qwen/Qwen2.5-7B-Instruct", "tensor_parallel_size": 1, "swap_space": 16, "disable_log_stats": "", @@ -16,7 +43,7 @@ "load_format": "dummy" }, "client_parameters": { - "model": "LLM-Research/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen2.5-7B-Instruct", "backend": "vllm", "dataset_name": "sharegpt", "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", diff --git 
a/benchmarks/tests/throughput-tests.json b/benchmarks/tests/throughput-tests.json index 41f8ab2258..d5588896c4 100644 --- a/benchmarks/tests/throughput-tests.json +++ b/benchmarks/tests/throughput-tests.json @@ -1,14 +1,29 @@ [ { - "test_name": "throughput_llama8B_tp1", + "test_name": "throughput_qwen2_5_7B_tp1", "parameters": { - "model": "LLM-Research/Meta-Llama-3.1-8B-Instruct", + "model": "Qwen/Qwen2.5-7B-Instruct", "tensor_parallel_size": 1, "load_format": "dummy", "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json", "num_prompts": 200, "backend": "vllm" } + }, + { + "test_name": "throughput_qwen2_5vl_7B_tp1", + "parameters": { + "model": "Qwen/Qwen2.5-VL-7B-Instruct", + "tensor_parallel_size": 1, + "load_format": "dummy", + "backend": "openai-chat", + "dataset_name": "hf", + "hf_split": "train", + "max_model_len": 16384, + "endpoint": "/v1/chat/completions", + "dataset_path": "lmarena-ai/vision-arena-bench-v0.1", + "num_prompts": 200 + } } ] diff --git a/docs/source/developer_guide/evaluation/index.md b/docs/source/developer_guide/evaluation/index.md index 0ebba61fa2..caca8694e6 100644 --- a/docs/source/developer_guide/evaluation/index.md +++ b/docs/source/developer_guide/evaluation/index.md @@ -7,3 +7,9 @@ using_opencompass using_lm_eval accuracy_report/index ::: + +:::{toctree} +:caption: Performance +:maxdepth: 1 +performance_benchmark +::: \ No newline at end of file diff --git a/docs/source/developer_guide/evaluation/performance_benchmark.md b/docs/source/developer_guide/evaluation/performance_benchmark.md new file mode 100644 index 0000000000..91a013c0a0 --- /dev/null +++ b/docs/source/developer_guide/evaluation/performance_benchmark.md @@ -0,0 +1,180 @@ +# Performance Benchmark +This document details the benchmark methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project. + +**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as fixed-QPS online serving. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks). + +## 1. Run docker container +```{code-block} bash + :substitutions: +# Update DEVICE according to your device (/dev/davinci[0-7]) +export DEVICE=/dev/davinci7 +export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| +docker run --rm \ +--name vllm-ascend \ +--device $DEVICE \ +--device /dev/davinci_manager \ +--device /dev/devmm_svm \ +--device /dev/hisi_hdc \ +-v /usr/local/dcmi:/usr/local/dcmi \ +-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ +-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ +-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ +-v /etc/ascend_install.info:/etc/ascend_install.info \ +-v /root/.cache:/root/.cache \ +-e VLLM_USE_MODELSCOPE=True \ +-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \ +-it $IMAGE \ +/bin/bash +``` + +## 2. Install dependencies +```bash +cd /workspace/vllm-ascend +pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple +pip install -r benchmarks/requirements-bench.txt +``` + +## 3.
(Optional) Prepare model weights +For faster execution, we recommend downloading the model in advance: +```bash +modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct +``` +For faster, lighter testing, it is recommended to set the parameter `load-format` to `dummy`, +so that random weights will be constructed based on the given model structure, which avoids +the time spent downloading the model from the Internet. + +You can also replace the model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks/tests) files with your local paths and adjust the other parameters as needed: +```json +[ + { + "test_name": "latency_llama8B_tp1", + "parameters": { + "model": "/path/to/model", + "tensor_parallel_size": 1, + "load_format": "dummy", + "num_iters_warmup": 5, + "num_iters": 15 + } + } +] +``` + +## 4. Run benchmark script +Run the benchmark script: +```bash +bash benchmarks/scripts/run-performance-benchmarks.sh +``` + +After about 10 minutes, the output is as shown below: +```bash +online serving: +qps 1: +============ Serving Benchmark Result ============ +Successful requests: 200 +Benchmark duration (s): 212.77 +Total input tokens: 42659 +Total generated tokens: 43545 +Request throughput (req/s): 0.94 +Output token throughput (tok/s): 204.66 +Total Token throughput (tok/s): 405.16 +---------------Time to First Token---------------- +Mean TTFT (ms): 104.14 +Median TTFT (ms): 102.22 +P99 TTFT (ms): 153.82 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 38.78 +Median TPOT (ms): 38.70 +P99 TPOT (ms): 48.03 +---------------Inter-token Latency---------------- +Mean ITL (ms): 38.46 +Median ITL (ms): 36.96 +P99 ITL (ms): 75.03 +================================================== + +qps 4: +============ Serving Benchmark Result ============ +Successful requests: 200 +Benchmark duration (s): 72.55 +Total input tokens: 42659 +Total generated tokens: 43545 +Request throughput (req/s): 2.76 +Output token throughput (tok/s): 600.24 +Total Token throughput (tok/s): 1188.27 +---------------Time to First Token---------------- +Mean TTFT (ms): 115.62 +Median TTFT (ms): 109.39 +P99 TTFT (ms): 169.03 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 51.48 +Median TPOT (ms): 52.40 +P99 TPOT (ms): 69.41 +---------------Inter-token Latency---------------- +Mean ITL (ms): 50.47 +Median ITL (ms): 43.95 +P99 ITL (ms): 130.29 +================================================== + +qps 16: +============ Serving Benchmark Result ============ +Successful requests: 200 +Benchmark duration (s): 47.82 +Total input tokens: 42659 +Total generated tokens: 43545 +Request throughput (req/s): 4.18 +Output token throughput (tok/s): 910.62 +Total Token throughput (tok/s): 1802.70 +---------------Time to First Token---------------- +Mean TTFT (ms): 128.50 +Median TTFT (ms): 128.36 +P99 TTFT (ms): 187.87 +-----Time per Output Token (excl.
1st token)------ +Mean TPOT (ms): 83.60 +Median TPOT (ms): 77.85 +P99 TPOT (ms): 165.90 +---------------Inter-token Latency---------------- +Mean ITL (ms): 65.72 +Median ITL (ms): 54.84 +P99 ITL (ms): 289.63 +================================================== + +qps inf: +============ Serving Benchmark Result ============ +Successful requests: 200 +Benchmark duration (s): 41.26 +Total input tokens: 42659 +Total generated tokens: 43545 +Request throughput (req/s): 4.85 +Output token throughput (tok/s): 1055.44 +Total Token throughput (tok/s): 2089.40 +---------------Time to First Token---------------- +Mean TTFT (ms): 3394.37 +Median TTFT (ms): 3359.93 +P99 TTFT (ms): 3540.93 +-----Time per Output Token (excl. 1st token)------ +Mean TPOT (ms): 66.28 +Median TPOT (ms): 64.19 +P99 TPOT (ms): 97.66 +---------------Inter-token Latency---------------- +Mean ITL (ms): 56.62 +Median ITL (ms): 55.69 +P99 ITL (ms): 82.90 +================================================== + +offline: +latency: +Avg latency: 4.944929537673791 seconds +10% percentile latency: 4.894104263186454 seconds +25% percentile latency: 4.909652255475521 seconds +50% percentile latency: 4.932477846741676 seconds +75% percentile latency: 4.9608619548380375 seconds +90% percentile latency: 5.035418218374252 seconds +99% percentile latency: 5.052476694583893 seconds + +throughput: +Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s +Total num prompt tokens: 42659 +Total num output tokens: 43545 +``` +The result JSON files are generated in the default path `benchmarks/results`. +These files contain detailed benchmarking results for further analysis. + \ No newline at end of file
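For a quick look at the serving numbers without generating the full markdown report, the result files can also be inspected directly. The snippet below is only an illustrative sketch: it assumes the default `benchmarks/results` output folder and the serving-result keys that `benchmarks/scripts/convert_json_to_markdown.py` maps into its tables (`request_rate`, `request_throughput`, `output_throughput`, `median_ttft_ms`, `median_tpot_ms`, `median_itl_ms`); adjust the path and key names if your setup differs.

```python
# Minimal sketch: print a one-line summary per serving result file.
# Assumes the default benchmarks/results folder and the key names used by
# benchmarks/scripts/convert_json_to_markdown.py.
import json
from pathlib import Path

results_dir = Path("benchmarks/results")

for path in sorted(results_dir.glob("serving_*.json")):
    data = json.loads(path.read_text())
    print(f"{path.stem}: "
          f"request rate={data['request_rate']}, "
          f"request throughput={data['request_throughput']:.2f} req/s, "
          f"output throughput={data['output_throughput']:.2f} tok/s, "
          f"median TTFT={data['median_ttft_ms']:.2f} ms, "
          f"median TPOT={data['median_tpot_ms']:.2f} ms, "
          f"median ITL={data['median_itl_ms']:.2f} ms")
```

For the full report, `python benchmarks/scripts/convert_json_to_markdown.py` (described in the benchmarks README above) renders all collected results into a single markdown file based on `perf_result_template.md`.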