Closed as not planned
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
Description
Your current environment
v0.5.2. The vLLM environment is not relevant to this issue, so I will skip the environment collection step.
🐛 Describe the bug
I am running benchmark tests and noticed a potential problem: the reported inter-token latency (ITL) is lower than the time per output token (TPOT). Since ITL takes TTFT into consideration, it should be higher than TPOT, but the data shows the opposite. I have not looked at the code yet; I will try to figure this out.
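The expected relationship can be sketched as follows. This is a minimal illustration of the definitions as I understand them, not vLLM's actual implementation: TPOT averages the decode gaps excluding the first token, while ITL (under the definition where the gap before the first token counts) should come out at least as large as TPOT whenever TTFT exceeds the average decode gap. The function name and structure below are illustrative only.

```python
def compute_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and ITL from absolute token arrival timestamps.

    token_times: arrival time of each generated token, in seconds.
    """
    ttft = token_times[0] - request_start  # time to first token
    n = len(token_times)
    # TPOT excludes the first token: average gap across the remaining tokens.
    tpot = (token_times[-1] - token_times[0]) / (n - 1)
    # ITL here counts every inter-arrival gap, including the gap before the
    # first token (i.e. TTFT). Under this definition, mean ITL >= TPOT
    # whenever TTFT is larger than the average decode gap.
    gaps = [ttft] + [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    return {"ttft": ttft, "tpot": tpot, "itl": itl}
```

With a TTFT of 1.0 s and steady 0.1 s decode gaps, this definition gives a mean ITL well above TPOT, which is why the benchmark output below (mean ITL 121.20 ms < mean TPOT 186.90 ms) looks inconsistent to me.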
root@fb5250e2ae4c:/workspace# python3 vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model meta-llama/Llama-2-7b-chat-hf --num-prompts 200 --endpoint /v1/completions --tokenizer meta-llama/Llama-2-7b-chat-hf --save-result 2>&1 | tee benchmark_serving.txt
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', best_of=1, use_beam_search=False, num_prompts=200, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:12<00:00, 2.74it/s]
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.96
Total input tokens: 49490
Total generated tokens: 41078
Request throughput (req/s): 2.74
Input token throughput (tok/s): 678.34
Output token throughput (tok/s): 563.04
---------------Time to First Token----------------
Mean TTFT (ms): 3594.18
Median TTFT (ms): 3685.95
P99 TTFT (ms): 7361.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 186.90
Median TPOT (ms): 121.63
P99 TPOT (ms): 966.47
---------------Inter-token Latency----------------
Mean ITL (ms): 121.20
Median ITL (ms): 92.91
P99 ITL (ms): 310.89
==================================================