
[Bug]: inter-token latency is lower than TPOT in serving benchmark result #6531

@Jeffwan

Description

Your current environment

v0.5.2. The vLLM environment is not the issue, so I will skip the environment collection step.

🐛 Describe the bug

I am running benchmark tests and noticed a potential problem.

The inter-token latency (ITL) is lower than TPOT. Since inter-token latency takes TTFT into consideration, it should be higher than TPOT. However, the data shows the opposite. I have not looked at the code yet; I will try to figure this out.
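One way this inversion could arise (an assumption, not confirmed from vLLM's actual benchmark code) is a difference in aggregation: if mean TPOT averages one value per request while mean ITL averages one value per token gap, a few slow, short requests can pull mean TPOT up without contributing many gaps to the ITL average. A toy sketch under that assumption, with made-up request timings:

```python
# Toy sketch: mean TPOT averaged per request vs. mean ITL averaged per
# token gap. The field names and the aggregation scheme are assumptions
# for illustration, not vLLM's confirmed benchmark internals.

def mean(xs):
    return sum(xs) / len(xs)

# Request A: slow and short  (2 output tokens -> one 1000 ms gap).
# Request B: fast and long (101 output tokens -> one hundred 100 ms gaps).
requests = [
    {"ttft_s": 3.0, "e2e_s": 4.0,  "out_tokens": 2},    # decode time: 1 s
    {"ttft_s": 3.0, "e2e_s": 13.0, "out_tokens": 101},  # decode time: 10 s
]

# TPOT: one value per request, each weighted equally in the mean.
tpots = [
    (r["e2e_s"] - r["ttft_s"]) / (r["out_tokens"] - 1) * 1000
    for r in requests
]

# ITL: every inter-chunk gap from every request pooled into one list,
# so long requests contribute many more samples.
itls = []
for r in requests:
    n_gaps = r["out_tokens"] - 1
    gap_ms = (r["e2e_s"] - r["ttft_s"]) / n_gaps * 1000
    itls.extend([gap_ms] * n_gaps)

print(f"Mean TPOT: {mean(tpots):.1f} ms")  # 550.0 ms
print(f"Mean ITL:  {mean(itls):.1f} ms")   # ~108.9 ms, well below mean TPOT
```

With equal per-token speed within each request, pooling gaps makes the long fast request dominate mean ITL, reproducing the "ITL lower than TPOT" pattern from the table below.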

root@fb5250e2ae4c:/workspace# python3 vllm/benchmarks/benchmark_serving.py \
    --backend vllm \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --model meta-llama/Llama-2-7b-chat-hf \
    --num-prompts 200 \
    --endpoint /v1/completions \
    --tokenizer meta-llama/Llama-2-7b-chat-hf \
    --save-result \
    2>&1 | tee benchmark_serving.txt
Namespace(backend='vllm', base_url=None, host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', model='meta-llama/Llama-2-7b-chat-hf', tokenizer='meta-llama/Llama-2-7b-chat-hf', best_of=1, use_beam_search=False, num_prompts=200, sharegpt_output_len=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, seed=0, trust_remote_code=False, disable_tqdm=False, save_result=True, metadata=None, result_dir=None, result_filename=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [01:12<00:00,  2.74it/s]
============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  72.96     
Total input tokens:                      49490     
Total generated tokens:                  41078     
Request throughput (req/s):              2.74      
Input token throughput (tok/s):          678.34    
Output token throughput (tok/s):         563.04    
---------------Time to First Token----------------
Mean TTFT (ms):                          3594.18   
Median TTFT (ms):                        3685.95   
P99 TTFT (ms):                           7361.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          186.90    
Median TPOT (ms):                        121.63    
P99 TPOT (ms):                           966.47    
---------------Inter-token Latency----------------
Mean ITL (ms):                           121.20    
Median ITL (ms):                         92.91     
P99 ITL (ms):                            310.89    
==================================================
