########## A sample generated by llmperf
{
"error_code": null,
"error_msg": "",
"inter_token_latency_s": 0.022179236340013328, # measured by the model's tokenizer and including TTFT
"ttft_s": 0.04729403999999704,
"end_to_end_latency_s": 3.327015427999868,
"request_output_throughput_token_per_s": 51.99855658739881, # measured by the llama tokenizer and including TTFT
"number_total_tokens": 723, # measured by the llama tokenizer
"number_output_tokens": 173, # measured by the llama tokenizer
"number_input_tokens": 550 # measured by the llama tokenizer
}
number_output_tokens (173) / end_to_end_latency_s (3.327015427999868) = request_output_throughput_token_per_s (51.99855658739881)
number_total_tokens = number_input_tokens + number_output_tokens
Why is there such a significant gap?
inter_token_latency_s x (number_output_tokens - 1) = 3.81
ttft_s + inter_token_latency_s x (number_output_tokens - 1) = 3.86
Both reconstructions are well above the measured end_to_end_latency_s = 3.327.
The mismatch comes from the token counts: inter_token_latency_s is derived from the model's own token count (and includes TTFT), while number_output_tokens (173) is re-counted with the llama tokenizer, as worked out in the original_inter_token_latency_s section below.
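These relationships and the gap can be checked with plain arithmetic on the sample record; the snippet below is only a sanity-check script over the copied values, not llmperf code.

# Values copied from the llmperf sample above.
inter_token_latency_s = 0.022179236340013328
ttft_s = 0.04729403999999704
end_to_end_latency_s = 3.327015427999868
request_output_throughput_token_per_s = 51.99855658739881
number_total_tokens = 723
number_output_tokens = 173
number_input_tokens = 550

# The two identities that do hold:
assert abs(number_output_tokens / end_to_end_latency_s
           - request_output_throughput_token_per_s) < 1e-6
assert number_total_tokens == number_input_tokens + number_output_tokens

# The reconstructions that do not match the measured latency:
print(inter_token_latency_s * (number_output_tokens - 1))           # ~3.81
print(ttft_s + inter_token_latency_s * (number_output_tokens - 1))  # ~3.86
print(end_to_end_latency_s)                                         # 3.327 (measured)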
########## ttft_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L94
The time to the first token (via HTTP SSE), measured in seconds, including both network latency and queuing delays.
########## end_to_end_latency_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L103
The total time from sending the request to receiving the full streamed response.
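As a rough illustration of how these two timestamps can be taken on a streaming request, here is a minimal sketch against an OpenAI-compatible endpoint; the base_url, api_key, and model name are placeholders, and this is not the llmperf client linked above.

import time
from openai import OpenAI  # assumes the official openai>=1.0 Python package

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

start = time.monotonic()
ttft_s = None
stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # Record the time of the first chunk that actually carries generated text.
    if ttft_s is None and chunk.choices and chunk.choices[0].delta.content:
        ttft_s = time.monotonic() - start
end_to_end_latency_s = time.monotonic() - start  # time until the stream is exhausted

print(ttft_s, end_to_end_latency_s)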
########## token numbers
The token counts are measured using the "hf-internal-testing/llama-tokenizer".
https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main
However, the actual token counts processed or generated by a benchmarked model may differ,
and could be either higher or lower than these numbers.
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L63
https://github.com/ray-project/llmperf/blob/main/src/llmperf/utils.py#L59
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L117
Here are 2 examples:
# meta-llama/Llama-3.2-1B-Instruct
The tokens for 0: [15]
The tokens for 1: [16]
The tokens for 10: [605]
The tokens for 100: [1041]
The tokens for 999: [5500]
The tokens for 1000: [1041, 15]
# microsoft/phi-2
The tokens for 0: [15]
The tokens for 1: [16]
The tokens for 10: [940]
The tokens for 100: [3064]
The tokens for 999: [17032]
The tokens for 1000: [12825]
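These counts can be reproduced with the transformers library; the sketch below loads the counting tokenizer used by llmperf alongside one of the model tokenizers shown above (downloading them requires Hub access, and gated models such as meta-llama/Llama-3.2-1B-Instruct additionally require authentication).

from transformers import AutoTokenizer

# Tokenizer llmperf uses for its token counts.
counting_tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
# Tokenizer of a benchmarked model, which may split the same text differently.
model_tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

for text in ["0", "1", "10", "100", "999", "1000"]:
    print(text,
          counting_tok.encode(text, add_special_tokens=False),
          model_tok.encode(text, add_special_tokens=False))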
########## inter_token_latency_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L112
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L121
Total Time (TTFT + Generation Time) / Generated token number (measured by the model's tokenizer, not the llama-tokenizer)
########## request_output_throughput_token_per_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L104
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L115
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L117
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L126
Generated token number (measured by the llama-tokenizer) / Total Time (TTFT + Generation Time)
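Putting the two definitions side by side as a sketch (total_time_s stands for TTFT plus generation time, i.e. end_to_end_latency_s; the function names are just for illustration):

def inter_token_latency_s(total_time_s: float, model_output_tokens: int) -> float:
    # Denominator: token count from the benchmarked model's own tokenizer.
    return total_time_s / model_output_tokens

def request_output_throughput_token_per_s(total_time_s: float, llama_output_tokens: int) -> float:
    # Numerator: token count re-measured with hf-internal-testing/llama-tokenizer.
    return llama_output_tokens / total_time_s

With total_time_s = 3.327, 150 model-tokenizer tokens give roughly 0.0222 s per token, and 173 llama-tokenizer tokens give roughly 52 tokens/s, which matches the pair of values in the sample.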
########## The original_inter_token_latency_s, using the model's tokenizer and excluding TTFT
The original_number_output_tokens = end_to_end_latency_s (3.327015427999868) / inter_token_latency_s (0.022179236340013328) ≈ 150 tokens
The original_inter_token_latency_s = (end_to_end_latency_s - ttft_s) / (original_number_output_tokens - 1)
The original_inter_token_latency_s = (3.327015427999868 - 0.04729403999999704) / (150 - 1) = 0.022 s
########## The original_output_throughput_token_per_s, using the model's tokenizer and excluding TTFT
1 / original_inter_token_latency_s (0.022) = 45.45 tokens per second
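The same back-of-the-envelope derivation in code (pure arithmetic on the sample values; the original_* names are this document's labels, not llmperf output fields):

end_to_end_latency_s = 3.327015427999868
ttft_s = 0.04729403999999704
inter_token_latency_s = 0.022179236340013328

# Invert the ITL definition to recover the token count seen by the model's tokenizer.
original_number_output_tokens = round(end_to_end_latency_s / inter_token_latency_s)   # 150

# Inter-token latency with TTFT excluded, measured over the gaps between tokens.
original_inter_token_latency_s = (end_to_end_latency_s - ttft_s) / (original_number_output_tokens - 1)  # ~0.022 s

# Steady-state decode throughput implied by that latency.
original_output_throughput_token_per_s = 1 / original_inter_token_latency_s  # ~45.4 tokens/s

print(original_number_output_tokens,
      original_inter_token_latency_s,
      original_output_throughput_token_per_s)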