########## A sample generated by llmperf
{
"error_code": null,
"error_msg": "",
"inter_token_latency_s": 0.022179236340013328, # measured by the model's tokenizer and including TTFT
"ttft_s": 0.04729403999999704,
"end_to_end_latency_s": 3.327015427999868,
"request_output_throughput_token_per_s": 51.99855658739881, # measured by the llama tokenizer and including TTFT
"number_total_tokens": 723, # measured by the llama tokenizer
"number_output_tokens": 173, # measured by the llama tokenizer
"number_input_tokens": 550 # measured by the llama tokenizer
}
number_output_tokens (173) / end_to_end_latency_s (3.327015427999868) = request_output_throughput_token_per_s (51.99855658739881)
number_total_tokens = number_input_tokens + number_output_tokens
Why is there such a significant gap?
inter_token_latency_s x (number_output_tokens - 1) = 3.81
ttft_s + inter_token_latency_s x (number_output_tokens - 1) = 3.86
Both reconstructions are well above the measured end_to_end_latency_s = 3.327.
The mismatch comes from the token counts: inter_token_latency_s is derived from the model's own token count (and includes TTFT), while number_output_tokens (173) is re-counted with the llama tokenizer, as worked out in the original_inter_token_latency_s section below.
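These relationships and the gap can be checked with plain arithmetic on the sample record; the snippet below is only a sanity-check script over the copied values, not llmperf code.

# Values copied from the llmperf sample above.
inter_token_latency_s = 0.022179236340013328
ttft_s = 0.04729403999999704
end_to_end_latency_s = 3.327015427999868
request_output_throughput_token_per_s = 51.99855658739881
number_total_tokens = 723
number_output_tokens = 173
number_input_tokens = 550

# The two identities that do hold:
assert abs(number_output_tokens / end_to_end_latency_s
           - request_output_throughput_token_per_s) < 1e-6
assert number_total_tokens == number_input_tokens + number_output_tokens

# The reconstructions that do not match the measured latency:
print(inter_token_latency_s * (number_output_tokens - 1))           # ~3.81
print(ttft_s + inter_token_latency_s * (number_output_tokens - 1))  # ~3.86
print(end_to_end_latency_s)                                         # 3.327 (measured)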
########## ttft_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L94
The time to the first token (via HTTP SSE), measured in seconds, including both network latency and queuing delays.
########## end_to_end_latency_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L103
The total time from sending the request to receiving the full streamed response.
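As a rough illustration of how these two timestamps can be taken on a streaming request, here is a minimal sketch against an OpenAI-compatible endpoint; the base_url, api_key, and model name are placeholders, and this is not the llmperf client linked above.

import time
from openai import OpenAI  # assumes the official openai>=1.0 Python package

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

start = time.monotonic()
ttft_s = None
stream = client.chat.completions.create(
    model="my-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # Record the time of the first chunk that actually carries generated text.
    if ttft_s is None and chunk.choices and chunk.choices[0].delta.content:
        ttft_s = time.monotonic() - start
end_to_end_latency_s = time.monotonic() - start  # time until the stream is exhausted

print(ttft_s, end_to_end_latency_s)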
########## token numbers
The token counts are measured using the "hf-internal-testing/llama-tokenizer".
https://huggingface.co/hf-internal-testing/llama-tokenizer/tree/main
However, the actual token counts processed or generated by a benchmarked model may differ,
and could be either higher or lower than these numbers.
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L63
https://github.com/ray-project/llmperf/blob/main/src/llmperf/utils.py#L59
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L117
Here are 2 examples:
# meta-llama/Llama-3.2-1B-Instruct
The tokens for 0: [15]
The tokens for 1: [16]
The tokens for 10: [605]
The tokens for 100: [1041]
The tokens for 999: [5500]
The tokens for 1000: [1041, 15]
# microsoft/phi-2
The tokens for 0: [15]
The tokens for 1: [16]
The tokens for 10: [940]
The tokens for 100: [3064]
The tokens for 999: [17032]
The tokens for 1000: [12825]
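These counts can be reproduced with the transformers library; the sketch below loads the counting tokenizer used by llmperf alongside one of the model tokenizers shown above (downloading them requires Hub access, and gated models such as meta-llama/Llama-3.2-1B-Instruct additionally require authentication).

from transformers import AutoTokenizer

# Tokenizer llmperf uses for its token counts.
counting_tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
# Tokenizer of a benchmarked model, which may split the same text differently.
model_tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

for text in ["0", "1", "10", "100", "999", "1000"]:
    print(text,
          counting_tok.encode(text, add_special_tokens=False),
          model_tok.encode(text, add_special_tokens=False))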
########## inter_token_latency_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L112
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L121
Total Time (TTFT + Generation Time) / Generated token number (measured by the model's tokenizer, not the llama-tokenizer)
########## request_output_throughput_token_per_s
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L104
https://github.com/ray-project/llmperf/blob/main/src/llmperf/ray_clients/openai_chat_completions_client.py#L115
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L117
https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py#L126
Generated token number (measured by the llama-tokenizer) / Total Time (TTFT + Generation Time)
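Putting the two definitions side by side as a sketch (total_time_s stands for TTFT plus generation time, i.e. end_to_end_latency_s; the function names are just for illustration):

def inter_token_latency_s(total_time_s: float, model_output_tokens: int) -> float:
    # Denominator: token count from the benchmarked model's own tokenizer.
    return total_time_s / model_output_tokens

def request_output_throughput_token_per_s(total_time_s: float, llama_output_tokens: int) -> float:
    # Numerator: token count re-measured with hf-internal-testing/llama-tokenizer.
    return llama_output_tokens / total_time_s

With total_time_s = 3.327, 150 model-tokenizer tokens give roughly 0.0222 s per token, and 173 llama-tokenizer tokens give roughly 52 tokens/s, which matches the pair of values in the sample.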
########## The original_inter_token_latency_s, using the model's tokenizer and excluding TTFT
The original_number_output_tokens = end_to_end_latency_s (3.327015427999868) / inter_token_latency_s (0.022179236340013328) ≈ 150 tokens
The original_inter_token_latency_s = (end_to_end_latency_s - ttft_s) / (original_number_output_tokens - 1)
The original_inter_token_latency_s = (3.327015427999868 - 0.04729403999999704) / (150 - 1) = 0.022 s
########## The original_output_throughput_token_per_s, using the model's tokenizer and excluding TTFT
1 / original_inter_token_latency_s (0.022) = 45.45 tokens per second
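The same back-of-the-envelope derivation in code (pure arithmetic on the sample values; the original_* names are this document's labels, not llmperf output fields):

end_to_end_latency_s = 3.327015427999868
ttft_s = 0.04729403999999704
inter_token_latency_s = 0.022179236340013328

# Invert the ITL definition to recover the token count seen by the model's tokenizer.
original_number_output_tokens = round(end_to_end_latency_s / inter_token_latency_s)   # 150

# Inter-token latency with TTFT excluded, measured over the gaps between tokens.
original_inter_token_latency_s = (end_to_end_latency_s - ttft_s) / (original_number_output_tokens - 1)  # ~0.022 s

# Steady-state decode throughput implied by that latency.
original_output_throughput_token_per_s = 1 / original_inter_token_latency_s  # ~45.4 tokens/s

print(original_number_output_tokens,
      original_inter_token_latency_s,
      original_output_throughput_token_per_s)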