Doubts on 1st token latency decay

Hi,
My benchmark results are significantly worse than the numbers in the repo's performance docs, especially the 1st token latency. Could anyone help check this?

GPU: 1 x A100 80GB
CPU: 96 x Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
Binary: gptManagerBenchmark

**Model build:**
```bash
precision=float16
python build.py --model_dir /models/llama-2-7b-chat-hf/ \
	--dtype ${precision} \
	--use_gpt_attention_plugin ${precision} \
	--use_gemm_plugin ${precision} \
	--max_batch_size ${max_batch_size} \
	--max_input_len ${max_input_len} \
	--max_output_len ${max_output_len} \
	--output_dir /models/llama2-7b-chat-hf/${precision}-${max_batch_size}-${max_input_len}-${max_output_len}/1-gpu/ \
	--use_inflight_batching \
	--paged_kv_cache \
	--remove_input_padding \
	--enable_context_fmha
```
**Benchmark:**
```bash
${proj_dir}/cpp/bbb/benchmarks/gptManagerBenchmark \
	--model=llama \
	--engine_dir=/models/llama2-7b-chat-hf/${precision}-${max_batch_size}-${max_input_len}-${max_output_len}/1-gpu/ \
	--dataset=${proj_dir}/benchmarks/cpp/preprocessed_dataset.json \
	--log_level=info
```
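
Both commands above rely on `max_batch_size`, `max_input_len` and `max_output_len` being set in the shell beforehand. As a minimal sketch of how the engine directory is derived from them, the helper below (`engine_dir` is a hypothetical name, not part of the repo) prints the path for each configuration that appears in the tables of this post:

```bash
# Hypothetical helper: rebuilds the engine directory used by --output_dir and --engine_dir.
# Arguments: $1=precision $2=max_batch_size $3=max_input_len $4=max_output_len
engine_dir() {
	printf '/models/llama2-7b-chat-hf/%s-%s-%s-%s/1-gpu/\n' "$1" "$2" "$3" "$4"
}

# Configurations taken from the result tables in this post.
for cfg in "1 128 128" "16 2048 2048" "32 2048 2048" "64 2048 2048" "128 2048 2048"; do
	set -- $cfg   # split "batch input output" into $1 $2 $3
	engine_dir float16 "$1" "$2" "$3"
done
```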

**Throughput:**
Requests vary in input_ids length and request_output_len. The averages are:
input_ids: 19 tokens
request_output_len: 299 tokens

Model | max_batch_size, max_input_len, max_output_len | total requests | output tokens/s
-- | -- | -- | --
llama2-7b | 16, 2048, 2048 | 100 | 755.615
llama2-7b | 32, 2048, 2048 | 100 | 906.030
llama2-7b | 64, 2048, 2048 | 100 | 977.744
llama2-7b | 128, 2048, 2048 | 100 | 986.230
llama2-7b | 16, 2048, 2048 | 1000 | 1193.203
llama2-7b | 32, 2048, 2048 | 1000 | 1978.126
llama2-7b | 64, 2048, 2048 | 1000 | 2860.245
llama2-7b | 128, 2048, 2048 | 1000 | 3227.109
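
To make the gap between the 100-request and 1000-request runs easier to see, here is a rough sketch that computes the throughput ratio for each max_batch_size. The numbers are copied from the table above; `ratio` is just a hypothetical helper name.

```bash
# ratio A B: prints A/B with two decimals (awk does the floating-point division)
ratio() {
	awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Columns: max_batch_size, tokens/s with 100 requests, tokens/s with 1000 requests
while read -r bs t100 t1000; do
	printf 'max_batch_size=%s  1000-vs-100 ratio=%s\n' "$bs" "$(ratio "$t1000" "$t100")"
done <<'EOF'
16 755.615 1193.203
32 906.030 1978.126
64 977.744 2860.245
128 986.230 3227.109
EOF
```

The ratio grows with max_batch_size (from about 1.6 at 16 to about 3.3 at 128), which might mean that 100 requests are too few to keep the larger batches full. That is only a guess from these numbers.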

**1st token latency:**
Model | max_batch_size, max_input_len, max_output_len | total requests | input tokens | first token latency (ms)
-- | -- | -- | -- | --
llama2-7b | 1, 128, 128 | 1 | 19 | 142
llama2-7b | 1, 128, 128 | 1 | 128 | 163
llama2-7b | 128, 2048, 2048 | 1 | 128 | 161
llama2-7b | 128, 2048, 2048 | 1 | 2048 | 289
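
To get a feeling for how much of the first token latency scales with prompt length, here is a quick sketch that computes the marginal latency per input token between two rows built with the same engine. `ms_per_token` is a hypothetical helper, and a two-point estimate from single runs is of course very rough.

```bash
# ms_per_token N1 T1 N2 T2: slope (T2 - T1) / (N2 - N1) in ms per input token
ms_per_token() {
	awk -v n1="$1" -v t1="$2" -v n2="$3" -v t2="$4" \
		'BEGIN { printf "%.4f\n", (t2 - t1) / (n2 - n1) }'
}

# Engine 1, 128, 128: 19 tokens -> 142 ms, 128 tokens -> 163 ms
ms_per_token 19 142 128 163

# Engine 128, 2048, 2048: 128 tokens -> 161 ms, 2048 tokens -> 289 ms
ms_per_token 128 161 2048 289
```

With the table values this gives roughly 0.19 ms/token and 0.067 ms/token. That hints that most of the ~160 ms at 128 input tokens is a fixed cost rather than per-token compute, though I cannot be sure from so few data points.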


