Doubts on 1st token latency decay #154

@ljayx

Hi,
My benchmark results are significantly worse than the numbers in the repo's performance docs, especially the 1st token latency. Could anyone help check it?

GPU: 1 x A100 80GB
CPU: 96 x Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
Binary: gptManagerBenchmark

Model build:

precision=float16
python build.py --model_dir /models/llama-2-7b-chat-hf/ \
	--dtype ${precision} \
	--use_gpt_attention_plugin ${precision} \
	--use_gemm_plugin ${precision} \
	--max_batch_size ${max_batch_size} \
	--max_input_len ${max_input_len} \
	--max_output_len ${max_output_len} \
	--output_dir /models/llama2-7b-chat-hf/${precision}-${max_batch_size}-${max_input_len}-${max_output_len}/1-gpu/ \
	--use_inflight_batching \
	--paged_kv_cache \
	--remove_input_padding \
	--enable_context_fmha

Benchmark:

${proj_dir}/cpp/bbb/benchmarks/gptManagerBenchmark \
	--model=llama \
	--engine_dir=/models/llama2-7b-chat-hf/${precision}-${max_batch_size}-${max_input_len}-${max_output_len}/1-gpu/ \
	--dataset=${proj_dir}/benchmarks/cpp/preprocessed_dataset.json \
	--log_level=info

Throughput:
Requests vary in input_ids length and request_output_len. The averages are:
input_ids: 19
request_output_len: 299
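
For anyone who wants to check these averages against their own dataset, here is a minimal sketch. It assumes each request is a record with an `input_ids` list and an integer `request_output_len`, matching the field names above; the actual layout of `preprocessed_dataset.json` may differ, and `average_lengths` and the sample records are hypothetical.

```python
from statistics import mean


def average_lengths(requests):
    """Return (mean input length, mean requested output length).

    Assumes every record has an `input_ids` list and an integer
    `request_output_len`; this layout is an assumption, not a
    confirmed description of preprocessed_dataset.json.
    """
    input_lens = [len(r["input_ids"]) for r in requests]
    output_lens = [r["request_output_len"] for r in requests]
    return mean(input_lens), mean(output_lens)


# Hypothetical records, only to show the call shape.
sample = [
    {"input_ids": [1, 2, 3], "request_output_len": 10},
    {"input_ids": [4, 5, 6, 7, 8], "request_output_len": 20},
]
avg_in, avg_out = average_lengths(sample)
print(avg_in, avg_out)
```

With the real dataset, the records would be loaded with `json.load` first and passed to the same function.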

| Model | max_batch_size, max_input_len, max_output_len | Total requests | Output tokens/s |
| --- | --- | --- | --- |
| llama2-7b | 16, 2048, 2048 | 100 | 755.615 |
| llama2-7b | 32, 2048, 2048 | 100 | 906.030 |
| llama2-7b | 64, 2048, 2048 | 100 | 977.744 |
| llama2-7b | 128, 2048, 2048 | 100 | 986.230 |
| llama2-7b | 16, 2048, 2048 | 1000 | 1193.203 |
| llama2-7b | 32, 2048, 2048 | 1000 | 1978.126 |
| llama2-7b | 64, 2048, 2048 | 1000 | 2860.245 |
| llama2-7b | 128, 2048, 2048 | 1000 | 3227.109 |

1st token latency:

| Model | max_batch_size, max_input_len, max_output_len | Total requests | Input tokens | First token latency (ms) |
| --- | --- | --- | --- | --- |
| llama2-7b | 1, 128, 128 | 1 | 19 | 142 |
| llama2-7b | 1, 128, 128 | 1 | 128 | 163 |
| llama2-7b | 128, 2048, 2048 | 1 | 128 | 161 |
| llama2-7b | 128, 2048, 2048 | 1 | 2048 | 289 |
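
One rough way to read the two measurements from the 128, 2048, 2048 engine is to split the latency into a fixed part and a per-input-token part. The sketch below draws a straight line through those two points. This is only an approximation made for illustration (two points are not a real fit, and context-phase cost is not strictly linear in input length); `linear_fit` is a hypothetical helper.

```python
def linear_fit(p0, p1):
    """Line through two (input_tokens, latency_ms) points.

    Returns (slope in ms per input token, intercept in ms).
    """
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return slope, intercept


# Measurements from the 128, 2048, 2048 engine reported above.
slope, intercept = linear_fit((128, 161), (2048, 289))
print(f"{slope:.4f} ms per input token, {intercept:.1f} ms fixed")
```

Under this assumption, most of the measured first token latency does not depend on the input length, which may help narrow down where the time is going.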
