
(perf-overview)=

> [!IMPORTANT]
> As of TensorRT-LLM v0.10, these performance benchmarks have changed methodology to utilize in-flight batching and no longer utilize static benchmarking. These numbers are initial measurements and are expected to improve in future releases.

# Overview

This document summarizes performance measurements of TensorRT-LLM on H200 and H100 (Hopper), L40S (Ada), and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be considered the peak performance achievable by TensorRT-LLM.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

The current implementation combines two Matmul operations into one Matmul followed by a separate SwiGLU kernel (when `--use_fused_mlp=enable` is enabled). There is also a more efficient implementation that runs a single fused Matmul + SwiGLU kernel for FP8 on Hopper (when `--use_fused_mlp=enable --gemm_swiglu_plugin fp8` is enabled). The `gemm_swiglu_plugin` will support more data types and GPU architectures in a future release.
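For reference, both options are engine-build-time flags. Below is a minimal sketch of how they might be passed to `trtllm-build`; the `$checkpoint_dir` and `$engine_dir` paths are placeholders, other build options are omitted, and exact flag spellings can vary between releases.

```shell
# Sketch only: build an FP8 LLaMA engine on Hopper with the fused Matmul + SwiGLU path.
# $checkpoint_dir is assumed to contain an already-converted FP8 checkpoint;
# $engine_dir is the desired output location.
trtllm-build --checkpoint_dir $checkpoint_dir \
             --output_dir $engine_dir \
             --use_fused_mlp=enable \
             --gemm_swiglu_plugin fp8
```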

## Throughput Measurements

The table below shows performance data collected when a local inference client is fed requests at an infinite rate (no delay between messages); it reflects the throughput of a client-server scenario under maximum load.

The performance numbers below were collected using the steps described in this document.

All data in the table below was generated using TensorRT-LLM version 0.14.0 and presents token throughput in tokens/second.

GPU H200 141GB HBM3 H100 80GB HBM3 H100 80GB HBM3 A100-SXM4-80GB A100-PCIE-80GB L40S
Precision FP8 FP8 FP16 FP16 FP16 FP8
Model Input/Output Lengths TP Size
LLaMA v3 70B 1000/1000 1 2594.2199 464.5243
2 4574.1197 4092.3267 776.9965 468.5805 259.1155
4 7612.2487 6925.0844 3730.2064 1765.9123 987.1971 1159.357
8 13075.5194 10733.0804 5963.0914 3054.8915 960.3737 1173.3517
128/128 1 3904.1639 2551.6384
2 5343.8677 5191.7428 3183.9714 1334.903 806.1477
4 8829.1049 8540.5362 5837.9598 2421.4383 1275.5474 1427.9115
8 16359.1322 15498.2004 10597.6556 4474.1621 1223.1747 1377.473
128/2048 1 3613.7474 418.3639
2 7112.2959 5852.0185 817.52 511.6257
4 12772.8148 8998.3742 5072.0345 2484.2018 1471.9105 1771.4437
8 19722.5974 15099.0633 7554.2141 4463.6602 1589.1759 1953.7918
128/4096 1 2409.6881
2 5687.3482 3513.0941 413.3767 273.5871
4 8937.3115 6718.5895 3093.7358 1688.0132 1231.8104 1279.2496
8 13976.1386 9279.1013 5001.2743 2948.5374 1350.794 1494.0776
2048/128 1 457.5772 241.7561
2 699.5582 690.9961 328.0399 145.088 91.1746
4 1035.6523 1008.8318 670.6725 278.5717 150.2619 168.7886
8 2055.7245 1996.2653 1288.7599 546.9599 140.0144 160.2741
2048/2048 1 1802.1116 204.0931
2 3487.2497 2444.6903 165.6522 126.1101
4 6126.7196 4850.8285 2386.6556 1230.1833 822.2269 876.6085
8 9784.0193 7432.6659 3991.2123 2144.3042 883.4809 994.94
500/2000 1 2822.7846 389.8823
2 6175.7623 4601.857 687.5386 430.6093
4 10783.8925 9018.9053 3698.3674 2113.3936 1248.8319 1468.7827
8 17631.9756 11375.9582 6321.3679 3673.5693 1321.8541 1636.4588
5000/500 1 532.2603 123.8543
2 931.8255 897.4263 227.9005 117.5698 75.35
4 1399.7865 1316.2865 831.2804 362.3465 209.8052 234.7343
8 2725.1283 2469.5585 1446.3508 662.5725 202.0719 231.9027
LLaMA v3.1 405B 1000/1000 8 3391.0372
128/128 8 3766.2785
128/2048 8 5952.1416
128/4096 8 3944.117
20000/2000 8 481.5732
2048/128 8 444.5735
2048/2048 8 2604.8557
500/2000 8 4805.86
5000/500 8 655.9754
LLaMA v3.1 70B 1000/1000 1 2585.0953 410.286
2 4600.9616 4116.4444 785.4931 468.6383 257.972
4 7607.5304 6932.8808 3774.676 1762.6831 989.4082 1161.4814
8 13081.434 10730.156 5978.4573 3190.0211 959.8463 1188.1193
128/128 1 3897.2623 2459.6003
2 5357.0227 5194.8171 3207.2866 1346.9692 806.7215
4 8826.9618 8542.3012 5846.8413 2420.8665 1272.6755 1438.0446
8 16382.9807 15533.1169 10649.4968 4572.3445 1212.0566 1381.7051
128/2048 1 3612.2603 445.7773
2 7054.7235 5869.3998 822.1912 483.1299
4 12763.4114 9017.4377 4982.6225 2492.4036 1435.236 1763.522
8 19266.0398 15190.1652 7605.5295 4254.2871 1609.2473 1944.1251
128/4096 1 2415.1981
2 5671.9561 3518.782 419.0178 272.9137
4 8939.8227 6431.2702 3083.8794 1685.9677 1212.5416 1280.3778
8 13974.2854 9168.709 4981.9765 3067.5452 1310.091 1499.2441
20000/2000 1 240.7202
2 614.318 397.6801
4 1030.9528 851.8542 369.4269 179.5181 126.7676 140.5565
8 1898.9762 1354.5333 362.9368 156.5767 141.1584
2048/128 1 458.1948 244.1842
2 692.3911 697.3907 322.7016 144.7921 95.0306
4 1034.5773 1001.0771 688.0344 278.4018 150.6795 169.0386
8 2070.8157 1966.6072 1316.3086 550.4751 142.6166 163.6749
2048/2048 1 1797.6743 209.1707
2 3518.0774 2445.0093 166.792 126.1127
4 6112.9026 4838.5272 2393.1359 1231.0359 823.4777 876.2254
8 9716.1934 7434.8117 4023.6978 2171.5323 858.6602 1001.3649
500/2000 1 2826.6665
2 6106.5855 4605.9226 700.5415 430.6129
4 10816.8283 9205.3766 3781.082 2096.2441 1176.418 1470.0826
8 17693.705 13109.4437 6205.2658 3486.7891 1306.35 1639.2778
5000/500 1 533.6128 125.4236
2 936.7014 886.6758 228.874 116.9529 76.1601
4 1386.4827 1313.893 849.1091 362.9361 209.2045 236.117
8 2711.5057 2444.9643 1420.5163 670.3742 203.8008 230.3084
LLaMA v3.1 8B 1000/1000 1 16414.6988 14108.0361 7054.5156 3634.3886 3165.3542 3726.7552
128/128 1 27778.8885 26933.1886 15571.6549 6701.7958 5338.0166 8639.7933
128/2048 1 22948.5383 18995.2523 9150.7477 4963.4443 4250.6391 5101.6652
128/4096 1 15583.3035 11815.449 5368.9227 3011.3335 2568.5398 2774.5363
20000/2000 1 1649.5453 1301.4754 562.8735 316.533 291.4776 270.5404
2048/128 1 3619.4309 3460.3545 1904.3259 795.389 611.8446 986.9134
2048/2048 1 11032.9729 8777.6623 4159.6857 2264.9513 2011.1215 2018.303
500/2000 1 19510.4015 14993.328 7498.3331 3945.1912 3374.7133 4065.3921
5000/500 1 3787.6721 3258.2001 1708.0353 790.6631 703.56 855.9822
Mistral 7B 1000/1000 1 17739.1436 14986.7562 7697.1418 3804.5585 3333.4754 3981.4799
128/128 1 30094.9137 29341.284 16238.937 6914.2184 5491.7418 9127.5052
128/2048 1 24671.5477 20941.6631 9708.1161 5303.4318 4402.3044 5357.3405
128/4096 1 16454.0833 12780.3724 5800.4957 3235.0678 2825.7896 2879.9833
20000/2000 1 1676.0415 1317.9654 569.7589 324.5936 281.4751 286.353
2048/128 1 3649.1462 3492.3042 1929.3126 800.9286 617.0932 1019.75
2048/2048 1 11403.6968 8974.7383 4367.8733 2331.8112 1988.3496 2184.3861
500/2000 1 20819.4592 15992.3357 7947.4257 4189.395 3603.4489 4286.3867
5000/500 1 3840.0108 3340.7385 1707.2611 807.4561 722.8385 881.7336
Mixtral 8x22B 1000/1000 8 18557.43 16918.03 9759.888 4753.6273 2128.4403
128/128 8 25179.4765 23729.5293 16421.3182 6948.5923 2488.6297
128/2048 8 27492.4926 24556.7807 12303.4168 7246.7172 3540.0067
128/4096 8 19718.8648 17755.0018 7474.3817 4696.6123 2568.3114
20000/2000 8 2897.182 2189.606 1118.8294 594.8509 309.0799
2048/128 8 3093.8418 2917.1362 1994.0127 825.3934 294.7706
2048/2048 8 13795.9827 12487.6502 5857.8831 3377.8371 1694.6176
500/2000 8 24637.473 19997.3914 10637.6598 6007.619 2976.9633
5000/500 8 3889.2745 3578.4843 2211.2377 1028.3843 420.2156
Mixtral 8x7B 1000/1000 2 18712.2046 15931.8663 6052.876 3276.6186 1907.8817
4 32834.0923 28015.1981 15509.1538 7357.1613 4737.0179 5060.8399
8 44410.7533 40573.0499 27684.9381 13948.1533 4970.9287 5725.9638
128/128 2 24970.5594 24321.9927 15334.2103 5915.3897 3810.1846
4 42500.5855 40182.7271 27718.9857 11328.7486 6026.9206 6769.9441
8 54304.0436 51030.9048 40119.3268 17918.1146 5573.7682 6422.4308
128/2048 2 29314.1475 20945.7816 7409.9253 4284.3035 2248.1815
4 52680.8353 40668.5928 21293.1761 10929.0182 7353.7405 7506.7612
8 70409.1968 64529.9982 40839.3077 21058.2144 8866.251 9907.6896
128/4096 2 21520.4385 12070.6724 3928.6678 2302.964 1171.966
4 32550.5267 29120.2002 11678.0071 6538.1511 5176.9632 4958.7004
8 40373.4857 36357.7861 21628.821 13565.7778 7209.2336 8271.7938
20000/2000 2 2204.1378 1659.5907 622.2717 321.9839 185.6671
4 4047.7473 3290.9457 1602.0208 778.7285 572.4282 587.1759
8 6561.6849 5328.5261 3113.2047 1645.8114 750.5372 828.8471
2048/128 2 2958.0873 2883.5166 1796.5451 687.7251 465.1585
4 5229.8744 4972.6818 3354.994 1351.7191 728.4943 812.0143
8 7030.9766 6532.721 5025.3047 2248.6418 677.9886 771.3656
2048/2048 2 13842.834 9334.0732 3503.0218 1997.1923 1060.8946
4 22389.4914 20185.8212 9143.2741 4963.8758 3520.3659 3453.8076
8 28975.322 26176.9163 19291.8278 10552.9732 4590.187 4929.7228
500/2000 2 23459.0411 18185.6392 6023.3308 3438.6964 1817.11
4 39971.0236 31693.8787 17087.037 8930.3495 6117.5624 6434.9178
8 60721.462 48842.8084 31358.2791 17034.706 7118.0767 8130.8026
5000/500 2 3742.5293 3563.8228 1648.9041 733.1921 448.6716
4 6602.3877 6020.6267 3543.6819 1603.8223 948.0567 1047.3212
8 8862.8164 8214.9445 5968.7734 2813.1531 969.817 1098.3081

TP stands for Tensor Parallelism

## Reproducing Benchmarked Results

> [!NOTE]
> The only models supported in this workflow are those listed in the table above.

The following tables are references for commands that are used as part of the benchmarking process. For a more detailed description of this benchmarking workflow, see the benchmarking suite documentation.

### Commands

| Stage | Description | Command |
| :- | :- | :- |
| Dataset | Create a synthetic dataset | `python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file` |
| Build | Build a TensorRT-LLM engine | `trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file` |
| Run | Run a benchmark with a dataset | `trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir` |

### Variables

| Name | Description |
| :- | :- |
| `$isl` | Benchmark input sequence length. |
| `$osl` | Benchmark output sequence length. |
| `$tp_size` | Number of GPUs to run the benchmark with. |
| `$engine_dir` | Location to store the built engine file (can be deleted after running benchmarks). |
| `$model_name` | HuggingFace model name, e.g. `meta-llama/Llama-2-7b-hf`, or the path to a local weights directory. |
| `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
| `$num_requests` | The number of requests to generate for dataset generation. |
| `$seq_len` | A sequence length of ISL + OSL. |
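Putting the variables and commands above together, a minimal end-to-end sketch for the 128/128 case might look like the following. The model name, sequence lengths, and paths are example values only; the engine path is printed by the build step, as shown later in this document.

```shell
# Example values only; substitute your own model, lengths, and paths.
model_name="meta-llama/Llama-2-7b-hf"
isl=128
osl=128
num_requests=30000
tp_size=1
dataset_file=/tmp/synthetic_128_128.txt
engine_dir=/tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1   # reported by the build stage

# Dataset: create a synthetic dataset
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist \
    --num-requests=$num_requests --input-mean=$isl --output-mean=$osl \
    --input-stdev=0 --output-stdev=0 > $dataset_file

# Build: build a TensorRT-LLM engine
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file

# Run: run the maximum-throughput benchmark with the dataset
trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
```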

### Preparing a Dataset

To prepare a dataset, you can use the provided script. To generate a synthetic dataset, run the following command:

```shell
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
```

The command will generate a text file at the path specified by `$dataset_file`, where all requests have the same input/output sequence length combination. The script works by using the tokenizer to retrieve the vocabulary size and randomly sampling token IDs from it to create entirely random sequences. In the command above, all requests will be uniform because the standard deviations for both input and output sequences are set to 0.

For each input and output sequence length combination, the table below details the `$num_requests` value that was used. For shorter input and output lengths, a larger number of requests was used to guarantee that the system reached a steady state, because requests enter and exit the system at a much faster rate. For longer input/output sequence lengths, requests remain in the system longer and therefore fewer requests are needed to reach steady state.

| Input Length | Output Length | `$seq_len` | `$num_requests` |
| -: | -: | -: | -: |
| 128 | 128 | 256 | 30000 |
| 128 | 2048 | 2176 | 3000 |
| 128 | 4096 | 4224 | 1500 |
| 2048 | 128 | 2176 | 3000 |
| 2048 | 2048 | 4096 | 1500 |
| 5000 | 500 | 5500 | 1500 |
| 1000 | 1000 | 2000 | 3000 |
| 500 | 2000 | 2500 | 3000 |
| 20000 | 2000 | 22000 | 1000 |
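If you want to generate a dataset for every combination in the table above in one pass, a small shell loop over the (input length, output length, `$num_requests`) triples is one way to do it; the output file naming here is purely illustrative.

```shell
# Illustrative loop over the ISL/OSL/num_requests combinations listed above.
for combo in "128 128 30000" "128 2048 3000" "128 4096 1500" \
             "2048 128 3000" "2048 2048 1500" "5000 500 1500" \
             "1000 1000 3000" "500 2000 3000" "20000 2000 1000"; do
  set -- $combo
  isl=$1; osl=$2; num_requests=$3
  python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist \
      --num-requests=$num_requests --input-mean=$isl --output-mean=$osl \
      --input-stdev=0 --output-stdev=0 > dataset_${isl}_${osl}.txt
done
```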

### Engine Building

All engines are built using the `trtllm-bench build` sub-command. The basic command for FP8 quantized engines is as follows:

```shell
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file
```

or if you would like to build for a specific sequence length:

```shell
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --max_seq_length $seq_len
```

If you would like to build an FP16 engine without any quantization, simply remove the `--quantization FP8` option.
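For example, the FP16 equivalent of the build command above is simply:

```shell
# Same build command as above, with --quantization FP8 removed (FP16, no quantization).
trtllm-bench --model $model_name build --tp_size $tp_size --dataset $dataset_file
```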

> [!NOTE]
> If you specify FP8 quantization, the KV cache will automatically be set to FP8 as well!

The `trtllm-bench build` sub-command will output the path where the engine is located upon a successful build. For example,

```
===========================================================
ENGINE SAVED: /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
===========================================================
```
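If you are scripting the workflow, one hypothetical way to capture that printed path into `$engine_dir` for the run stage is to filter the build output. The exact log format and output stream may differ between versions, so adjust the pattern as needed.

```shell
# Hypothetical helper: capture the "ENGINE SAVED:" path from the build output.
engine_dir=$(trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 \
    --dataset $dataset_file | grep "ENGINE SAVED:" | awk '{print $3}')
echo "Engine directory: $engine_dir"
```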

### Running the Benchmark

To run the benchmark with the generated dataset, simply use the `trtllm-bench throughput` sub-command. The benchmarker will run an offline maximum-throughput scenario in which all requests are queued in rapid succession. You simply need to provide the path to the engine from the build phase and a generated dataset.

```shell
trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
```

The results will be printed to the terminal upon benchmark completion. For example,

```
===========================================================
= ENGINE DETAILS
===========================================================
Model:                  meta-llama/Llama-2-7b-hf
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
TensorRT-LLM Version:   0.12.0
Dtype:                  float16
KV Cache Dtype:         FP8
Quantization:           FP8
Max Input Length:       2048
Max Sequence Length:    4098

===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
Max Runtime Batch Size: 4096
Max Runtime Tokens:     8192
Scheduling Policy:      Guaranteed No Evict
KV Memory Percentage:   99.0%
Issue Rate (req/sec):   3.680275266452667e+18
===========================================================
= STATISTICS
===========================================================
Number of requests:             3000
Average Input Length (tokens):  128.0
Average Output Length (tokens): 128.0
Token Throughput (tokens/sec):  23405.927228471104
Request Throughput (req/sec):   182.8588064724305
Total Latency (seconds):        16.406100739
===========================================================
```
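As a quick sanity check, the reported token throughput in this example is consistent with output tokens divided by total latency: 3000 requests × 128 output tokens ≈ 384,000 tokens generated over about 16.41 seconds, or roughly 23,400 tokens/sec.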

> [!WARNING]
> In some cases, the benchmarker may not print anything at all. This behavior usually means that the benchmark has hit an out-of-memory issue. Try reducing the KV cache percentage using the `--kv_cache_free_gpu_mem_fraction` option to lower the percentage of memory used.
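For example, re-running the benchmark with a reduced KV cache fraction might look like the following; the value 0.90 is an arbitrary starting point, so lower it further if the issue persists.

```shell
# Illustrative only: lower the KV cache memory fraction from the 99% default shown above.
trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir \
    --kv_cache_free_gpu_mem_fraction 0.90
```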