This document explains how to build the BERT family of models, specifically BERT and RoBERTa, using TensorRT-LLM. It also describes how to run them on a single GPU and on two GPUs.
The TensorRT-LLM BERT family implementation can be found in `tensorrt_llm/models/bert/model.py`.
The TensorRT-LLM BERT family example code is located in `examples/bert`. There are two main files in that folder:

- `convert_checkpoint.py` converts the BERT model into the TensorRT-LLM checkpoint format.
- `run.py` runs inference on an input text.
The `convert_checkpoint.py` script converts weights from the HuggingFace format to the TensorRT-LLM format. You need to prepare the HuggingFace checkpoint files before you run the convert script.

Use `--model_dir` to specify the HuggingFace checkpoint directory.

Use `--model` to specify the target BERT model class. Supported options are `BertModel`, `BertForQuestionAnswering`, and `BertForSequenceClassification`. Note that if you choose `BertModel`, `convert_checkpoint.py` will ignore the BERT model class specified in the HuggingFace config file and throw a warning to remind you.

Use `--output_dir` to specify the directory for the converted checkpoint and configuration. The default value is `./tllm_checkpoint`. This directory will be used in the next engine-building phase.
Take `BertForQuestionAnswering` as an example:
```bash
export hf_model_dir=<HuggingFace_Model_Path>
export model_name='bertqa'
export model='BertForQuestionAnswering'
export dtype='float16'

# convert
python convert_checkpoint.py \
        --model $model \
        --model_dir $hf_model_dir \
        --output_dir ${model_name}_${dtype}_tllm_checkpoint \
        --dtype $dtype

# convert tp=2
python convert_checkpoint.py \
        --model $model \
        --model_dir $hf_model_dir \
        --output_dir ${model_name}_${dtype}_tllm_checkpoint \
        --dtype $dtype \
        --tp_size 2
```
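The converted checkpoint directory holds the model configuration plus one weight file per tensor-parallel rank. As a rough sanity check after conversion (the exact file names below are an assumption and may vary across TensorRT-LLM versions):

```bash
# Hypothetical listing; exact contents depend on the TensorRT-LLM version.
ls ${model_name}_${dtype}_tllm_checkpoint
# tp_size=1: config.json  rank0.safetensors
# tp_size=2: config.json  rank0.safetensors  rank1.safetensors
```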
TensorRT-LLM converts HuggingFace BERT family models into TensorRT engine(s). To build the TensorRT engine, the basic command is:
```bash
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir ${model_name}_engine_outputs
```
Besides the basic engine build, TensorRT-LLM provides the following features, which are enabled by adding these flags to the basic build command:
- To use the BERT attention plugin, add `--bert_attention_plugin` to the command.
- To remove input padding, add `--remove_input_padding=enable` and `--bert_attention_plugin` to the command. Note that the remove-input-padding feature requires the BERT attention plugin.
- To use FMHA kernels, add `--context_fmha=enable` or `--bert_context_fmha_fp32_acc=enable` (to enable FP32 accumulation). Note that these two flags must be used together with `--bert_attention_plugin`.
Continuing the `BertForQuestionAnswering` example:
```bash
# Build the TensorRT engine for the BertForQuestionAnswering model, with remove_input_padding enabled.
# TP=1 and TP=2 share the same build command.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
             --output_dir=${model_name}_engine_outputs \
             --remove_input_padding=enable \
             --bert_attention_plugin=${dtype} \
             --max_batch_size 8 \
             --max_input_len 512
```
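The build writes the serialized engine(s) and a build configuration into the output directory, one engine per rank. A hedged sanity check (the file names below are an assumption and may differ between TensorRT-LLM versions) could look like:

```bash
# Hypothetical listing; exact file names depend on the TensorRT-LLM version.
ls ${model_name}_engine_outputs
# TP=1: config.json  rank0.engine
# TP=2: config.json  rank0.engine  rank1.engine
```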
Run a TensorRT-LLM BERT model using the engines generated by the build command above. Note that during model deployment, only the TensorRT engine files are needed; the previously downloaded model checkpoints and converted weights can be removed.

`run.py` provides an example of performing inference and decoding the output. By default, it uses a task-specific dataset as the input text, for example `squad_v2` for `BertForQuestionAnswering`.
To run the TensorRT engine, the basic command is:
```bash
python run.py --engine_dir ./${model_name}_engine_outputs \
              --hf_model_dir $hf_model_dir  # used for loading the tokenizer
```
Please note that:

- To use remove input padding, add `--remove_input_padding` to the command. This flag tells the runtime how to process the input and decode the output.
- To compare the results with the HuggingFace model, add `--run_hf_test` to the command. The runtime will load the HF model from `hf_model_dir` and compare the results. Refer to `run.py` for more details.
Continuing the `BertForQuestionAnswering` example:
```bash
# Run the TensorRT engine
python run.py \
        --engine_dir ./${model_name}_engine_outputs \
        --hf_model_dir=$hf_model_dir \
        --remove_input_padding \
        --run_hf_test

# Run TP=2 inference
mpirun -n 2 \
    python run.py \
        --engine_dir ./${model_name}_engine_outputs \
        --hf_model_dir=$hf_model_dir \
        --remove_input_padding \
        --run_hf_test
```