BERT and BERT Variants

This document explains how to build models in the BERT family, specifically BERT and RoBERTa, using TensorRT-LLM. It also describes how to run inference on a single GPU and on two GPUs.

Overview

The TensorRT-LLM BERT family implementation can be found in tensorrt_llm/models/bert/model.py. The TensorRT-LLM BERT family example code is located in examples/bert. There are two main files in that folder:

  • convert_checkpoint.py, which converts weights from the HuggingFace format to the TRT-LLM checkpoint format.

  • run.py, which runs inference with the built TensorRT engine(s) and decodes the output.

Convert Weights

The convert_checkpoint.py script converts weights from the HuggingFace format to the TRT-LLM format. You need to prepare the HuggingFace checkpoint files before running the convert script.

Use --model_dir to specify the HuggingFace checkpoint directory.

Use --model to specify the target BERT model class. Supported options are BertModel, BertForQuestionAnswering, and BertForSequenceClassification. Please note that if you choose BertModel, convert_checkpoint.py ignores the model class specified in the HuggingFace config file and emits a warning to remind you.
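
If you are not sure which class a given checkpoint corresponds to, the architectures field of its HuggingFace config usually records it. The following quick check is not part of convert_checkpoint.py; it only assumes the transformers package is installed and that the path points at your HuggingFace checkpoint (the field may be empty for some checkpoints):

# Optional inspection helper (not part of the conversion script): print the
# model class recorded in the HuggingFace config so you can pick --model.
python -c "from transformers import AutoConfig; print(AutoConfig.from_pretrained('<HuggingFace_Model_Path>').architectures)"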

Use --output_dir to specify the directory for the converted checkpoint and its configuration. The default value is ./tllm_checkpoint. This directory is used in the next engine-building phase.

Take BertForQuestionAnswering as an example:

export hf_model_dir=<HuggingFace_Model_Path>
export model_name='bertqa'
export model='BertForQuestionAnswering'
export dtype='float16'

# Convert the HuggingFace checkpoint (single GPU, TP=1)
python convert_checkpoint.py \
--model $model \
--model_dir $hf_model_dir \
--output_dir ${model_name}_${dtype}_tllm_checkpoint \
--dtype $dtype

# Convert with 2-way tensor parallelism (TP=2).
# Note: this writes to the same --output_dir as the TP=1 conversion above,
# so convert and build one configuration at a time (or pick a different --output_dir).
python convert_checkpoint.py \
--model $model \
--model_dir $hf_model_dir \
--output_dir ${model_name}_${dtype}_tllm_checkpoint \
--dtype $dtype \
--tp_size 2
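
After conversion, the output directory should contain a config.json describing the converted model plus per-rank weight files; the exact file names can differ between TensorRT-LLM versions. A quick sanity check:

# Sanity check: list the converted checkpoint and peek at its config.
# Expect a config.json plus one weight file per tensor-parallel rank;
# exact file names depend on the TensorRT-LLM version.
ls ${model_name}_${dtype}_tllm_checkpoint
python -m json.tool ${model_name}_${dtype}_tllm_checkpoint/config.json | head -n 20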

Build TensorRT engine(s)

TensorRT-LLM converts HuggingFace BERT family models into TensorRT engine(s). To build the TensorRT engine, the basic command is:

trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
--output_dir ${model_name}_engine_outputs

Besides the basic engine build, TensorRT-LLM provides the following features, which can be enabled by adding these flags to the basic build command:

  • To use the BERT attention plugin, add --bert_attention_plugin to the command.

  • To remove input padding, add --remove_input_padding=enable together with --bert_attention_plugin to the command. Please note that the remove-input-padding feature requires the BERT attention plugin.

  • To use FMHA kernels, add --context_fmha=enable, or --bert_context_fmha_fp32_acc=enable to enable FP32 accumulation. Note that these two flags should be used together with --bert_attention_plugin, as in the sketch below.
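
A build that combines the attention plugin with FMHA might look like the following. This is only a sketch assembled from the flags listed above, not an additional command from the original guide; the paths and ${dtype} reuse the variables exported in the conversion step.

# Sketch: enable the BERT attention plugin and FMHA in one build.
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint \
--output_dir ${model_name}_engine_outputs \
--bert_attention_plugin=${dtype} \
--context_fmha=enable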

Continuing the BertForQuestionAnswering example:

# Build TensorRT engine for BertForQuestionAnswering model, with remove_input_padding enabled.
# TP=1 and TP=2 share the same build command
trtllm-build --checkpoint_dir ./${model_name}_${dtype}_tllm_checkpoint  \
--output_dir=${model_name}_engine_outputs \
--remove_input_padding=enable \
--bert_attention_plugin=${dtype} \
--max_batch_size 8 \
--max_input_len 512

Run TensorRT engine(s)

Run a TensorRT-LLM BERT model using the engines generated by the build command above. Note that during model deployment, only the TensorRT engine files are needed; the previously downloaded model checkpoints and converted weights can be removed.

run.py provides an example of performing inference and decoding the output. By default, it uses a task-specific dataset as the input text, for example 'squad_v2' for BertForQuestionAnswering.
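
If you want to see what the default inputs look like, the dataset can be inspected directly with the HuggingFace datasets package. This is only an optional inspection step and is not part of run.py; it assumes the datasets package is installed.

# Optional: peek at one sample of the default QA input data (not part of run.py).
python -c "from datasets import load_dataset; print(load_dataset('squad_v2', split='validation')[0])"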

To run the TensorRT engine, the basic command is:

python run.py --engine_dir ./${model_name}_engine_outputs \
--hf_model_dir $hf_model_dir  # used for loading the tokenizer

Please note that:

  • To use remove input padding, add --remove_input_padding to the command. This flag tells the runtime how to process the input and decode the output.

  • To compare the results with the HuggingFace model, add --run_hf_test to the command. The runtime will load the HF model from hf_model_dir and compare the results. Refer to run.py for more details.

Continuing the BertForQuestionAnswering example:

# Run TP=1 inference with the TensorRT engine
python run.py \
--engine_dir ./${model_name}_engine_outputs \
--hf_model_dir=$hf_model_dir \
--remove_input_padding \
--run_hf_test

# Run TP=2 inference (two MPI ranks, one GPU per rank)
mpirun -n 2 \
python run.py \
--engine_dir ./${model_name}_engine_outputs \
--hf_model_dir=$hf_model_dir \
--remove_input_padding \
--run_hf_test