This document shows how to run a quick benchmark of a LLaMA model with TensorRT-LLM on a single-GPU, single-node Windows machine.
The TensorRT-LLM LLaMA example code is located in `examples/llama`
and contains detailed instructions on how to build TensorRT engine(s) and perform inference with the LLaMA model. Please consult the instructions in that folder for details.
Please refer to these instructions for AWQ weight quantization and engine generation. Sample INT4 AWQ quantized weights can be downloaded from llama2-7b-int4-chat, llama2-13b-int4-chat, code-llama-13b-int4-instruct, and mistral-7b-int4-chat.
Instead, here we showcase how to run a quick benchmark using the provided `benchmark.py`
script. This script builds, runs, and benchmarks an INT4-GPTQ quantized LLaMA model using TensorRT-LLM.
```
pip install pydantic pynvml
python benchmark.py --model_dir .\tmp\llama\7B\ --quant_ckpt_path .\llama-7b-4bit-gs128.safetensors --engine_dir .\engines
```
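To give a rough sense of what a benchmark like this measures, the sketch below times repeated generation calls and reports GPU memory usage via `pynvml`. It is a minimal illustration only, not the actual `benchmark.py`; the `run_inference` stub is a hypothetical stand-in for a real TensorRT-LLM generation call.

```python
# Minimal benchmarking sketch (illustrative only; not the actual benchmark.py).
import time

import pynvml


def run_inference() -> None:
    # Hypothetical placeholder for engine execution,
    # e.g. a TensorRT-LLM generate() call.
    time.sleep(0.01)


def main() -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 on a single-GPU machine

    # Warm up once so one-time initialization cost is not measured.
    run_inference()

    iterations = 20
    start = time.perf_counter()
    for _ in range(iterations):
        run_inference()
    elapsed = time.perf_counter() - start

    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"avg latency: {elapsed / iterations * 1000:.2f} ms")
    print(f"GPU memory used: {mem.used / 1024**2:.0f} MiB")
    pynvml.nvmlShutdown()


if __name__ == "__main__":
    main()
```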
Here, `model_dir` is the path to the LLaMA HF model, `quant_ckpt_path` is the path to the quantized weights file, and `engine_dir` is the path where the generated engines and other artifacts are stored. Please check the instructions here to generate a quantized weights file.
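For reference, the three flags map naturally onto a small `argparse` interface. The sketch below is an assumption about how a script like `benchmark.py` might declare them, not a copy of its actual code.

```python
# Hypothetical argument declaration mirroring the three flags described above;
# benchmark.py's real interface may differ.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Quick LLaMA benchmark")
    parser.add_argument("--model_dir", required=True,
                        help="Path to the LLaMA HF model")
    parser.add_argument("--quant_ckpt_path", required=True,
                        help="Path to the quantized weights file (.safetensors)")
    parser.add_argument("--engine_dir", required=True,
                        help="Directory for generated engines and other artifacts")
    return parser.parse_args()


if __name__ == "__main__":
    print(parse_args())
```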