LLaMA

This document shows how to run a quick benchmark of a LLaMA model with TensorRT-LLM on a single-GPU, single-node Windows machine.

Overview

The TensorRT-LLM LLaMA example code is located in examples/llama and contains detailed instructions on how to build TensorRT engine(s) and perform inference using the LLaMA model. Please consult the instructions in that folder for details.

Please refer to these instructions for AWQ weight quantization and engine generation. Sample int4 AWQ quantized weights can be downloaded from llama2-7b-int4-chat, llama2-13b-int4-chat, code-llama-13b-int4-instruct, mistral-7b-int4-chat.

This document does not repeat those instructions. Instead, it shows how to run a quick benchmark using the provided benchmark.py script. This script builds, runs, and benchmarks an INT4-GPTQ quantized LLaMA model using TensorRT.

pip install pydantic pynvml
python benchmark.py --model_dir .\tmp\llama\7B\ --quant_ckpt_path .\llama-7b-4bit-gs128.safetensors --engine_dir .\engines

Here, model_dir is the path to the LLaMA HF model, quant_ckpt_path is the path to the quantized weights file, and engine_dir is the directory where the generated engines and other artefacts are stored. Please check the instructions here to generate a quantized weights file.
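
A mistyped path is an easy way to lose time before the engine build even starts, so it can help to check the three arguments up front. The following is a minimal sketch, not part of the TensorRT-LLM repository: the helper name build_benchmark_command is hypothetical, and only the three flags shown in the command above are used. It assumes that model_dir must already exist as a directory, that quant_ckpt_path must be an existing file, and that creating engine_dir ahead of time is harmless (whether benchmark.py creates it on its own is not stated here).

```python
import sys
from pathlib import Path


def build_benchmark_command(model_dir, quant_ckpt_path, engine_dir,
                            script="benchmark.py"):
    """Validate the three paths and return the benchmark argument list.

    Hypothetical helper: it only assembles the command shown above and
    does not call into TensorRT-LLM itself.
    """
    model_dir = Path(model_dir)
    quant_ckpt_path = Path(quant_ckpt_path)
    engine_dir = Path(engine_dir)

    # Assumption: the HF model directory must exist before the build.
    if not model_dir.is_dir():
        raise FileNotFoundError(f"model_dir is not a directory: {model_dir}")

    # Assumption: the quantized weights are a single existing file.
    if not quant_ckpt_path.is_file():
        raise FileNotFoundError(f"quant_ckpt_path not found: {quant_ckpt_path}")

    # Assumption: pre-creating the output directory does no harm.
    engine_dir.mkdir(parents=True, exist_ok=True)

    return [
        sys.executable, script,
        "--model_dir", str(model_dir),
        "--quant_ckpt_path", str(quant_ckpt_path),
        "--engine_dir", str(engine_dir),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains benchmark.py. Building the command as a list rather than a single string avoids quoting problems with Windows paths that contain spaces.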