This document shows how to run a quick benchmark of a LLaMA model with TensorRT-LLM on a single-GPU, single-node Windows machine.
The TensorRT-LLM LLaMA example code is located in `examples/llama`
and contains detailed instructions on how to build TensorRT engine(s) and perform inference with the LLaMA model. Please consult the instructions in that folder for details.
Please refer to these instructions for AWQ weight quantization and engine generation. Sample INT4 AWQ quantized weights can be downloaded from llama2-7b-int4-chat, llama2-13b-int4-chat, code-llama-13b-int4-instruct, and mistral-7b-int4-chat.
Instead, here we showcase how to run a quick benchmark using the provided `benchmark.py`
script. This script builds, runs, and benchmarks an INT4-GPTQ quantized LLaMA model using TensorRT-LLM.
```
pip install pydantic pynvml
python benchmark.py --model_dir .\tmp\llama\7B\ --quant_ckpt_path .\llama-7b-4bit-gs128.safetensors --engine_dir .\engines
```
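To give a rough sense of what a benchmark like this measures, the sketch below times repeated generation calls and reports GPU memory usage via `pynvml`. It is a minimal illustration only, not the actual `benchmark.py`; the `run_inference` stub is a hypothetical stand-in for a real TensorRT-LLM generation call.

```python
# Minimal benchmarking sketch (illustrative only; not the actual benchmark.py).
import time

import pynvml


def run_inference() -> None:
    # Hypothetical placeholder for engine execution,
    # e.g. a TensorRT-LLM generate() call.
    time.sleep(0.01)


def main() -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 on a single-GPU machine

    # Warm up once so one-time initialization cost is not measured.
    run_inference()

    iterations = 20
    start = time.perf_counter()
    for _ in range(iterations):
        run_inference()
    elapsed = time.perf_counter() - start

    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"avg latency: {elapsed / iterations * 1000:.2f} ms")
    print(f"GPU memory used: {mem.used / 1024**2:.0f} MiB")
    pynvml.nvmlShutdown()


if __name__ == "__main__":
    main()
```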
Here, `model_dir` is the path to the LLaMA HF model, `quant_ckpt_path` is the path to the quantized weights file, and `engine_dir` is the path where the generated engines and other artifacts are stored. Please check the instructions here to generate a quantized weights file.
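For reference, the three flags map naturally onto a small `argparse` interface. The sketch below is an assumption about how a script like `benchmark.py` might declare them, not a copy of its actual code.

```python
# Hypothetical argument declaration mirroring the three flags described above;
# benchmark.py's real interface may differ.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Quick LLaMA benchmark")
    parser.add_argument("--model_dir", required=True,
                        help="Path to the LLaMA HF model")
    parser.add_argument("--quant_ckpt_path", required=True,
                        help="Path to the quantized weights file (.safetensors)")
    parser.add_argument("--engine_dir", required=True,
                        help="Directory for generated engines and other artifacts")
    return parser.parse_args()


if __name__ == "__main__":
    print(parse_args())
```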