LLaMA3-Quantization

LLaMA3-Quantization is the official implementation of our paper How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study [PDF]. Created by researchers from The University of Hong Kong, Beihang University and ETH Zürich.

Introduction

Meta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, the recently released LLaMA3 models achieve impressive performance across various tasks, backed by super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-width. This exploration has the potential to unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation that LLM compression suffers from. Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap under low bit-width that needs to be bridged in future developments. We expect this empirical study to prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they become practical. Our project is released at https://github.com/Macaronlin/LLaMA3-Quantization and quantized LLaMA3 models are released at https://huggingface.co/LLMQ.
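To make the notion of bit-width concrete, the sketch below applies a generic min-max round-to-nearest quantizer to synthetic Gaussian weights and reports the reconstruction error at several bit-widths. It is only an illustration under stated assumptions: the function names, the synthetic weights, and the per-tensor min-max scheme are chosen here for clarity and are not taken from this repository's code.

```python
import random


def quantize_rtn(weights, bits):
    """Min-max uniform round-to-nearest quantization (quantize, then dequantize)."""
    levels = 2 ** bits - 1                      # number of steps between min and max
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Snap each weight to the nearest grid point, then map back to real values.
    return [round((w - lo) / scale) * scale + lo for w in weights]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for one weight tensor (assumption: roughly Gaussian values).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

errors = {
    bits: mean_squared_error(weights, quantize_rtn(weights, bits))
    for bits in (8, 4, 3, 2, 1)
}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.3e}")
```

In this toy setting the error grows quickly as the bit-width shrinks, which is the regime the study focuses on. Actual model accuracy depends on the quantization method and calibration data, so this sketch says nothing about the reported results.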

Usage

We provide full scripts to evaluate various quantization methods in ./scripts/. Here we use LLaMA-3-8B with the IR-QLoRA method as an example:

python main.py \
    --model meta-llama/Meta-Llama-3-8B \
    --peft LLMQ/LLaMA-3-8B-IR-QLoRA \
    --tau_range 0.1 --tau_n 100 --blocksize 256 \
    --epochs 0 \
    --output_dir ./log/llama-3-8b-irqlora \
    --wbits 4 \
    --tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande
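When running the evaluation over several task lists or output directories, it can be convenient to assemble the same invocation programmatically. The following is a minimal sketch: `build_eval_command` is a hypothetical helper, not part of this repository, and it only uses the flags that appear in the example command, passing each flag and its value as separate arguments.

```python
import shlex


def build_eval_command(model, peft, output_dir, tasks, wbits=4,
                       tau_range=0.1, tau_n=100, blocksize=256, epochs=0):
    """Assemble the main.py invocation from the example as an argument list."""
    return [
        "python", "main.py",
        "--model", model,
        "--peft", peft,
        "--tau_range", str(tau_range),
        "--tau_n", str(tau_n),
        "--blocksize", str(blocksize),
        "--epochs", str(epochs),
        "--output_dir", output_dir,
        "--wbits", str(wbits),
        "--tasks", ",".join(tasks),   # main.py takes a comma-separated task list
    ]


cmd = build_eval_command(
    model="meta-llama/Meta-Llama-3-8B",
    peft="LLMQ/LLaMA-3-8B-IR-QLoRA",
    output_dir="./log/llama-3-8b-irqlora",
    tasks=["piqa", "arc_easy", "arc_challenge", "hellaswag", "winogrande"],
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To launch it from the repository root: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting and line-continuation mistakes, since every flag and value is a distinct argument.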

Results

Track1: Post-Training Quantization

Evaluation results of post-training quantization on LLAMA3-8B model.
Evaluation results of post-training quantization on LLAMA3-70B model.

Track2: LoRA-FineTuning Quantization

LoRA-FT on LLAMA3-8B with Alpaca dataset.

Related Projects

QuIP

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

AutoGPTQ

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

RPTQ: Reorder-Based Post-Training Quantization for Large Language Models

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

PB-LLM: Partially Binarized Large Language Models

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

QLoRA: Efficient Finetuning of Quantized LLMs

IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention

