GPTQ-for-LLaMA

4-bit quantization of LLaMA using GPTQ

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method.

This code is based on GPTQ

This version adds support for the new features proposed by GPTQ:

Slightly adjusted preprocessing of C4 and PTB for more realistic evaluations (used in our updated results); can be activated via the flag --new-eval.
Two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). These fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.

This implementation supports --act-order, but it is very slow.
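
The column ordering behind --act-order can be illustrated with a small sketch. It assumes that "activation size" is measured as the sum of squared calibration activations per input feature (the diagonal of X^T X). The function name and the toy data are hypothetical, and the error-compensating GPTQ update that actually quantizes each column is not reproduced here.

```python
def act_order_permutation(activations):
    """Return column indices sorted by decreasing activation energy.

    activations: list of calibration rows, one value per input feature.
    This only computes the visiting order; it does not quantize anything.
    """
    n_cols = len(activations[0])
    # Per-column energy, i.e. the diagonal of X^T X.
    energy = [sum(row[j] ** 2 for row in activations) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: energy[j], reverse=True)


# Toy calibration batch (hypothetical values): column 1 has the largest activations.
calib = [
    [0.1, 2.0, -0.5],
    [0.2, -3.0, 0.4],
]
perm = act_order_permutation(calib)
print(perm)  # [1, 2, 0]
```

Columns with large activations are visited first, so they are quantized while the most remaining columns are still available to absorb the error.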

Result

LLaMA-7B

LLaMA-7B	Bits	group-size	memory(MiB)	Wikitext2	checkpoint size(GB)
FP16	16	-	13940	5.68	12.5
RTN	4	-	-	6.29	-
GPTQ	4	-	4740	6.09	3.5
GPTQ	4	128	4891	5.85	3.6
RTN	3	-	-	25.54	-
GPTQ	3	-	3852	8.07	2.7
GPTQ	3	128	4116	6.61	3.0

LLaMA-13B

LLaMA-13B	Bits	group-size	memory(MiB)	Wikitext2	checkpoint size(GB)
FP16	16	-	OOM	5.09	24.2
RTN	4	-	-	5.53	-
GPTQ	4	-	8410	5.36	6.5
GPTQ	4	128	8747	5.20	6.7
RTN	3	-	-	11.40	-
GPTQ	3	-	6870	6.63	5.1
GPTQ	3	128	7277	5.62	5.4

LLaMA-33B

LLaMA-33B	Bits	group-size	memory(MiB)	Wikitext2	checkpoint size(GB)
FP16	16	-	OOM	4.10	60.5
RTN	4	-	-	4.54	-
GPTQ	4	-	19493	4.45	15.7
GPTQ	4	128	20570	4.23	16.3
RTN	3	-	-	14.89	-
GPTQ	3	-	15493	5.69	12.0
GPTQ	3	128	16566	4.80	13.0

LLaMA-65B

LLaMA-65B	Bits	group-size	memory(MiB)	Wikitext2	checkpoint size(GB)
FP16	16	-	OOM	3.53	121.0
RTN	4	-	-	3.92	-
GPTQ	4	-	OOM	3.84	31.1
GPTQ	4	128	OOM	3.65	32.3
RTN	3	-	-	10.59	-
GPTQ	3	-	OOM	5.04	23.6
GPTQ	3	128	OOM	4.17	25.6
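
To make the Bits and group-size columns concrete, here is a minimal sketch of the round-to-nearest (RTN) baseline with an optional group size. It assumes an asymmetric min-max grid per group; the function name and the toy weights are hypothetical, and the repository's own quantizer may differ in detail.

```python
def rtn_quantize(weights, bits=4, groupsize=None):
    """Round-to-nearest quantization; returns the dequantized values.

    Each group of `groupsize` consecutive weights gets its own scale and
    zero point. groupsize=None uses one grid for the whole list.
    """
    if groupsize is None:
        groupsize = len(weights)
    maxq = 2 ** bits - 1  # largest integer level, e.g. 15 for 4 bits
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / maxq or 1.0  # fall back to 1.0 for constant groups
        for w in group:
            q = min(maxq, max(0, round((w - lo) / scale)))
            out.append(q * scale + lo)
    return out


def max_abs_error(original, approx):
    return max(abs(a - b) for a, b in zip(original, approx))


# Toy weights (hypothetical): two clusters with very different ranges.
weights = [0.0, 0.1, 0.2, 0.3, 10.0, 20.0, 30.0, 40.0]
err_whole = max_abs_error(weights, rtn_quantize(weights, bits=4))
err_grouped = max_abs_error(weights, rtn_quantize(weights, bits=4, groupsize=4))
print(err_whole, err_grouped)
```

With a single grid the small weights collapse onto the same level, while per-group grids track each cluster separately. The extra scales and zero points are stored alongside the weights, which is consistent with the group-size 128 rows showing slightly larger checkpoints in the tables.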

Quantization requires a large amount of CPU memory. However, the memory required can be reduced by using swap memory.

Depending on the GPUs/drivers, there may be a difference in performance, which decreases as the model size increases (IST-DASLab/gptq#1).

According to the GPTQ paper, as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.
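
The Wikitext2 column in the tables reports perplexity (PPL), where lower is better. As a reminder of what that number means, the sketch below computes perplexity as the exponential of the mean per-token negative log-likelihood. The function name and the toy inputs are hypothetical; this is not the repository's evaluation code.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Toy example: a model that assigns probability 1/4 to every token
# has a perplexity of about 4, as if choosing among 4 equally likely tokens.
nlls = [math.log(4.0)] * 10
ppl = perplexity(nlls)
print(round(ppl, 6))
```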

Installation

If you don't have conda, install it first.

conda create --name gptq python=3.9 -y
conda activate gptq
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
# Or, if you're having trouble with conda, use pip with python3.9:
# pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install

# Benchmark performance for FC2 layer of LLaMa-7B
CUDA_VISIBLE_DEVICES=0 python test_kernel.py

Dependencies

torch: tested on v2.0.0+cu117
transformers: tested on v4.28.0.dev0
datasets: tested on v2.10.1
safetensors: tested on v0.3.0
(to run 4-bit kernels: setup for compiling PyTorch CUDA extensions, see also https://pytorch.org/tutorials/advanced/cpp_extension.html, tested on CUDA 11.7)
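
To compare a local environment against the tested versions listed above, a small check along these lines may help. It is a sketch that uses only the standard library (importlib.metadata, Python 3.8+); the helper name is hypothetical, and a version mismatch does not necessarily mean the code will fail.

```python
from importlib import metadata

# Versions the README reports as tested.
TESTED_VERSIONS = {
    "torch": "2.0.0+cu117",
    "transformers": "4.28.0.dev0",
    "datasets": "2.10.1",
    "safetensors": "0.3.0",
}


def compare_versions(tested=TESTED_VERSIONS):
    """Map each package name to (installed_version_or_None, tested_version)."""
    report = {}
    for name, tested_version in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, tested_version)
    return report


for name, (installed, tested_version) in compare_versions().items():
    shown = installed if installed is not None else "not installed"
    print(f"{name}: installed={shown}, tested={tested_version}")
```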

All experiments were run on a single NVIDIA RTX3090.

Language Generation

LLaMA

# Convert LLaMA weights to Hugging Face (hf) format
python convert_llama_weights_to_hf.py --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir ./llama-hf

# Benchmark language generation with 4-bit LLaMA-7B:

# Save compressed model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt
# Or save compressed `.safetensors` model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors llama7b-4bit-128g.safetensors

# Benchmark generating a 2048 token sequence with the saved model
CUDA_VISIBLE_DEVICES=0 python llama.py ./llama-hf/llama-7b c4 --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --benchmark 2048 --check
# Benchmark FP16 baseline, note that the model will be split across all listed GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python llama.py ./llama-hf/llama-7b c4 --benchmark 2048 --check

# model inference with the saved model
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
# model inference with the saved model using safetensors loaded direct to gpu
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.safetensors --text "this is llama" --device=0
# model inference with the saved model with offload (this is very slow; it is a simple implementation and could be improved with technologies like FlexGen: https://github.com/FMInference/FlexGen)
CUDA_VISIBLE_DEVICES=0 python llama_inference_offload.py ./llama-hf/llama-7b --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama" --pre_layer 16
With LLaMa-65B and pre_layer set to 50, it takes about 180 seconds to generate 45 tokens (5 -> 50 tokens) on a single RTX3090.
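
The effect of --pre_layer can be pictured with the sketch below. It assumes that --pre_layer is the number of decoder layers kept resident on the GPU, with the remaining layers held in CPU memory and moved over only when needed. That reading, the helper name, and the layer count of 80 for LLaMA-65B are assumptions not stated in this README; check llama_inference_offload.py for the actual behavior.

```python
def split_layers(num_layers, pre_layer):
    """Return (resident_on_gpu, offloaded_to_cpu) lists of layer indices.

    Assumption: the first `pre_layer` layers stay on the GPU and the rest
    are offloaded. pre_layer is clamped to the range [0, num_layers].
    """
    kept = min(max(pre_layer, 0), num_layers)
    return list(range(kept)), list(range(kept, num_layers))


# LLaMA-65B is assumed to have 80 decoder layers.
gpu_layers, cpu_layers = split_layers(num_layers=80, pre_layer=50)
print(len(gpu_layers), len(cpu_layers))  # 50 30
```

Under this assumption, a larger pre_layer means fewer layers are transferred per token but more GPU memory is used.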

In general, 4-bit quantization with a groupsize of 128 is recommended.
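
For scripting, the recommended settings can be assembled into the same llama.py command shown earlier. This is a sketch: the helper name and its defaults are hypothetical, and it only uses flags that appear in the commands above. It builds the argument list without running it.

```python
import shlex


def build_quantize_command(model_dir, dataset="c4", wbits=4, groupsize=128,
                           save_path="llama7b-4bit-128g.safetensors"):
    """Build the llama.py quantization command as an argument list."""
    args = [
        "python", "llama.py", model_dir, dataset,
        "--wbits", str(wbits),
        "--true-sequential", "--act-order",
        "--groupsize", str(groupsize),
    ]
    # Choose the save flag from the file extension.
    if save_path.endswith(".safetensors"):
        args += ["--save_safetensors", save_path]
    else:
        args += ["--save", save_path]
    return args


cmd = build_quantize_command("./llama-hf/llama-7b")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, with CUDA_VISIBLE_DEVICES set through its env argument to select the GPU.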

Acknowledgements

This code is based on GPTQ

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

Triton GPTQ kernel code is based on GPTQ-triton
