TesseraQ is a block reconstruction-based PTQ algorithm for Large Language Models, achieving state-of-the-art uniform quantization performance under INT2/INT3/INT4 formats.
- Oct 28, 2024: 🍺🍺🍺 We release our arXiv paper: TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction.
- Oct 26, 2024: Our method is integrated into the LLMC framework.
- We integrate our method into the LLMC framework, where it can easily be combined with existing PTQ algorithms. It also allows easy comparisons across different algorithms/models. 💥
- Easy initialization from AWQ, OmniQuant, or QuaRot for both weight-only quantization and weight-activation quantization. 💥
- Compatible with the original LLMC, so TesseraQ models can be exported to various quantization backends, including Huggingface, LightLLM, AutoAWQ, vLLM, and GPTQModel, for reduced memory footprint and faster inference. 💥
Our method has also been integrated into the official release of LLMC; feel free to use it there!
- Clone this repository and install packages:

```shell
# install packages
cd llmc
pip install -r requirements.txt
```
- Prepare models and data.

```shell
# After downloading LLMs from huggingface, prepare calibration and evaluation data as follows:
cd tools
python download_calib_dataset.py --save_path [calib data path]
python download_eval_dataset.py --save_path [eval data path]
```
- Choose a model and quantize it with TesseraQ:

```shell
# Here's an example for the LLaMA-2-7B model with W2A16g128 quantization:
cd scripts
# Modify the path of llmc, ``llmc_path``, in the bash file. You can also choose one config
# placed in ``llmc/configs/quantization/Awq/`` to quantize your model, or your own
# config referring to those we provide by changing the ``--config`` argument in run_awq_llama.sh.
bash run_awq_llama.sh
bash run_tesseraq_llama.sh
```
We provide the running scripts to reproduce our experiments.

```shell
cd scripts
sh run_llama2.sh
```
| Config | 7B | 13B | 70B |
|---|---|---|---|
| W2A16 | 8.05 | 6.55 | 5.26 |
| W2A16g128 | 6.82 | 5.92 | 4.73 |
| W2A16g64 | 6.67 | 5.81 | 4.60 |
| W3A16 | 5.84 | 5.16 | 3.68 |
| W3A16g128 | 5.71 | 5.11 | 3.61 |
| W4A16 | 5.56 | 4.96 | 3.40 |
(Note that the above scripts can also be used to reproduce results for LLaMA-7B/13B/30B/65B models.)
For LLaMA-3.1 models:

```shell
cd scripts
sh run_llama3_1.sh
```
| Config | 8B | 70B |
|---|---|---|
| W2A16g128 | 59.37 | 66.76 |
| W3A16g128 | 67.36 | 74.09 |
For LLaMA-3.2 models:

```shell
cd scripts
sh run_llama3_2.sh
```
| Model | Method | Bit | Wiki ppl. | Avg. Acc. | Scripts |
|---|---|---|---|---|---|
| LLaMA-3.2-1B | Pretrain | FP16 | 9.75 | 56.50 | - |
| LLaMA-3.2-1B | AWQ | W2g128 | 5475 | 35.42 | here |
| LLaMA-3.2-1B | TesseraQ | W2g128 | 18.61 | 43.36 | here |
| LLaMA-3.2-1B | AWQ | W3g128 | 16.69 | 49.85 | here |
| LLaMA-3.2-1B | TesseraQ | W3g128 | 11.08 | 53.24 | here |
| LLaMA-3.2-1B | AWQ | W4g128 | 10.85 | 54.68 | here |
| LLaMA-3.2-1B | TesseraQ | W4g128 | 10.09 | 54.98 | here |
| LLaMA-3.2-3B | Pretrain | FP16 | 7.81 | 63.57 | - |
| LLaMA-3.2-3B | AWQ | W2g128 | 495.2 | 38.15 | here |
| LLaMA-3.2-3B | TesseraQ | W2g128 | 11.94 | 51.53 | here |
| LLaMA-3.2-3B | AWQ | W3g128 | 10.21 | 59.94 | here |
| LLaMA-3.2-3B | TesseraQ | W3g128 | 8.45 | 61.58 | here |
| LLaMA-3.2-3B | AWQ | W4g128 | 8.25 | 62.83 | here |
| LLaMA-3.2-3B | TesseraQ | W4g128 | 7.96 | 63.63 | here |
To help users design their configs, we now explain some universal configurations present in all configs we provide under `llmc/configs/`:
- `model`:

```yaml
model:
    # Replace by the name of the class in ``llmc/models/*.py``.
    type: Llama
    # We set the path to LLaMA-2-7B.
    path: meta-llama/Llama-2-7b-hf
    torch_dtype: auto
```
- `calib`:

```yaml
# Note: some algorithms do not need ``calib``, like naive... So, you can remove this part.
calib:
    # Replace by the calibration data name, e.g., pileval, c4, wikitext2, or ptb, downloaded before.
    name: c4
    download: False
    # Replace by the path of one of the calibration data, e.g., pileval, c4, wikitext2, or ptb,
    # downloaded before.
    path: calib data path # ../cache/data/calib/c4
    n_samples: 512
    bs: 1
    seq_len: 2048
    # Replace by the function name in ``llmc/data/dataset/specified_preproc.py``.
    preproc: c4_gptq
    seed: *seed
```
- `eval`:

```yaml
# If you want to evaluate the PPL of your pretrained/transformed/fake_quant model.
eval:
    # You can evaluate the pretrain, transformed, and fake_quant models; set the positions
    # you want to evaluate.
    eval_pos: [pretrain, transformed, fake_quant]
    # Replace by the name of the eval data, e.g., c4, wikitext2, ptb or [c4, wikitext2],
    # downloaded before.
    name: [wikitext2, c4]
    download: False
    path: eval data path
    # For 70B model eval, bs can be set to 20, and inference_per_block can be set to True.
    # For 7B / 13B model eval, bs can be set to 1, and inference_per_block can be set to False.
    bs: 1
    inference_per_block: False
    seq_len: 2048
```
- `save`:

```yaml
save:
    # If ``save_fp`` is True, the transformed model is exported, i.e., a parameter-modified
    # model whose structure and performance match the original model; applying naive
    # quantization to it reproduces the performance of the algorithm-quantized model.
    save_fp: False
    # If ``save_lightllm`` is True, a real quantized model is exported, i.e.,
    # low-bit weights with weight and activation quantization parameters.
    save_lightllm: False
    # If ``save_fake`` is True, a fake-quantized model is exported, i.e.,
    # dequantized weights with activation quantization parameters.
    save_fake: False
    save_path: ./save
```
- `quant`:

```yaml
quant:
    # Replace by the class name in ``llmc/compression/quantization/*.py``.
    method: TesseraQ
    # Weight-only quantization does not have an ``act`` part.
    weight:
        bit: 2
        symmetric: False
        # Quantization granularity: per_channel, per_tensor, per_head (not recommended).
        granularity: per_group
        group_size: 128 # set to -1 if per_channel
        # Calibration algorithms: learnable, mse, and minmax (default).
        calib_algo: minmax
    # Specify an ``act`` quantization configuration here for weight-activation quantization.
    # This part is designed for specific algorithms, thus we define the TesseraQ calibration parameters here.
    special:
        lr: 0.001 # learning rate for rounding variables
        iterations: 250 # training iterations for each round of PAR
        wd: 0.0 # weight decay, set to 0
        batch_size: 4 # batch size for calibration
        deactive_amp: False # use fp16 training if False
        aug_loss: False # legacy parameter from OmniQuant, always False
        optimize_scale: True # enable dequantization scale tuning
        scale_lr: 0.001 # learning rate for dequantization scale tuning, set to the same value as lr
        # handcrafted threshold schedule during PAR
        thresholds: [0.8, 0.65, 0.5, 0.43, 0.38, 0.34, 0.3, 0.27, 0.24, 0.21, 0.18, 0.15, 0.12, 0.10, 0.08, 0.06, 0.04, 0.02, 0.01, 0.005]
        weight_clip: True # for online clipping of weights or loading AWQ/OmniQuant clips
        load_transform: True # for online loading of AWQ scale transformations
        # Used together with ``weight_clip``; v1 is for online clipping, v2 is for loading pretrained clips.
        clip_version: v1
        reduce_memory: True # restore block to fp16 after calibration, helps reduce CPU memory
        # paths to saved transformation scales and weight clipping values
        scale_path: ../cache/activations/L2_7b/awq_w2g128
        clip_path: ../cache/activations/L2_7b/awq_w2g128
        # parameters for QuaRot initialization, set to True if using QuaRot models
        online_rotate: False
        fp32_had: False
    # If quant_out is True, employ the outputs of the former quantized block as the
    # calibration data of the proceeding block.
    quant_out: True # always True for TesseraQ
```
There are two ways to apply AWQ initialization for TesseraQ. The first is to save the AWQ transformations/scales and then apply them on the fly before TesseraQ calibration in each block. The second is to directly save the transformed LLM checkpoint and reload it for TesseraQ.
For the first method, set `save_scale` and `save_clip` to True and specify their save paths in the AWQ configuration, for example:

```yaml
save_scale: True
clip_version: v2
scale_path: ../cache/activations/L2_7b/awq_w2g128
save_clip: True
clip_path: ../cache/activations/L2_7b/awq_w2g128
```
Then, in the TesseraQ configuration, enable `load_transform` and `weight_clip` and specify the saved paths of the clips/scales:

```yaml
weight_clip: True
load_transform: True
clip_version: v2
scale_path: ../cache/activations/L2_7b/awq_w2g128
clip_path: ../cache/activations/L2_7b/awq_w2g128
```
Note that when `clip_version: v2`, the `calib_algo` of weight quantization should be set to `learnable`.
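For example, the `weight` section of the W2A16g128 example above would then read (a minimal sketch; all other values unchanged):

```yaml
weight:
    bit: 2
    symmetric: False
    granularity: per_group
    group_size: 128
    calib_algo: learnable # set to learnable when clip_version is v2
```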
If we choose `clip_version: v1`, TesseraQ will perform AWQ weight clipping on the fly instead of loading the saved clips, which may achieve better perplexity in low-bit cases.
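For reference, a minimal sketch of the corresponding `special` entries under v1, taken from the W2A16g128 example config above:

```yaml
# v1: search the weight clipping values on the fly instead of loading pretrained clips
weight_clip: True
load_transform: True # still load the saved AWQ scale transformations
clip_version: v1
scale_path: ../cache/activations/L2_7b/awq_w2g128
```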
For the second method, make sure to use `clip_version: v1` and simply enable `save_transformed` in the AWQ configuration. Next, change `model/path` in the TesseraQ configuration to the saved checkpoint, without enabling `load_transform` or `weight_clip`.
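A rough sketch of the two configs involved; the exact placement of `save_transformed` and the checkpoint path below are assumptions, so check the configs under `llmc/configs/quantization/` for the precise layout:

```yaml
# AWQ config (sketch): clip online and export the transformed checkpoint
clip_version: v1
save_transformed: True
save_path: ./save # hypothetical output directory

# TesseraQ config (sketch): start from the exported checkpoint,
# so no online AWQ initialization is needed
model:
    type: Llama
    path: ./save/transformed_model # hypothetical path to the exported checkpoint
load_transform: False
weight_clip: False
```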
Since OmniQuant only optimizes the weight clipping values for LLaMA weight-only quantization, it is easy to reuse their pretrained values. First, download the pretrained OmniQuant clips here, and then specify the parameters in the TesseraQ configuration:
```yaml
weight_clip: True
load_transform: False # no scale transformation in OmniQuant-LLaMA
clip_version: v2
clip_path: ../cache/activations/L2_7b/omniq_w2
```
Note that in most cases we observe that AWQ initialization is better than OmniQuant, except for W2A16 per-channel quantization.
We recommend saving the QuaRot checkpoint and reloading it for TesseraQ quantization, since the QuaRot transformation can be done once and reused for all bitwidth settings. To do so, simply enable `save_transformed` in the QuaRot configuration. Then load the saved checkpoint for TesseraQ and enable `online_rotate` as well as `fp32_had` in the configuration.
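Putting this together, a minimal sketch of the relevant entries; the `save_transformed` placement and the checkpoint path are assumptions:

```yaml
# QuaRot config (sketch): export the rotated checkpoint once
save_transformed: True
save_path: ./save # hypothetical output directory

# TesseraQ config (sketch): load the rotated checkpoint and keep the rotations online
model:
    type: Llama
    path: ./save/quarot_transformed # hypothetical path to the saved QuaRot checkpoint
# in the ``special`` section:
online_rotate: True
fp32_had: True
```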
If you find our TesseraQ paper useful or relevant to your research, please kindly cite our paper:
```bibtex
@misc{li2024tesseraq,
      title={TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction},
      author={Yuhang Li and Priyadarshini Panda},
      year={2024},
      eprint={2410.19103},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
Also consider citing the LLMC framework paper:
```bibtex
@misc{gong2024llmcbenchmarkinglargelanguage,
      title={LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit},
      author={Ruihao Gong and Yang Yong and Shiqiao Gu and Yushi Huang and Chentao Lv and Yunchen Zhang and Xianglong Liu and Dacheng Tao},
      year={2024},
      eprint={2405.06001},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2405.06001},
}
```