GreenBit LLaMA

This is GreenBitAI's research code for running 2-bit and 1-bit LLaMA models with extreme compression while maintaining strong performance. The quantized models are available in our model zoo.

This is meant to be a research demo of model quality; no inference speed-up is implemented yet.
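
The W2A16 gN notation used throughout this README means 2-bit weights, 16-bit activations, and quantization groups of N weights sharing one set of scaling statistics. The snippet below is a minimal, generic sketch of group-wise 2-bit weight quantization to make that concrete; it only illustrates the technique family and is not the quantizer used to produce the released checkpoints.

```python
# Minimal, generic sketch of group-wise 2-bit weight quantization.
# NOTE: illustrative only -- this is NOT the quantizer used in this repository.
import torch

def quantize_2bit_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize a (rows, cols) FP weight matrix to 2-bit codes with per-group min/scale."""
    rows, cols = w.shape
    assert cols % group_size == 0
    groups = w.reshape(rows, cols // group_size, group_size)
    # Per-group asymmetric min/max quantization onto the 4 levels {0, 1, 2, 3}.
    w_min = groups.min(dim=-1, keepdim=True).values
    w_max = groups.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 3.0   # 2 bits -> 2**2 - 1 = 3 steps
    codes = torch.round((groups - w_min) / scale).clamp(0, 3).to(torch.uint8)
    # A real kernel would pack four 2-bit codes per byte; uint8 is kept here for clarity.
    return codes, scale, w_min

def dequantize_2bit_groupwise(codes, scale, w_min, shape):
    """Reconstruct an approximate FP weight matrix from codes and group statistics."""
    return (codes.float() * scale + w_min).reshape(shape)

if __name__ == "__main__":
    w = torch.randn(4, 64)
    codes, scale, w_min = quantize_2bit_groupwise(w, group_size=32)
    w_hat = dequantize_2bit_groupwise(codes, scale, w_min, w.shape)
    print("mean abs error:", (w - w_hat).abs().mean().item())
```

Smaller groups (e.g. g8 vs. g32) keep more scaling statistics and therefore give lower quantization error at the cost of a larger checkpoint, which is the trade-off visible in the tables below.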

Roadmap

Over the next few months, we will continue to release 2-bit and 1-bit versions of LLaMA models. We are also considering providing low-bit versions of other open-source LLMs in the future.

Latest Updates

[12/14/2023] We are happy to release the lossless (<1% degradation) W4A16 01-Yi models (low_bit_yi branch). The 2-bit version will be released soon.

[10/04/2023] We are happy to release the W2A16 g8/32 TinyLLaMA-1.1B models.

[09/29/2023] We are happy to release the W2A16 g8 LLaMA-1 30B and LLaMA-2 70B models.

[09/12/2023] We are happy to announce the release of the 2-bit LLaMA-2 7B (W2A16 g32/g8) models.

[08/31/2023] We are happy to release harness benchmark results on 14 zero-shot tasks for our 2-bit models. Happy trying 😃🚀.

[08/16/2023] We are happy to release the 2-bit OpenLLaMA 3B models, quantized to a 2-bit representation while retaining strong performance 😃⭐.

Pretrained Model

| LLM Models | Method | Bits | Groupsize | Wikitext2 | C4 | Checkpoint Size (GiB) |
|---|---|---|---|---|---|---|
| LLaMA-2-70B [1] | FP16 | 16 | - | 3.31 | 5.70 | 130 |
| | Ours | 2 | 8 | 3.87 | 5.96 | 26.9 |
| LLaMA-1-30B [1] | FP16 | 16 | - | 4.10 | 5.98 | 60.5 |
| | Ours | 2 | 8 | 4.75 | 6.57 | 12.9 |
| LLaMA-2-7B [1] | FP16 | 16 | - | 5.47 | 6.97 | 12.5 |
| | GPTQ [2] | 4 | 128 | 5.61 | 7.12 | 3.6 |
| | GPTQ [2] | 2 | 128 | 2.2e5 | 1.7e5 | 2.2 |
| | OmniQuant [3] | 4 | 128 | 5.58 | 7.12 | 3.8 |
| | OmniQuant [3] | 3 | 128 | 6.03 | 7.35 | 3.2 |
| | OmniQuant [3] | 2 | 128 | 12.84 | 17.40 | 2.2 |
| | OmniQuant [3] | 2 | 64 | 10.56 | 13.77 | - |
| | Ours | 4 | 32 | 5.55 | 7.08 | 3.7 |
| | Ours | 2 | 8 | 6.09 | 7.63 | 2.9 |
| | Ours | 2 | 32 | 7.13 | 8.67 | 2.2 |
| LLaMA-1-7B [4] | FP16 | 16 | - | 5.67 | 7.07 | 12.5 |
| | GPTQ [2] | 4 | 128 | 5.85 | 7.21 | 3.6 |
| | GPTQ [2] | 3 | 128 | 6.61 | 7.85 | 3.0 |
| | OmniQuant [3] | 2 | 128 | 10.53 | 13.89 | 2.2 |
| | Ours | 2 | 32 | 7.59 | 8.96 | 2.2 |
| LLaMA 3B [5] | FP16 | 16 | - | 7.34 | 9.33 | 6.8 |
| | GPTQ [2] | 4 | 128 | 7.54 | 9.58 | 1.9 |
| | Ours | 4 | 32 | 7.43 | 9.51 | 2.0 |
| | Ours | 2 | 8 | 8.32 | 10.56 | 1.5 |
| | Ours | 2 | 16 | 8.92 | 11.29 | 1.3 |
| | Ours | 2 | 32 | 9.82 | 12.14 | 1.2 |
| TinyLLaMA 1.1B [6] | FP16 | 16 | - | 9.10 | 10.6 | 4.0 |
| | Ours | 2 | 8 | 9.99 | 11.75 | 0.6 |
| | Ours | 2 | 32 | 12.04 | 14.27 | 0.5 |
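
In the table above, Wikitext2 and C4 report perplexity on the respective test sets (lower is better). The snippet below is a minimal sketch of how causal-LM perplexity is commonly measured with Hugging Face transformers over fixed 2048-token windows; the model id is a placeholder FP16 baseline, and this is not the repository's evaluate.py.

```python
# Generic sketch of Wikitext2 perplexity measurement for a causal LM.
# Assumptions: Hugging Face transformers/datasets, 2048-token windows, placeholder model id.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"   # placeholder FP16 baseline for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

# Concatenate the whole test split and slice it into fixed-length windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.cuda()

seq_len, nlls = 2048, []
n_windows = ids.shape[1] // seq_len
with torch.no_grad():
    for i in range(n_windows):
        window = ids[:, i * seq_len : (i + 1) * seq_len]
        # The model shifts labels internally; loss is the mean NLL per predicted token.
        loss = model(window, labels=window).loss
        nlls.append(loss.float() * seq_len)

ppl = torch.exp(torch.stack(nlls).sum() / (n_windows * seq_len))
print(f"wikitext2 perplexity: {ppl.item():.2f}")
```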

Fine-tuned Model

| LLM Models | Method | Bits | Checkpoint Size (GiB) |
|---|---|---|---|
| LLaMA-2-70B-Chat [1] | FP16 | 16 | 130 |
| | Ours | 2 | 26.9 |
| CodeLLaMA-34B [7] | FP16 | 16 | 63 |
| | Ours | 2 | 13.5 |
| CodeLLaMA-34B-Python [7] | FP16 | 16 | 63 |
| | Ours | 2 | 13.5 |
| CodeLLaMA-34B-Instruction [7] | FP16 | 16 | 63 |
| | Ours | - | - |

Zero-Shot Evaluation

| Task | Metric | TinyLLaMA 1.1B q2g32 | TinyLLaMA 1.1B q2g8 | LLaMA 3B q2g32 | LLaMA 3B q2g16 | LLaMA 3B q2g8 | LLaMA-1 7B q2g32 | LLaMA-2 7B q2g32 | LLaMA-2 7B q2g8 | LLaMA 1.1B FP16 | LLaMA 3B FP16 | LLaMA-1 7B FP16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| openbookqa | acc | 0.152 | 0.192 | 0.196 | 0.238 | 0.242 | 0.224 | 0.246 | 0.296 | 0.208 | 0.27 | 0.29 |
| | acc_norm | 0.328 | 0.338 | 0.332 | 0.358 | 0.362 | 0.388 | 0.376 | 0.4 | 0.368 | 0.4 | 0.41 |
| arc_challenge | acc | 0.3268 | 0.2278 | 0.279 | 0.2978 | 0.3148 | 0.3422 | 0.3268 | 0.3618 | 0.243 | 0.34 | 0.39 |
| | acc_norm | 0.3387 | 0.273 | 0.2944 | 0.3319 | 0.3345 | 0.3387 | 0.3387 | 0.372 | 0.288 | 0.37 | 0.41 |
| hellaswag | acc | 0.34 | 0.3769 | 0.4238 | 0.444 | 0.462 | 0.4996 | 0.4961 | 0.5379 | 0.403 | 0.49 | 0.68 |
| | acc_norm | 0.4097 | 0.4711 | 0.5685 | 0.5988 | 0.6242 | 0.6447 | 0.6464 | 0.7014 | 0.503 | 0.67 | 0.73 |
| piqa | acc | 0.6518 | 0.6931 | 0.7024 | 0.716 | 0.7291 | 0.7476 | 0.7503 | 0.7715 | 0.71 | 0.75 | 0.78 |
| | acc_norm | 0.6393 | 0.6812 | 0.7116 | 0.7247 | 0.7312 | 0.7443 | 0.7421 | 0.7568 | 0.688 | 0.76 | 0.78 |
| arc_easy | acc | 0.4411 | 0.5109 | 0.5997 | 0.646 | 0.6528 | 0.6061 | 0.6174 | 0.6254 | 0.533 | 0.69 | 0.68 |
| | acc_norm | 0.3716 | 0.412 | 0.5417 | 0.58 | 0.5972 | 0.4566 | 0.4781 | 0.4958 | 0.43 | 0.65 | 0.52 |
| winogrande | acc | 0.532 | 0.5249 | 0.5683 | 0.5888 | 0.6054 | 0.6283 | 0.6298 | 0.6582 | 0.558 | 0.62 | 0.68 |
| boolq | acc | 0.592 | 0.6174 | 0.6281 | 0.6636 | 0.6327 | 0.6425 | 0.7061 | 0.7242 | 0.583 | 0.68 | 0.75 |
| truthfulqa_mc | mc1 | 0.2338 | 0.2277 | 0.2509 | 0.2118 | 0.2252 | 0.224 | 0.2313 | 0.2399 | 0.228 | 0.22 | 0.21 |
| | mc2 | 0.4211 | 0.406 | 0.3962 | 0.3501 | 0.3625 | 0.3702 | 0.3854 | 0.3795 | 0.401 | 0.35 | 0.34 |
| anli_r1 | acc | 0.363 | 0.336 | 0.337 | 0.334 | 0.344 | 0.331 | 0.333 | 0.363 | 0.354 | 0.33 | 0.35 |
| anli_r2 | acc | 0.331 | 0.346 | 0.335 | 0.332 | 0.331 | 0.326 | 0.349 | 0.347 | 0.341 | 0.32 | 0.34 |
| anli_r3 | acc | 0.3758 | 0.3633 | 0.3358 | 0.3383 | 0.3425 | 0.3417 | 0.36 | 0.3733 | 0.358 | 0.35 | 0.37 |
| wic | acc | 0.5 | 0.5 | 0.4984 | 0.5094 | 0.4969 | 0.4984 | 0.4953 | 0.489 | 0.5 | 0.48 | 0.5 |
| rte | acc | 0.4874 | 0.4874 | 0.5596 | 0.5993 | 0.5632 | 0.639 | 0.6065 | 0.6426 | 0.516 | 0.58 | 0.56 |
| record | f1 | 0.7608 | 0.8023 | 0.8502 | 0.8625 | 0.8687 | 0.8859 | 0.8872 | 0.9037 | 0.82 | 0.88 | 0.91 |
| | em | 0.753 | 0.7934 | 0.8427 | 0.8545 | 0.8612 | 0.8781 | 0.8801 | 0.8959 | 0.818 | 0.89 | 0.91 |
| Average | | 0.438 | 0.4498 | 0.4881 | 0.5037 | 0.5087 | 0.5122 | 0.5181 | 0.5391 | 0.469 | 0.528 | 0.5519 |
| Model size (GiB) | | 0.5 | 0.6 | 1.2 | 1.3 | 1.5 | 2.2 | 2.2 | 2.9 | 4.4 | 6.8 | 12.5 |
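
The acc and acc_norm metrics follow the usual evaluation-harness convention: each answer choice is scored by the model's log-likelihood given the question, and acc_norm additionally normalizes that score by the length of the choice text. The sketch below illustrates this scoring rule for one hypothetical multiple-choice item; the model id and the question are placeholders, and the numbers reported above come from the harness itself, not from this code.

```python
# Generic sketch of log-likelihood multiple-choice scoring (acc vs. acc_norm style).
# Illustration only; model id and the example item are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"   # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

def choice_loglikelihood(context: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the context."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.cuda()
    full_ids = tok(context + choice, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        logits = model(full_ids).logits.log_softmax(dim=-1)
    n_choice = full_ids.shape[1] - ctx_ids.shape[1]
    tgt = full_ids[0, -n_choice:]
    pred = logits[0, -n_choice - 1 : -1, :]           # each token predicted from the previous position
    return pred[torch.arange(n_choice, device=pred.device), tgt].sum().item()

context = "Question: What do plants need to perform photosynthesis?\nAnswer:"
choices = [" sunlight", " darkness", " gravel", " plastic"]
scores = [choice_loglikelihood(context, c) for c in choices]

acc_pick = max(range(len(choices)), key=lambda i: scores[i])                                   # raw log-likelihood
acc_norm_pick = max(range(len(choices)), key=lambda i: scores[i] / len(choices[i].encode()))   # length-normalized
print("acc choice:", choices[acc_pick], "| acc_norm choice:", choices[acc_norm_pick])
```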

Requirements

Inference currently requires a machine with CUDA installed. Install the dependencies with:

pip install -r requirements.txt

Try the model

Use the environment variable CUDA_VISIBLE_DEVICES to select the correct GPU. Multi-GPU inference is not supported, but the models are heavily compressed, so a single GPU should be enough. Predefined scripts for evaluation, text generation, and instruction-following chat are provided in scripts/:

bash scripts/evaluate/tiny_llama_w2a16g32.sh          # open-task evaluation of the base model
bash scripts/inference/llama2_70b_w2a16g8.sh          # text-generation inference with the base model
bash scripts/instruction-chat/llama2_70b_w2a16g8.sh   # instruction-following chat with the fine-tuned model
bash scripts/inference/codellama_34b_w2a16g8.sh       # text-generation inference with the CodeLLaMA model
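
For orientation, the sketch below shows the general shape of such a single-GPU generation run: pin one GPU via CUDA_VISIBLE_DEVICES, load a model, and call generate. The model id is a placeholder FP16 checkpoint; loading the 2-bit checkpoints is handled by inference.py and the scripts above, so treat this only as an illustration of the flow.

```python
# Conceptual sketch of a single-GPU text-generation run.
# The supported entry points are the shell scripts above (which wrap inference.py);
# this snippet only illustrates the overall flow with standard transformers APIs.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # select one GPU before importing torch

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openlm-research/open_llama_3b"   # placeholder; the 2-bit checkpoints live in the model zoo
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()

prompt = "The key idea behind 2-bit weight quantization is"
inputs = tok(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95, temperature=0.7)
print(tok.decode(out[0], skip_special_tokens=True))
```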

References

This code is based on:

  • GPTQ for LLaMA
  • Alpaca_lora_4bit

Thanks to Meta AI for releasing LLaMA, a powerful LLM.

Citation

If you use our approach in your research, please cite our work as follows:

@article{low_bit_llama,
  title={Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs},
  author={Guo, Nianhui and Bethge, Joseph and Hu, Ting and Meinel, Christoph and Yang, Haojin},
  journal={https://github.com/GreenBitAI/low_bit_llama},
  year={2023}
}

License

The original code was released under its respective licenses and copyrights, i.e.:

  • datautils.py and evaluate.py: GPTQ for LLaMA released under Apache 2.0 License
  • model.py, peft_tuners_lora.py and inference.py (basis for llama_2b_*.py files): Alpaca_lora_4bit released under MIT License

We release our changes and additions to these files under the Apache 2.0 License.

Footnotes

  1. LLaMA-2

  2. GPTQ

  3. OmniQuant

  4. LLaMA-1

  5. OpenLLaMA

  6. TinyLLaMA

  7. CodeLLaMA
