Awesome LLM compression research papers and tools to accelerate LLM training and inference.
- A Survey on Model Compression for Large Language Models, Arxiv 2023 [Paper]
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers, NeurIPS 2022 [Paper] [Code]
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, NeurIPS 2022 [Paper] [Code]
- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models, Arxiv 2022 [Paper]
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, ICML 2023 [Paper] [Code]
- FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization, ICML 2023 [Paper] [Code (DeepSpeed)]
- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases, ICML 2023 [Paper] [Code]
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, ICLR 2023 [Paper] [Code] (a minimal round-to-nearest baseline sketch follows this list)
- RPTQ: Reorder-based Post-training Quantization for Large Language Models, Arxiv 2023 [Paper] [Code]
- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models, ACL 2023 [Paper]
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling, Arxiv 2023 [Paper]
- Quantized Distributed Training of Large Models with Convergence Guarantees, Arxiv 2023 [Paper]
- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation, Arxiv 2023 [Paper] [Code]
- QLoRA: Efficient Finetuning of Quantized LLMs, Arxiv 2023 [Paper] [Code]
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models, Arxiv 2023 [Paper]
- The Quantization Model of Neural Scaling, Arxiv 2023 [Paper]
- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization, Arxiv 2023 [Paper]
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt, Arxiv 2023 [Paper]
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration, Arxiv 2023 [Paper] [Code]
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models, Arxiv 2023 [Paper] [Code]
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression, Arxiv 2023 [Paper] [Code]
- OWQ: Lessons learned from activation outliers for weight quantization in large language models, Arxiv 2023 [Paper]
- SqueezeLLM: Dense-and-Sparse Quantization, Arxiv 2023 [Paper] [Code]
- INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation, Arxiv 2023 [Paper]
- INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers, Arxiv 2023 [Paper] [Code]
- QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models, Arxiv 2023 [Paper] [Code]
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study, Arxiv 2023 [Paper]
- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats, Arxiv 2023 [Paper] [Code (DeepSpeed)]
- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization, ISCA 2023 [Paper]
- QuIP: 2-Bit Quantization of Large Language Models With Guarantees, Arxiv 2023 [Paper] [Code]
- NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search, Arxiv 2023 [Paper]
- GPT-Zip: Deep Compression of Finetuned Large Language Models, ICML 2023 Workshop ES-FoMO [Paper]
- Generating Efficient Kernels for Quantized Inference on Large Language Models, ICML 2023 Workshop ES-FoMO [Paper]
- Gradient-Based Post-Training Quantization: Challenging the Status Quo, Arxiv 2023 [Paper]
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs, Arxiv 2023 [Paper]
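Many of the post-training methods above (GPTQ, AWQ, ZeroQuant, SpQR, etc.) improve on plain round-to-nearest weight quantization. As a point of reference, here is a minimal, framework-free sketch of symmetric per-output-channel absmax INT4 round-to-nearest quantization; it is illustrative only, not the algorithm of any single paper in this list.

```python
import numpy as np

def quantize_rtn_int4(w: np.ndarray):
    """Symmetric per-output-channel absmax round-to-nearest quantization to 4 bits.

    w: weight matrix of shape (out_features, in_features).
    Returns integer codes plus the per-row scales needed to dequantize.
    """
    qmax = 7  # signed 4-bit range is [-8, 7]; use +/-7 for a symmetric grid
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per output row
    scale = np.where(scale == 0, 1.0, scale)              # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy usage: quantize a random "layer" and measure the reconstruction error.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
q, s = quantize_rtn_int4(w)
print("mean abs error:", np.abs(w - dequantize(q, s)).mean())
```

Methods such as GPTQ and AWQ keep essentially this storage format but choose the rounding and scaling more carefully (for example, using approximate second-order information or activation statistics) to reduce the resulting output error.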
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers, ICLR 2023 [Paper]
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time, ICML 2023 [Paper] [Code]
- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation, ICML 2023 [Paper] [Code]
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot, Arxiv 2023 [Paper] [Code]
- LLM-Pruner: On the Structural Pruning of Large Language Models, Arxiv 2023 [Paper] [Code]
- Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models, ICLR 2023 TinyPapers [Paper]
- Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering, Arxiv 2023 [Paper] [Code]
- Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale, Arxiv 2023 [Paper] [Code]
- A Simple and Effective Pruning Approach for Large Language Models, Arxiv 2023 [Paper] [Code] (a minimal sketch of its pruning metric follows this list)
- Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning, Arxiv 2023 [Paper]
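As referenced above, here is a minimal sketch of the weight-magnitude-times-activation-norm score used by "A Simple and Effective Pruning Approach for Large Language Models" (Wanda), assuming a linear layer with weight `W` of shape (out, in) and a batch of calibration activations `X`. This is a simplified per-row unstructured variant for illustration, not the authors' full implementation.

```python
import numpy as np

def wanda_prune(W: np.ndarray, X: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Unstructured pruning with the |weight| * ||activation|| score.

    W: weights of shape (out_features, in_features).
    X: calibration activations of shape (num_tokens, in_features).
    Returns a pruned copy of W with `sparsity` of each output row set to zero.
    """
    act_norm = np.linalg.norm(X, axis=0)          # ||X_j||_2 for each input feature j
    score = np.abs(W) * act_norm[None, :]         # score_ij = |W_ij| * ||X_j||_2
    k = int(W.shape[1] * sparsity)                # weights to drop per output row
    pruned = W.copy()
    idx = np.argsort(score, axis=1)[:, :k]        # k lowest-scoring weights in each row
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned

# Toy usage
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32)).astype(np.float32)
X = rng.normal(size=(128, 32)).astype(np.float32)
W_sparse = wanda_prune(W, X, sparsity=0.5)
print("per-row sparsity:", (W_sparse == 0).mean(axis=1))
```

SparseGPT replaces this scoring with an approximate second-order criterion and additionally updates the surviving weights; both approaches work one layer at a time without retraining.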
- Lifting the Curse of Capacity Gap in Distilling Language Models, ACL 2023 [Paper] [Code]
- Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step, ACL 2023 [Paper]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes, ACL 2023 [Paper]
- SCOTT: Self-Consistent Chain-of-Thought Distillation, ACL 2023 [Paper]
- DISCO: Distilling Counterfactuals with Large Language Models, ACL 2023 [Paper] [Code]
- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions, Arxiv 2023 [Paper] [Code]
- Large Language Model Distillation Doesn't Need a Teacher, Arxiv 2023 [Paper] [Code]
- The False Promise of Imitating Proprietary LLMs, Arxiv 2023 [Paper]
- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo, Arxiv 2023 [Paper] [Code]
- PaD: Program-aided Distillation Specializes Large Models in Reasoning, Arxiv 2023 [Paper]
- Knowledge Distillation of Large Language Models, Arxiv 2023 [Paper] [Code] (a generic logit-distillation sketch follows this list)
- GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models, Arxiv 2023 [Paper]
- Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction, Arxiv 2023 [Paper]
- Task-agnostic Distillation of Encoder-Decoder Language Models, Arxiv 2023 [Paper]
- Lion: Adversarial Distillation of Closed-Source Large Language Model, Arxiv 2023 [Paper] [Code]
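Much of the white-box distillation work above builds on the classic soft-label objective: match the student's token distribution to the teacher's. The sketch below shows that generic temperature-scaled KL loss in PyTorch; it is a baseline illustration, not the specific objective of any paper listed here (GKD and "Knowledge Distillation of Large Language Models", for instance, change the direction and sampling of the KL term).

```python
import torch
import torch.nn.functional as F

def soft_label_kd_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 2.0) -> torch.Tensor:
    """Token-level distillation loss: KL(teacher || student) with temperature scaling.

    Both logit tensors have shape (batch, seq_len, vocab_size).
    """
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, rescaled by t^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)

# Toy usage with random logits
student = torch.randn(2, 5, 100, requires_grad=True)
teacher = torch.randn(2, 5, 100)
loss = soft_label_kd_loss(student, teacher)
loss.backward()
print(float(loss))
```

Black-box approaches such as LaMini-LM and GPT4All skip the teacher logits entirely and fine-tune the student on teacher-generated instruction-response pairs.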
- Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning, ACL 2023 [Paper] [Code]
- Efficient Prompting via Dynamic In-Context Learning, Arxiv 2023 [Paper]
- Learning to Compress Prompts with Gist Tokens, Arxiv 2023 [Paper] [Code]
- Batch Prompting: Efficient Inference with Large Language Model APIs, Arxiv 2023 [Paper] [Code] (a schematic batching sketch follows this list)
- Adapting Language Models to Compress Contexts, Arxiv 2023 [Paper] [Code]
- In-context Autoencoder for Context Compression in a Large Language Model, Arxiv 2023 [Paper]
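As noted in the Batch Prompting entry, the core idea is to amortize the fixed prompt (instructions plus demonstrations) over several queries by asking and answering them in one API call. A schematic sketch is below; `call_llm` is a hypothetical stand-in for whatever completion API is actually being used, and the answer format is an assumed convention.

```python
import re
from typing import Callable, List

def batch_prompt(questions: List[str], instructions: str,
                 call_llm: Callable[[str], str]) -> List[str]:
    """Pack several questions into one prompt and split the numbered answers back out.

    `call_llm` is a placeholder: it takes a prompt string and returns the model's text.
    """
    numbered = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(questions))
    prompt = (
        f"{instructions}\n\n{numbered}\n\n"
        "Answer each question on its own line, formatted as 'A<number>: <answer>'."
    )
    response = call_llm(prompt)
    answers = dict(re.findall(r"A(\d+):\s*(.*)", response))
    return [answers.get(str(i + 1), "") for i in range(len(questions))]

# Toy usage with a fake model that returns canned answers
fake_llm = lambda prompt: "A1: 4\nA2: 9"
print(batch_prompt(["What is 2+2?", "What is 3*3?"], "Answer briefly.", fake_llm))
```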
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition, Arxiv 2023 [Paper]
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, Arxiv 2023 [Paper]
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference, Arxiv 2023 [Paper]
- Scaling In-Context Demonstrations with Structured Attention, Arxiv 2023 [Paper]
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline, Arxiv 2023 [Paper] [Code] (a length-bucketed scheduling sketch follows this list)
- Text Alignment Is An Efficient Unified Model for Massive NLP Tasks, Arxiv 2023 [Paper] [Code]
- CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models, Arxiv 2023 [Paper]
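The "Response Length Perception and Sequence Scheduling" entry above rests on a simple systems observation: if you can predict how long each response will be, batching requests of similar length wastes far less compute on padding and on sequences that finish early. A schematic sketch of that scheduling step, where `predict_length` is a hypothetical placeholder for a learned or heuristic length estimator:

```python
from typing import Callable, List

def schedule_by_length(prompts: List[str],
                       predict_length: Callable[[str], int],
                       batch_size: int = 4) -> List[List[str]]:
    """Group prompts into batches of similar predicted response length.

    Sorting by the predicted length before batching keeps each batch's padding
    (and early-finish waste) small; the estimator itself is assumed, not provided.
    """
    ranked = sorted(prompts, key=predict_length)
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

# Toy usage with a trivial heuristic: longer prompts tend to get longer answers.
prompts = ["hi", "summarize this very long article ...", "ok?", "write an essay on compression"]
print(schedule_by_length(prompts, predict_length=len, batch_size=2))
```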
- BMCook: Model compression toolkit for big models [Code]
- llama.cpp: Inference of the LLaMA model in pure C/C++ [Code]
- LangChain: Building applications with LLMs through composability [Code]
- GPTQ-for-LLaMA: 4-bit quantization of LLaMA using GPTQ [Code]
- Alpaca-CoT: An instruction fine-tuning platform with instruction data collection and a unified interface for large language models [Code]
- vllm: A high-throughput and memory-efficient inference and serving engine for LLMs [Code]
- LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT + SFT + RLHF with QLoRA) [Code]
- Efficient-Tuning-LLMs: Efficient fine-tuning of LLMs with QLoRA (LLaMA, BLOOM, baichuan-7B, GLM) [Code]
- bitsandbytes: 8-bit CUDA functions for PyTorch [Code] (a hedged 4-bit loading sketch follows this list)
- ExLlama: A more memory-efficient rewrite of the HF Transformers implementation of LLaMA for use with quantized weights [Code]
- lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. [Code]
- Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training. [Code]
- llama.onnx: LLaMA/RWKV ONNX models, quantization, and test cases [Code]
- fastLLaMa: An experimental high-performance framework for running decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend [Code]
- Sparsebit: A model compression and acceleration toolbox based on PyTorch [Code]
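As flagged in the bitsandbytes entry, a common way to use its kernels is through the Hugging Face Transformers integration. Below is a minimal sketch of QLoRA-style NF4 4-bit loading; the model name is a placeholder and exact argument names can shift between library versions, so treat it as an illustration rather than a canonical recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder: any causal LM on the Hub

# NF4 4-bit weight quantization with bf16 compute, as popularized by QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread quantized weights across available devices
)

inputs = tokenizer("Model compression is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same quantized base model can then be wrapped with PEFT/LoRA adapters for fine-tuning, which is the workflow several of the tuning tools above (LLaMA Efficient Tuning, Efficient-Tuning-LLMs) automate.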