This repository integrates useful resources for large-model pruning papers, including a one-sentence take-away summary, explanation notes (e.g., the paper's challenges), blogs or videos, paper tags, source-code links, and venue.

Please feel free to open a pull request or an issue to add papers.

🔥 Continuously updating... Please star this repo if you find it helpful :)

Clicking on a badge will direct you to the corresponding explanation file. The categorization tags are grouped as follows:

| Pruning Structure | Pruning Criterion | Pruning Target | Data | Retraining | Weights |
| --- | --- | --- | --- | --- | --- |
| Unstructured | Magnitude | Sparsity (e.g., layer or global) | Data-free | Without | Frozen |
| Structured (e.g., Channel, Layer/Depth) | Taylor (e.g., Hessian) | FLOPs | Calibration | Efficient (e.g., LoRA) | Update |
| Semi-structured | Fisher | Latency | Small | Extensive | - |
| Other | Trainable | Energy | Medium | Scratch | - |
| - | Other | Other | Large | Other | - |

| Title & Take-away | Categorization | Note | Code |
| --- | --- | --- | --- |
| **ShortGPT: Layers in Large Language Models are More Redundant Than You Expect** Deletes certain layers, i.e., transformer blocks (one block consists of both an Attention and an MLP module), from LLMs based on the Block Influence (BI) score, a novel metric designed to assess how much each layer transforms its hidden states. Layers in LLMs can be more redundant than expected (a minimal sketch of this similarity-based scoring appears below this table). | | Challenge | PyTorch |
| **The Unreasonable Ineffectiveness of the Deeper Layers** A simple layer/depth pruning method that removes n contiguous layers from popular families of open-weight pretrained LLMs by minimizing the angular distance between layers' representations. A parameter-efficient finetuning method is applied to further reduce the computational cost of finetuning after pruning. | | Challenge | PyTorch |
| **What Matters in Transformers? Not All Attention is Needed** Explores the redundancy of three key Transformer components: Block, MLP, and Attention, where one Block = Attention + MLP (see ShortGPT). "Block/MLP Drop" leads to significant performance degradation, whereas a finer-grained "Attention Drop" has minimal impact on model accuracy and alleviates the memory overhead of the KV cache. A similarity-based metric is used to evaluate each component's importance. | | Reviews | PyTorch |
| **Rethinking the Impact of Heterogeneous Sublayers in Transformers** Instead of pruning entire coarse-grained transformer blocks, this paper proposes a finer-grained depth pruning method that prunes sublayers, treating a single transformer block as two sublayers, i.e., Multi-Head Attention (MHA) and MLP. | | - | - |
| **Streamlining Redundant Layers to Compress Large Language Models** LLM-Streamline comprises two components: layer pruning and layer replacement. First, contiguous redundant layers are pruned from the LLM based on a cosine-similarity importance metric; then a lightweight network is trained on a small subset of SlimPajama to replace the pruned layers and restore the model's performance. | | Challenge Reviews | - |
| **Reassessing Layer Pruning in LLMs: New Insights and Methods** Validates seven layer-selection metrics, including Random, Reverse-order, Magnitude, Taylor, Perplexity, and Cosine Similarity (BI). Reverse-order pruning is simple yet effective, LoRA performs worse than simple partial-layer fine-tuning, and iterative pruning offers no benefit over one-shot pruning. | | Challenge Reviews | PyTorch |
| Title & Take-away | Categorization | Note | Code |
| --- | --- | --- | --- |
| **Shortened LLaMA: A Simple Depth Pruning for Large Language Models** First identifies unimportant Transformer blocks (larger, coarser units), then performs one-shot pruning with Perplexity (PPL) as the pruning criterion, followed by light LoRA retraining. Shows fast inference and good zero-shot capabilities. | | Challenge | PyTorch |
| **A deeper look at depth pruning of LLMs** Explores different block-importance metrics, including cosine similarity, relative L1/L2, and Shapley-value-based metrics, to take a deeper look at depth pruning of LLMs. Further examines the impact of dropping individual Attention and MLP layers. Two simple performance-recovery techniques are applied on a calibration dataset. | | - | PyTorch |
| **Compact Language Models via Pruning and Knowledge Distillation** Prunes LLMs structurally along different axes such as layer, neuron, head, and embedding channel, similar to NAS searching over different dimensions. The difference lies in the search space: pruning uses the pre-trained large model as its (simpler) search space, whereas NAS searches a manually pre-defined (more complex) search space from scratch. Different proxy importance scores are estimated for depth and width pruning. Retraining with knowledge distillation requires up to 40x fewer training tokens. | | Challenge Blog | PyTorch |
| **Keyformer: KV Cache reduction through attention sparsification for Efficient Generative Inference** Keyformer, a successor to H2O (see below), uses a Gumbel-softmax-based score function instead of H2O's purely attention-based scores to dynamically identify and retain the top-k key tokens, thereby reducing KV-cache size. A sliding window drawn from Sparse Transformer retains (rather than prunes) the w most recent tokens, yielding a mixture of recent and key tokens. | | Challenge Blog Summary | PyTorch |
| **A Simple and Effective Pruning Approach for Large Language Models** A pruning metric termed Wanda that considers both weight magnitudes and input-activation norms to prune weights on a per-output basis instead of layer-wise, requiring no retraining or weight updates. Can be viewed as a simplified version of SparseGPT (a minimal sketch of this metric appears below this table). | | Challenge Blog Reviews | PyTorch |
| **The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits** Quantizes every single parameter of the LLM to a ternary value (-1, 0, or +1), i.e., 1.58 bits. This can be viewed as 1-bit binarization (-1 or 1) combined with unstructured pruning (0). Trained from scratch with 1.58-bit weights and 8-bit activations, it matches a full-precision Transformer LLM with the same model size and training tokens. | | Challenge Discussion | PyTorch |
| **BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation** Adaptively allocates the optimal sparsity ratio of each layer within a transformer block by minimizing the block-wise reconstruction error. To do so, a parameter-efficient algorithm is developed that optimizes only a few learnable coefficients (e.g., 100). Pre-trained weights are frozen. | | Challenge Reviews | PyTorch |
| **Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning** In the first stage, training-aware pruning learns masks that satisfy a specified target architecture by imposing regularization on ~0.4B tokens; the pruned model is then retrained on another ~5B tokens of the RedPajama dataset. A dynamic batch loading method updates the composition of sampled data per mini-batch across different domains. | | Challenge Blog Reviews | PyTorch |
| **Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity** Allocates non-uniform sparsity ratios across layers, guided by the principle that a layer with a higher proportion of outliers should have lower sparsity, and then plugs the tailored layer-wise sparsity directly into Wanda and SparseGPT. | | Challenge Reviews | PyTorch |
| **Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models** Proposes a new pruning criterion named RIA for LLMs. For N:M structures, introduces a column permutation of the score matrix to maximize the total retained weight importance. No retraining. | | Challenge Reviews | PyTorch |
| **Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models** Retrains the LLM's weights with lightweight LoRA and optimizes structured-pruning masks with a small set of trainable parameters in a differentiable way on the instruction-tuning Alpaca dataset. A collaborative prompt is used to aid the pruning task. | | Challenge Reviews | PyTorch |
| **Scaling Laws for Sparsely-Connected Foundation Models** Discovers a scaling law for weight sparsity, formulating the scaling relationship between weight sparsity, the number of non-zero parameters, and training data size. Reveals that the optimal sparsity increases with more training data, offering insights for improved computational efficiency. | | Challenge Reviews | - |
| **The LLM Surgeon** Introduces LLM Surgeon, a method that improves the efficiency of second-order Hessian-based pruning techniques such as Optimal Brain Surgeon by employing Kronecker-factored approximations of the Fisher information matrix, which yield closed-form solutions. Pruning OPT models and Llama-2-7B by 20%-30% incurs a negligible loss in performance. | | Challenge Reviews | PyTorch |
| **Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models** Retraining-free pruning for encoder-based language models such as BERT that preserves the knowledge of PLMs through sublayer-wise iterative pruning, from the bottom to the top sublayer. | | Challenge Reviews | PyTorch |
| **Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs** Dynamic Sparse No Training (DSNT) performs iterative pruning-and-growing steps that update only the sparse mask, adapting it by minimizing the reconstruction error (a proxy for perplexity). Enables higher sparsity rates of 60% or 70%. Training-free. | | Challenge Reviews | PyTorch |
| **Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity** An effective software framework for unstructured SpMM on tensor cores (which do not allow skipping arbitrary element-level computations), leveraging on-chip resources for efficient sparse-data extraction and computation/memory-access overlapping. Improves memory-bandwidth utilization on GPUs. | | - | Python/C++ |
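
Many entries in this table are calibration-based, retraining-free weight-scoring methods. As a concrete example, here is a minimal, unofficial sketch of Wanda-style per-output scoring (referenced in the Wanda row above): each weight is scored by its magnitude times the L2 norm of the corresponding input feature over calibration tokens, and the lowest-scoring weights in each output row are zeroed. The function name and toy tensors are illustrative only.

```python
import torch

def wanda_prune_linear(weight: torch.Tensor, calib_inputs: torch.Tensor, sparsity: float) -> torch.Tensor:
    """weight: [out_features, in_features]; calib_inputs: [num_tokens, in_features]."""
    feat_norm = calib_inputs.float().norm(p=2, dim=0)    # L2 norm of each input feature
    score = weight.abs() * feat_norm.unsqueeze(0)        # |W_ij| * ||X_j||_2
    k = int(weight.shape[1] * sparsity)                  # weights to drop per output row
    _, idx = torch.topk(score, k, dim=1, largest=False)  # k lowest-scoring weights per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask                                 # no weight update, no retraining

# Toy usage on a random layer with random calibration activations
W = torch.randn(8, 16)
X = torch.randn(128, 16)
W_sparse = wanda_prune_linear(W, X, sparsity=0.5)
print("fraction of zeroed weights:", (W_sparse == 0).float().mean().item())
```

The same skeleton extends to N:M semi-structured sparsity by selecting the lowest-scoring weights within each group of M consecutive inputs rather than within the whole row.
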
| Title & Take-away | Categorization | Note | Code |
| --- | --- | --- | --- |
| **H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models** During the prompt and generation phases, dynamically prunes unimportant tokens based on accumulated attention scores while maintaining a constant, small Key-Value cache (KV cache) of k tokens. | | Challenge | PyTorch |
| **SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot** A post-training method for pruning LLMs in one shot without any retraining. Updates the remaining weights by solving a layer-wise weight-reconstruction problem. | | Challenge Blog | PyTorch |
| **LLM-Pruner: On the Structural Pruning of Large Language Models** First discovers all coupled structures following DepGraph, then estimates the grouped importance of each coupled structure on calibration data, prunes the less important groups, and finally finetunes with efficient LoRA on the Alpaca dataset consisting of 50K instruction-response pairs. | | Challenge | PyTorch |
| **The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter** Revisits magnitude pruning and reports several interesting findings on pruning large-scale models. Most results are reported after fine-tuning on downstream tasks, except for modern-scale LLMs, where no retraining is performed (a minimal sketch of one-shot magnitude pruning appears below this table). | | Challenge | PyTorch |
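
As referenced in the essential-sparsity row above, here is a minimal sketch of the one-shot global magnitude-pruning baseline that such studies revisit, written in plain PyTorch under the assumption of no retraining; the toy model and sparsity level are placeholders.

```python
import torch
import torch.nn as nn

def magnitude_prune_global(model: nn.Module, sparsity: float) -> None:
    """Zero out the globally smallest-magnitude weights across all linear layers."""
    weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(all_scores.numel() * sparsity)
    threshold = torch.kthvalue(all_scores, k).values if k > 0 else all_scores.min() - 1
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).to(w.dtype))  # keep weights above the global threshold

# Toy usage
toy = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
magnitude_prune_global(toy, sparsity=0.5)
zeros = sum((m.weight == 0).sum().item() for m in toy.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in toy.modules() if isinstance(m, nn.Linear))
print(f"global sparsity: {zeros / total:.2f}")
```

Global thresholding lets different layers end up with different sparsity levels; a per-layer (local) variant would instead apply the same ratio to every layer.
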