An all-in-one repository of awesome LLM pruning papers, integrating all useful resources and insights.

liyunqianggyn/Awesome-LLMs-Pruning

Awesome LLMs Pruning

This repository integrates useful resources for papers on pruning large models: a one-sentence take-away summary, explanation notes (such as the paper's challenges), blogs or videos, paper tags, source code links, and venue.

Please feel free to submit a pull request or open an issue to add papers.

🔥 Keep updating... Please star it if you find it helpful:)

Tags of Pruning

Clicking a badge, such as Budget, takes you to the corresponding explanation file.

| Type | Criteria | Budget | Data | Retrain | Weights |
| --- | --- | --- | --- | --- | --- |
| Unstructured | Magnitude | Sparsity, e.g. layer or global | Data-free | Without | Frozen |
| Structured, e.g. Channel, Layer/Depth | Taylor, e.g. Hessian | FLOPs | Calibration | Efficient, e.g. LoRA | Update |
| Semi-structured | Fisher | Latency | Small | Extensive | - |
| Other | Trainable | Energy | Medium | Scratch | - |
| - | Other | Other | Large | Other | - |
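
To make the Type tags more concrete, here is a minimal sketch that contrasts an unstructured magnitude mask with a semi-structured N:M mask on a single weight row. The function names and the toy weights are illustrative assumptions and are not taken from any listed paper or library.

```python
def unstructured_mask(weights, sparsity):
    """Unstructured: drop the smallest-magnitude weights anywhere in the row."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0 if i in drop else 1 for i in range(len(weights))]


def n_m_mask(weights, n, m):
    """Semi-structured N:M: keep the n largest-magnitude weights
    in every group of m consecutive weights."""
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        order = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        mask.extend(1 if i in keep else 0 for i in range(len(group)))
    return mask


# Toy weight row (hypothetical values).
row = [0.9, -0.8, 0.7, 0.6, 0.05, -0.1, 0.2, 0.01]

print(unstructured_mask(row, 0.5))  # [1, 1, 1, 1, 0, 0, 0, 0]
print(n_m_mask(row, 2, 4))          # [1, 1, 0, 0, 0, 1, 1, 0]
```

Both masks reach 50% sparsity, but the 2:4 mask spreads the zeros evenly across groups, which is the regular pattern that sparse hardware kernels can exploit.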

2025

Title & Take-away
Categorization
Note
Code
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Delete certain layers, i.e., transformer blocks (each block consists of an Attention and an MLP), from LLMs based on the Block Influence (BI) score, a novel metric that assesses how much each layer transforms the hidden states. Layers in LLMs may be more redundant than expected.
Challenge PyTorch
The Unreasonable Ineffectiveness of the Deeper Layers
A simple layer/depth pruning method that removes n contiguous layers from popular families of open-weight pretrained LLMs by minimizing the angular distance between layers' representations. A parameter-efficient finetuning method is applied to reduce the computational cost of finetuning.
Challenge PyTorch
What Matters in Transformers? Not All Attention is Needed
Explore the redundancy in three key Transformer components: Block, MLP, and Attention, where one Block = Attention + MLP (see ShortGPT). "Block/MLP Drop" leads to significant performance degradation. A fine-grained "Attention Drop" has minimal impact on model accuracy and alleviates the memory overhead of the KV cache. A similarity-based metric is used to evaluate each component's importance.
Reviews PyTorch
Rethinking the Impact of Heterogeneous Sublayers in Transformers
Instead of pruning entire coarse-grained transformer blocks, this paper proposes a finer-grained depth pruning method that treats each transformer block as two sublayers, i.e., Multi-Head Attention (MHA) and MLP, and prunes at the sublayer level.
- -
Streamlining Redundant Layers to Compress Large Language Models
LLM-Streamline comprises two components: layer pruning and layer replacement. First, certain contiguous redundant layers are pruned from the LLM based on a cosine-similarity importance metric; then, a lightweight network is trained on a small subset of SlimPajama to replace the pruned layers and restore the model's performance.
Challenge Reviews -
Reassessing Layer Pruning in LLMs: New Insights and Methods
Validate seven different layer selection metrics, including Random, Reverse-order, Magnitude, Taylor, Perplexity, and Cosine Similarity (BI). Reverse-order pruning is simple yet effective. LoRA performs worse than simple partial-layer fine-tuning. Iterative pruning offers no benefit compared to one-shot pruning.
Challenge Reviews PyTorch
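
Several entries above rank layers by how little they change the hidden states. The following is a minimal sketch of two such similarity scores, assuming the usual formulations: a BI-style score of one minus the mean cosine similarity between a block's input and output hidden states, and an angular distance of arccos(cosine similarity) / π. The function names and toy hidden states are illustrative assumptions, not code from the papers.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_influence(hidden_in, hidden_out):
    """BI-style score: 1 - mean cosine similarity over tokens.
    A score near 0 means the block barely changes the hidden states."""
    sims = [cosine_similarity(x, y) for x, y in zip(hidden_in, hidden_out)]
    return 1.0 - sum(sims) / len(sims)


def angular_distance(u, v):
    """Angular distance in [0, 1]: arccos(cosine similarity) / pi."""
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.acos(cos) / math.pi


def rank_blocks(hidden_states):
    """hidden_states[i] holds the token vectors entering block i;
    the last element is the output of the final block.
    Returns block indices from least to most influential, plus the scores."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order, scores


# Toy hidden states for 3 blocks, 2 tokens, 2 dimensions (hypothetical values).
states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to block 0
    [[1.0, 0.0], [0.0, 1.0]],  # block 0 acts as identity
    [[0.0, 1.0], [1.0, 0.0]],  # block 1 rotates each token by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # block 2 only rescales
]
order, scores = rank_blocks(states)
print(order)   # [0, 2, 1]
print(scores)  # [0.0, 1.0, 0.0]
```

In this toy case blocks 0 and 2 leave the direction of every hidden state unchanged, so they would be the first candidates for removal, while block 1 has the largest score.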

2024

Title & Take-away
Categorization
Note
Code
Shortened LLaMA: A Simple Depth Pruning for Large Language Models
First identify unimportant Transformer blocks (bigger, coarse units), then perform one-shot pruning with Perplexity (PPL) as the pruning criterion, followed by light LoRA retraining. Shows fast inference and good zero-shot capabilities.
Challenge PyTorch
A deeper look at depth pruning of LLMs
This work explores different block importance metrics, including cosine similarity, relative L1/L2, and Shapley-value-based metrics, to take a deeper look at depth pruning of LLMs. It further examines the impact of dropping individual Attention and MLP layers. Two simple performance recovery techniques are applied on a calibration dataset.
- PyTorch
Compact Language Models via Pruning and Knowledge Distillation
Prune LLMs structurally along different axes such as layer, neuron, head, and embedding channel, similar to NAS, which searches over different dimensions. The difference lies in the search space: pruning uses a pre-trained large model as the search space (simpler), while NAS searches over a manually pre-defined search space (more complex) from scratch. Different proxy importance scores are estimated for depth and width pruning. Retraining with knowledge distillation requires up to 40x fewer training tokens.
Challenge
Blog
PyTorch
Keyformer: KV Cache reduction through attention sparsification for Efficient Generative Inference
Keyformer, a successor to H2O (see below), uses a Gumbel softmax-based score function, instead of relying solely on attention scores as in H2O, to dynamically identify and retain the top-k key tokens and reduce KV cache size. A sliding window drawn from Sparse Transformer is used to retain (not prune) w recent representative tokens, yielding a mixture of recent and key tokens.
Challenge
Blog
Summary
PyTorch
A Simple and Effective Pruning Approach for Large Language Models
A pruning metric termed Wanda that considers both weight magnitudes and input activation norms to prune weights on a per-output basis instead of layer-wise, requiring no retraining or weight update. A simplified version of SparseGPT.
Challenge
Blog
Reviews
PyTorch
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Quantize every single parameter of the LLM to 1.58 bits, i.e., a ternary value of -1, 0, or +1. This can be viewed as 1-bit binarization (-1 or +1) combined with unstructured pruning (0). When trained from scratch with 1.58-bit weights and 8-bit activations, it matches the full-precision Transformer LLM with the same model size and training tokens.
Challenge
Discussion
PyTorch
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Adaptively allocate the optimal sparsity ratio for each layer within a transformer block by minimizing the block-wise reconstruction error. To do so, a parameter-efficient algorithm is developed that optimizes only a few learnable coefficients, e.g., 100. Pre-trained weights are frozen.
Challenge
Reviews
PyTorch
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
In the first stage, training-aware pruning learns masks that satisfy a specified target by imposing regularization on ~0.4B tokens; the pruned model is then retrained on another ~50B tokens of the RedPajama dataset. A dynamic batch loading method updates the composition of sampled data in each mini-batch across different domains.
Challenge
Blog
Reviews
PyTorch
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Allocate non-uniform sparsity ratios across different layers, guided by the principle that a layer with a higher proportion of outliers should have a lower sparsity, then plug the tailored layer-wise sparsity directly into Wanda and SparseGPT.
Challenge
Reviews
PyTorch
Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
Propose a new pruning criterion named RIA for LLMs. For N:M structures, introduce a column permutation of the score matrix to maximize the total retained weight importance. No retraining.
Challenge
Reviews
PyTorch
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Retrain LLMs' weights with lightweight LoRA, and optimize structured-pruning masks with efficient trainable parameters in a differentiable way on the instruction-tuning Alpaca dataset. A collaborative prompt is used to assist the pruning task.
Challenge
Reviews
PyTorch
Scaling Laws for Sparsely-Connected Foundation Models
Discover a scaling law for weight sparsity that formulates the relationship between weight sparsity, the number of non-zero parameters, and training data size. Reveals that the optimal sparsity increases with more training data, offering insights for improved computational efficiency.
Challenge
Reviews
-
The LLM Surgeon
This paper introduces LLM Surgeon, a method that enhances the efficiency of second-order Hessian-based pruning techniques, such as Optimal Brain Surgeon, by employing Kronecker-factored approximations of the Fisher information matrix. The approach establishes closed-form solutions. Pruning OPT models and Llama-v2-7B by 20%-30% results in a negligible loss in performance.
Challenge
Reviews
PyTorch
Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models
Retraining-free pruning for encoder-based pretrained language models (PLMs) such as BERT, which preserves the knowledge of PLMs through sublayer-wise iterative pruning, from the bottom to the top sublayer.
Challenge
Reviews
PyTorch
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Dynamic Sparse No Training (DSNT) involves iterative pruning-and-growing steps that update only the sparse mask, adapting it by minimizing the reconstruction error (a proxy for perplexity). Enables higher sparsity rates of 60% or 70%. Training-free.
Challenge
Reviews
PyTorch
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
An effective software framework for unstructured SpMM on tensor cores (which do not allow skipping arbitrary element-level computations), leveraging on-chip resources for efficient sparse data extraction and for overlapping computation with memory access. Improves memory bandwidth utilization on GPUs.
- Python/C++
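
The Wanda entry above describes scoring each weight by its magnitude times the norm of the matching input activation, then pruning per output. A minimal sketch of that idea follows, assuming the score |W[i][j]| * ||X[:, j]||_2 computed on a small calibration batch. The helper names and toy numbers are illustrative assumptions, not the official implementation.

```python
import math


def wanda_scores(weight, activations):
    """weight: one row per output neuron, one column per input feature.
    activations: calibration inputs, one row per token.
    Score of weight[i][j] = |weight[i][j]| * L2 norm of input feature j."""
    n_in = len(weight[0])
    norms = [math.sqrt(sum(tok[j] ** 2 for tok in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weight]


def prune_per_output(weight, activations, sparsity):
    """Zero out the lowest-scoring fraction of weights within each output row.
    The remaining weights are left untouched (no weight update)."""
    scores = wanda_scores(weight, activations)
    pruned = []
    for row, row_scores in zip(weight, scores):
        n_prune = int(len(row) * sparsity)
        order = sorted(range(len(row)), key=lambda j: row_scores[j])
        drop = set(order[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


# Toy layer with 2 outputs and 4 inputs (hypothetical values).
W = [[0.5, -2.0, 0.1, 1.0],
     [3.0, 0.2, -0.4, 0.1]]
# Input feature 2 has large activations, feature 1 has tiny ones.
X = [[1.0, 0.1, 10.0, 1.0],
     [1.0, 0.1, 10.0, 1.0]]

pruned = prune_per_output(W, X, 0.5)
print(pruned)  # [[0.0, 0.0, 0.1, 1.0], [3.0, 0.0, -0.4, 0.0]]
```

In the first row, plain magnitude pruning would drop the weight 0.1, whereas the activation-aware score keeps it because its input feature is large, and drops the weight -2.0 whose input is nearly silent.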

2023

Title & Take-away
Categorization
Note
Code
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
During the prompt and generation phases, dynamically prune unimportant tokens based on accumulated attention scores while maintaining a small, constant Key-Value cache (KV cache) size of k tokens.
Challenge PyTorch
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A post-training method for pruning LLMs in one shot without any retraining. Weights are updated by solving a layer-wise weight reconstruction problem.
Challenge
Blog
PyTorch
LLM-Pruner: On the Structural Pruning of Large Language Models
First discover all coupled structures following DepGraph, then estimate the grouped importance of each coupled structure on calibration data, then prune the less important groups, and finally finetune with efficient LoRA on the Alpaca dataset, which consists of 50K instruction-response pairs.
Challenge PyTorch
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
Revisits magnitude pruning and reports several interesting findings on pruning large-scale models. Most results are reported on fine-tuned downstream tasks, except for modern-scale LLMs, where no retraining is performed.
Challenge PyTorch
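
The H2O entry above keeps the KV cache at a fixed size by evicting tokens with low accumulated attention. The sketch below is a simplified illustration of that idea: it greedily evicts the cached token with the lowest accumulated score and never evicts the newest token. It assumes precomputed attention rows and does not renormalize attention after eviction, so it is not the paper's exact policy. All names and numbers are hypothetical.

```python
def heavy_hitter_eviction(attention_rows, budget):
    """attention_rows[t] holds the attention weights of the query at step t
    over tokens 0..t. Returns the token indices left in the cache at the end."""
    accumulated = {}  # cached token index -> accumulated attention score
    for t, row in enumerate(attention_rows):
        accumulated[t] = 0.0  # the new token enters the cache
        for idx in accumulated:
            accumulated[idx] += row[idx]
        if len(accumulated) > budget:
            # Evict the cached token with the lowest accumulated score,
            # excluding the token that was just added.
            candidates = [i for i in accumulated if i != t]
            victim = min(candidates, key=lambda i: accumulated[i])
            del accumulated[victim]
    return sorted(accumulated)


# Toy attention pattern in which token 0 is a heavy hitter (hypothetical values).
rows = [
    [1.0],
    [0.9, 0.1],
    [0.8, 0.1, 0.1],
    [0.7, 0.05, 0.05, 0.2],
    [0.6, 0.0, 0.1, 0.2, 0.1],
]
kept = heavy_hitter_eviction(rows, budget=3)
print(kept)  # [0, 3, 4]
```

With a budget of 3 tokens, token 2 is evicted at step 3 and token 1 at step 4, leaving the heavy hitter (token 0) together with the two most recent tokens.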
