This repository contains code to reproduce the key results of the paper *BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation*.
- `torch`: tested on v2.0.1+cu118
- `transformers`: tested on v4.31.0
- `accelerate`: tested on v0.21.0
- `datasets`: tested on v2.14.4
- `timm`: tested on v0.9.5
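If you want to pin the tested versions exactly, an install command along these lines should work (the CUDA 11.8 wheel index is PyTorch's standard one; nearby versions may also work but are untested here):

```bash
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.31.0 accelerate==0.21.0 datasets==2.14.4 timm==0.9.5
```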
### lm-evaluation-harness

```bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
### Customized CUDA Operator

```bash
cd models/ops
python setup.py install
```
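For orientation, a `setup.py` for a PyTorch CUDA extension typically looks like the sketch below. The module and source file names here are illustrative, not the repository's actual ones; see `models/ops/setup.py` for the real definition.

```python
# Hypothetical sketch of a torch CUDA-extension build script; names are
# illustrative and do not reflect the actual files in models/ops.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="besa_ops",  # illustrative package name
    ext_modules=[
        CUDAExtension(
            name="besa_ops",
            sources=["besa_ops.cpp", "besa_ops_kernel.cu"],  # illustrative sources
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```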
Here is the command to run the baseline experiments, followed by perplexity evaluations on WikiText2, PTB, and C4 as well as zero-shot tasks. See also the CMD-argument documentation.

```bash
bash main_exps.sh
```
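For reference, WikiText2 perplexity is commonly computed with the protocol below (non-overlapping 2048-token segments, mean negative log-likelihood per token). This is a minimal sketch of that standard procedure, not the repository's exact evaluation code; the checkpoint name is a placeholder.

```python
# Sketch of the standard WikiText2 perplexity protocol used by most LLM
# pruning papers; details may differ from this repository's evaluation code.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seqlen = 2048
nsamples = enc.input_ids.shape[1] // seqlen
nlls = []
with torch.no_grad():
    for i in range(nsamples):
        batch = enc.input_ids[:, i * seqlen : (i + 1) * seqlen].to(model.device)
        # labels are shifted inside the model, so .loss is the mean NLL per token
        loss = model(batch, labels=batch).loss
        nlls.append(loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```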
In the experiment section of our paper, we report results for row-wise sparsity, which customizes the sparsity of each row of a target layer's weight within the block. As an extension, we also report results for layer-wise sparsity, where every row of the target layer is assigned the same sparsity. The commands for the layer-wise sparsity experiments can be found in the `main_exps.sh` script. Below, we present the perplexity results on the WikiText2 dataset.
| Method | LLaMA-1-7B | LLaMA-1-13B | LLaMA-1-30B | LLaMA-1-65B | LLaMA-2-7B | LLaMA-2-13B | LLaMA-2-70B |
|---|---|---|---|---|---|---|---|
| Dense | 5.68 | 5.09 | 4.10 | 3.53 | 5.47 | 4.88 | 3.31 |
| SparseGPT | 7.22 | 6.21 | 5.33 | 4.60 | 6.99 | 6.02 | 4.25 |
| Wanda | 7.26 | 6.15 | 5.25 | 4.60 | 6.92 | 5.97 | 4.22 |
| BESA (layer-wise) | 7.04 | 6.07 | 5.16 | 4.51 | 6.77 | 5.85 | 4.14 |
| BESA (row-wise) | 6.86 | 5.92 | 5.00 | 4.33 | 6.60 | 5.75 | 4.09 |
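To make the distinction concrete, the sketch below contrasts the two allocation schemes using plain magnitude pruning: layer-wise allocation keeps the same fraction of weights in every row, while row-wise allocation lets each row keep a different fraction. In BESA the per-row ratios are learned; here they are hard-coded purely for illustration.

```python
# Illustrative contrast of layer-wise vs. row-wise sparsity allocation using
# magnitude pruning; BESA learns the per-row ratios rather than fixing them.
import torch

def layerwise_mask(W: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Layer-wise allocation: every row keeps the same fraction of weights.
    k = int(W.shape[1] * (1 - sparsity))
    idx = W.abs().topk(k, dim=1).indices
    mask = torch.zeros_like(W, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

def rowwise_mask(W: torch.Tensor, row_sparsity) -> torch.Tensor:
    # Row-wise allocation: each row is pruned at its own ratio.
    mask = torch.zeros_like(W, dtype=torch.bool)
    for r, s in enumerate(row_sparsity):
        k = int(W.shape[1] * (1 - s))
        idx = W[r].abs().topk(k).indices
        mask[r, idx] = True
    return mask

W = torch.randn(4, 16)
print(layerwise_mask(W, 0.5).float().mean(dim=1))                 # ~0.5 density in every row
print(rowwise_mask(W, [0.3, 0.5, 0.6, 0.6]).float().mean(dim=1))  # per-row densities differ
```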