# Pruning Criteria

Network pruning can be phrased as a form of neural architecture search. Such a search may involve training and evaluating many randomly sampled subnetworks (leave-some-out approaches), but this is quite expensive for large models. More efficient approaches instead assign an importance score to each weight, filter, or block as the pruning criterion; see the following table.

| Criterion | Explanation |
| --- | --- |
| Magnitude | $L_1$ or $L_2$ norm of the weights |
| Taylor | Taylor expansion including 1st- or 2nd-order (i.e., Hessian) information |
| Wanda | Importance score as the elementwise product of the weight magnitude and the norm of the input activations: $\mid W_{ij} \mid \times \parallel X_i \parallel_2$ |
| RIA | Relative Importance and Activations: $\left(\frac{\mid W_{ij}\mid}{\sum\mid W_{*j}\mid}+\frac{\mid W_{ij}\mid}{\sum\mid W_{i*}\mid}\right) \times \left(\parallel X_i \parallel_2\right)^a$ |
| Geometric Median | Considers the relationship between filters; filters closest to the geometric median of all filters are regarded as redundant and pruned |
| KL-Divergence | Kullback–Leibler divergence to measure how the output probability distribution changes when a component is pruned |
| Perplexity (PPL) | Remove each Transformer block and monitor its influence on PPL: $I_{\mathrm{PPL}}^n = \exp\left(-\frac{1}{SL}\sum_s\sum_l \log p_{\theta^n}\left(x_l^{(s)} \mid x_{<l}^{(s)}\right)\right)$, derived from the next-token prediction loss; only a forward pass is required |
| Cosine Similarity | Cosine similarity between activations before and after a block / attention / MLP layer, a scale-invariant metric for the degree of transformation performed by each component: $BI_i = 1 - \mathbb{E}_{X,t}\left(\frac{X_{i,t}^{T} X_{i+1,t}}{\parallel X_{i,t}\parallel_2 \parallel X_{i+1,t}\parallel_2}\right)$ |
| Trainable | Trainable importance score that changes dynamically during training |
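To make the first rows of the table concrete, below is a minimal sketch of how Magnitude-, Wanda-, and RIA-style scores could be computed for a single linear layer. It assumes PyTorch, a weight matrix of shape `(out_features, in_features)`, and a small calibration batch of input activations; the function names and the simple threshold-based `prune_by_score` helper are illustrative, not taken from the referenced methods.

```python
# Sketch of per-weight importance scores for unstructured pruning of a linear
# layer (assumption: PyTorch, W of shape (out_features, in_features), and X a
# calibration batch of input activations of shape (num_tokens, in_features)).
import torch


def magnitude_score(W: torch.Tensor) -> torch.Tensor:
    """Magnitude criterion: importance of each weight is |W_ij|."""
    return W.abs()


def wanda_score(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Wanda-style criterion: |W_ij| times the L2 norm of the matching
    input-activation feature, computed over the calibration tokens."""
    act_norm = X.norm(p=2, dim=0)            # (in_features,)
    return W.abs() * act_norm.unsqueeze(0)   # broadcast over output rows


def ria_score(W: torch.Tensor, X: torch.Tensor, a: float = 0.5) -> torch.Tensor:
    """RIA-style criterion: relative importance of |W_ij| within its row and
    its column, scaled by the activation norm raised to the power a."""
    absW = W.abs()
    rel = absW / absW.sum(dim=0, keepdim=True) + absW / absW.sum(dim=1, keepdim=True)
    act_norm = X.norm(p=2, dim=0)
    return rel * act_norm.unsqueeze(0).pow(a)


def prune_by_score(W: torch.Tensor, score: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the lowest scores
    (illustrative helper, not part of any of the referenced methods)."""
    k = int(sparsity * W.numel())
    threshold = score.flatten().kthvalue(k).values if k > 0 else -float("inf")
    return W * (score > threshold)


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(8, 16)     # toy weights (out_features, in_features)
    X = torch.randn(128, 16)   # toy calibration activations (tokens, in_features)
    W_pruned = prune_by_score(W, wanda_score(W, X), sparsity=0.5)
    print(f"kept {(W_pruned != 0).float().mean():.2%} of the weights")
```

In this sketch the activation norm is taken per input feature over the calibration tokens and broadcast across the output rows, which is one common way to read the $\mid W_{ij}\mid \times \parallel X \parallel_2$ products in the table; block-level criteria such as PPL influence or cosine similarity would instead score and remove entire blocks rather than individual weights.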