# Pruning Criteria

Network pruning can be phrased as a form of neural architecture search. Such a search may involve training and evaluating many randomly sampled subnetworks (leave-some-out approaches), but this is quite expensive for large models. More efficient approaches instead assign an importance score to each weight, filter, or block as the pruning criterion; see the following table.

| Criterion | Explanation |
| --- | --- |
| Magnitude | $L_1$ or $L_2$ norm of the weights |
| Taylor | Taylor expansion including 1st- or 2nd-order (i.e., Hessian) information |
| Wanda | Importance score as the elementwise product of the weight magnitude and the norm of the input activations: $\mid W_{ij} \mid \times \parallel X_i \parallel_2$ |
| RIA | Relative Importance and Activations: $\left(\frac{\mid W_{ij}\mid}{\sum\mid W_{*j}\mid}+\frac{\mid W_{ij}\mid}{\sum\mid W_{i*}\mid}\right) \times \left(\parallel X_i \parallel_2\right)^a$ |
| Geometric Median | Considers the relationship between filters; filters closest to the geometric median of all filters are regarded as redundant and pruned |
| KL-Divergence | Kullback–Leibler divergence to measure how the output probability distribution changes when a component is pruned |
| Perplexity (PPL) | Remove each Transformer block and monitor its influence on PPL: $I_{\mathrm{PPL}}^n = \exp\left(-\frac{1}{SL}\sum_s\sum_l \log p_{\theta^n}\left(x_l^{(s)} \mid x_{<l}^{(s)}\right)\right)$, derived from the next-token prediction loss; only a forward pass is required |
| Cosine Similarity | Cosine similarity between activations before and after a block / attention / MLP layer, a scale-invariant metric for the degree of transformation performed by each component: $BI_i = 1 - \mathbb{E}_{X,t}\left(\frac{X_{i,t}^{T} X_{i+1,t}}{\parallel X_{i,t}\parallel_2 \parallel X_{i+1,t}\parallel_2}\right)$ |
| Trainable | Trainable importance score that changes dynamically during training |
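To make the first rows of the table concrete, below is a minimal sketch of how Magnitude-, Wanda-, and RIA-style scores could be computed for a single linear layer. It assumes PyTorch, a weight matrix of shape `(out_features, in_features)`, and a small calibration batch of input activations; the function names and the simple threshold-based `prune_by_score` helper are illustrative, not taken from the referenced methods.

```python
# Sketch of per-weight importance scores for unstructured pruning of a linear
# layer (assumption: PyTorch, W of shape (out_features, in_features), and X a
# calibration batch of input activations of shape (num_tokens, in_features)).
import torch


def magnitude_score(W: torch.Tensor) -> torch.Tensor:
    """Magnitude criterion: importance of each weight is |W_ij|."""
    return W.abs()


def wanda_score(W: torch.Tensor, X: torch.Tensor) -> torch.Tensor:
    """Wanda-style criterion: |W_ij| times the L2 norm of the matching
    input-activation feature, computed over the calibration tokens."""
    act_norm = X.norm(p=2, dim=0)            # (in_features,)
    return W.abs() * act_norm.unsqueeze(0)   # broadcast over output rows


def ria_score(W: torch.Tensor, X: torch.Tensor, a: float = 0.5) -> torch.Tensor:
    """RIA-style criterion: relative importance of |W_ij| within its row and
    its column, scaled by the activation norm raised to the power a."""
    absW = W.abs()
    rel = absW / absW.sum(dim=0, keepdim=True) + absW / absW.sum(dim=1, keepdim=True)
    act_norm = X.norm(p=2, dim=0)
    return rel * act_norm.unsqueeze(0).pow(a)


def prune_by_score(W: torch.Tensor, score: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the lowest scores
    (illustrative helper, not part of any of the referenced methods)."""
    k = int(sparsity * W.numel())
    threshold = score.flatten().kthvalue(k).values if k > 0 else -float("inf")
    return W * (score > threshold)


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(8, 16)     # toy weights (out_features, in_features)
    X = torch.randn(128, 16)   # toy calibration activations (tokens, in_features)
    W_pruned = prune_by_score(W, wanda_score(W, X), sparsity=0.5)
    print(f"kept {(W_pruned != 0).float().mean():.2%} of the weights")
```

In this sketch the activation norm is taken per input feature over the calibration tokens and broadcast across the output rows, which is one common way to read the $\mid W_{ij}\mid \times \parallel X \parallel_2$ products in the table; block-level criteria such as PPL influence or cosine similarity would instead score and remove entire blocks rather than individual weights.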