Network pruning can be phrased as a form of neural architecture search. Such a search may involve training and evaluating several random subnetworks using leave-some-out approaches, but this can be quite expensive for large models. More efficient approaches instead assign an importance score as the pruning criterion; see the following table (minimal code sketches for several of these criteria appear after it).
Criteria | Explanation |
---|---|
Magnitude | Absolute weight value (or the norm of a group of weights); small-magnitude weights are assumed to be less important |
Taylor | Taylor expansion of the loss including first- or second-order (i.e., Hessian) information |
Wanda | Importance score as the elementwise product between the weight magnitude and the norm of input activations |
RIA | Relative Importance and Activations |
Geometric-Median | Considers the relationships between filters; filters close to the geometric median of all filters are treated as redundant |
KL-Divergence | Kullback–Leibler divergence to measure the shift between the output probability distributions of the full and pruned model |
Perplexity (PPL) | Remove each Transformer block in turn and monitor its influence on PPL |
Cosine Similarity | Cosine similarity between activations before and after a block/attention/MLP layer, used as a scale-invariant metric of how strongly each component transforms its input |
Trainable | A trainable importance score that changes dynamically during training |
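To make the Magnitude and Wanda rows concrete, here is a minimal PyTorch sketch (all function names are illustrative, not from any library) that scores and prunes the weight matrix of a single linear layer. The Wanda score is the elementwise product of the weight magnitude with the per-feature norm of calibration inputs, as in the table; note that the Wanda paper compares scores within each output row, while a single global threshold is used here for brevity.

```python
import torch

def magnitude_scores(weight: torch.Tensor) -> torch.Tensor:
    """Magnitude criterion: each weight's importance is its absolute value."""
    return weight.abs()

def wanda_scores(weight: torch.Tensor, calib_inputs: torch.Tensor) -> torch.Tensor:
    """Wanda criterion: |W_ij| * ||X_j||_2, where ||X_j||_2 is the L2 norm of
    input feature j over all calibration tokens.
    weight: (out_features, in_features); calib_inputs: (num_tokens, in_features).
    """
    feature_norms = calib_inputs.norm(p=2, dim=0)     # (in_features,)
    return weight.abs() * feature_norms.unsqueeze(0)  # broadcast over output rows

def prune_by_score(weight: torch.Tensor, scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the lowest scores
    (assumes 0 < sparsity < 1)."""
    k = int(weight.numel() * sparsity)          # number of weights to remove
    threshold = scores.flatten().kthvalue(k).values
    return weight * (scores > threshold)
```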
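A sketch of the first-order Taylor criterion under the same assumptions: zeroing a weight changes the loss by roughly `grad * weight`, the first term of the Taylor expansion, so its magnitude serves as the importance score (the second-order variant adds Hessian information, omitted here).

```python
import torch

def taylor_scores(weight: torch.Tensor) -> torch.Tensor:
    """First-order Taylor criterion: |grad * weight| approximates the loss
    change caused by zeroing each weight. Requires that loss.backward() has
    already been run on calibration data so that weight.grad is populated."""
    assert weight.grad is not None, "run a backward pass on calibration data first"
    return (weight.grad * weight).abs()
```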
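For the Geometric-Median row, a minimal sketch in the spirit of FPGM-style filter pruning (naming is mine): filters whose summed distance to all other filters is small lie near the geometric median, carry redundant information, and thus score low.

```python
import torch

def geometric_median_scores(filters: torch.Tensor) -> torch.Tensor:
    """filters: (num_filters, d), each filter flattened to a vector.
    A filter close to the geometric median of all filters is well
    approximated by the others, so a low score marks it as prunable."""
    pairwise = torch.cdist(filters, filters)  # (num_filters, num_filters) L2 distances
    return pairwise.sum(dim=1)                # low total distance -> redundant
```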
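The KL-Divergence criterion can be sketched as comparing the output token distributions of the model before and after removing a component; a larger divergence means the component mattered more. This assumes you already have logits from both models on the same calibration tokens.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_importance(full_logits: torch.Tensor, pruned_logits: torch.Tensor) -> float:
    """Mean per-token KL(P_full || P_pruned). Logits: (num_tokens, vocab_size)."""
    log_p = F.log_softmax(full_logits, dim=-1)    # full model, log-probs
    log_q = F.log_softmax(pruned_logits, dim=-1)  # pruned model, log-probs
    return F.kl_div(log_q, log_p, reduction="batchmean", log_target=True).item()
```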
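A sketch of the PPL criterion, assuming a Hugging Face-style causal LM where `model(input_ids, labels=input_ids)` returns a loss and the Transformer blocks live in `model.model.layers` (true for LLaMA-style models). Deep-copying the model per block is wasteful and only for clarity; a real implementation would temporarily swap the block for an identity.

```python
import copy
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids: torch.Tensor) -> float:
    """exp(mean token negative log-likelihood) on one batch of token ids."""
    return math.exp(model(input_ids, labels=input_ids).loss.item())

@torch.no_grad()
def block_ppl_importance(model, input_ids: torch.Tensor) -> list[float]:
    """Drop each Transformer block in turn; the PPL increase is its score."""
    base = perplexity(model, input_ids)
    scores = []
    for i in range(len(model.model.layers)):
        pruned = copy.deepcopy(model)
        pruned.config.use_cache = False  # avoid stale layer indices in the KV cache
        del pruned.model.layers[i]       # remove block i
        scores.append(perplexity(pruned, input_ids) - base)
    return scores
```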
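Finally, the Cosine-Similarity criterion: capture the hidden states entering and leaving a block (e.g., with forward hooks) and measure how far the block rotates them. A similarity near 1 means the block barely transforms its input, so one minus the mean similarity is a natural importance score.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cosine_block_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """hidden_in / hidden_out: (num_tokens, hidden_dim) activations captured
    before and after a block (or attention/MLP sublayer). Scale-invariant:
    only the angle between the two activation vectors matters."""
    cos = F.cosine_similarity(hidden_in, hidden_out, dim=-1)  # per-token similarity
    return (1.0 - cos.mean()).item()  # high value -> strong transformation -> important
```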