Network pruning reduces the size and complexity of a machine learning model by removing unnecessary or redundant parts. The goals are a smaller memory footprint, faster training or inference, and lower power consumption, all with negligible accuracy loss. Pruning methods can be subdivided into those that promote unstructured, structured, or semi-structured (i.e., blocked) sparsity; see the survey for a review.
Unstructured pruning, also termed weight pruning, removes individual weights from the network. Such methods lead to irregular, sparse weight matrices (see the right sub-figure of the figure above, where arbitrary weights can be pruned). This adds overhead for index structures and leads to less efficient execution on hardware that is optimized for dense computation. However, unstructured pruning “is very fine-grained and makes pruning particularly powerful.” Thus, [inference engines](Hardware Support for Sparsity) have been developed to bring weight sparsity to AI accelerators.
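To make the index overhead concrete, here is a minimal sketch of unstructured pruning in plain Python. It assumes magnitude-based selection (dropping the smallest absolute values), which is only one possible criterion, and the helper names `magnitude_prune` and `to_csr` are hypothetical. The pruned matrix is then stored in a CSR-like layout to show the extra index arrays that irregular sparsity requires.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_cols = len(weights[0])
    flat = [w for row in weights for w in row]
    k = int(len(flat) * sparsity)  # number of individual weights to remove
    # Positions of the k smallest-magnitude weights, anywhere in the matrix.
    drop = set(sorted(range(len(flat)), key=lambda i: abs(flat[i]))[:k])
    kept = [0.0 if i in drop else w for i, w in enumerate(flat)]
    return [kept[r * n_cols:(r + 1) * n_cols] for r in range(len(weights))]


def to_csr(matrix):
    """Store only the nonzeros, plus the index arrays needed to locate them."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))  # end offset of this row in `values`
    return values, col_idx, row_ptr


# Toy 3x4 weight matrix (illustrative values only).
W = [
    [0.9, -0.1, 0.05, 0.4],
    [-0.02, 0.7, -0.3, 0.01],
    [0.2, -0.6, 0.03, -0.8],
]
W_pruned = magnitude_prune(W, sparsity=0.5)
values, col_idx, row_ptr = to_csr(W_pruned)
print(W_pruned)
print(values, col_idx, row_ptr)
```

In this toy case, six surviving weights need six column indices and four row offsets, so the metadata is larger than the payload. Real formats and inference engines differ in the details, but the trade-off has the same shape.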
Filter pruning is a form of structured pruning that removes entire filters from the network’s layers. It often achieves practical network compression and significant acceleration because the feature maps of the removed filters are no longer computed.
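The following sketch illustrates the idea on a toy convolution layer stored as nested lists of shape `[out_channels][in_channels][kh][kw]`. Ranking filters by their L1 norm is an assumption made for illustration (it is a common heuristic, not the only one), and `l1_norm` and `prune_filters` are hypothetical helper names.

```python
def l1_norm(filter_weights):
    """Sum of absolute values over one filter of shape [in_ch][kh][kw]."""
    return sum(abs(w) for channel in filter_weights for row in channel for w in row)


def prune_filters(conv_weights, n_keep):
    """Keep the n_keep filters with the largest L1 norm, in their original order."""
    norms = [l1_norm(f) for f in conv_weights]
    ranked = sorted(range(len(conv_weights)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    return [conv_weights[i] for i in keep], keep


# Toy layer: 4 filters, 1 input channel, 2x2 kernels (illustrative values only).
conv = [
    [[[0.5, -0.5], [0.5, -0.5]]],   # large norm
    [[[0.01, 0.0], [0.0, -0.01]]],  # tiny norm, likely redundant
    [[[1.0, 1.0], [-1.0, 1.0]]],    # largest norm
    [[[0.1, 0.0], [0.0, 0.1]]],     # small norm
]
pruned_conv, kept = prune_filters(conv, n_keep=2)
print(kept, len(pruned_conv))
```

The result is a smaller but still dense weight tensor, which is why no sparse index structures are needed. In a real network, the input channels of the following layer that correspond to the removed filters would have to be dropped as well.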
Semi-structured pruning removes weights from the network in a block-wise manner.
For example, the NVIDIA A100 GPU adds support for fine-grained structured sparsity (i.e., semi-structured sparsity) to its Tensor Cores.
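As an illustration of a fine-grained structured pattern, the sketch below enforces N:M sparsity with N=2 and M=4, the 2:4 pattern that NVIDIA documents for A100 Sparse Tensor Cores: in every group of four consecutive weights, at most two remain nonzero. Choosing the survivors by magnitude is an assumption, `prune_n_of_m` is a hypothetical helper, and the hardware's compressed storage format is not modeled here.

```python
def prune_n_of_m(row, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive weights."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions (within the group) of the n weights to keep.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out


# One row of weights, i.e. two groups of four (illustrative values only).
row = [0.9, -0.1, 0.05, 0.4, -0.02, 0.7, -0.3, 0.01]
sparse_row = prune_n_of_m(row)
print(sparse_row)
```

Because every group has the same number of nonzeros, the position of each kept weight can be encoded in a few bits per group, which is far more regular than the general index structures required for unstructured sparsity.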
Sparse Tensor Cores accelerate a
The following figure shows some popular inference engines that bring weight sparsity to AI accelerators.