One intriguing trait of LLMs is the presence of outlier features: features whose magnitudes are significantly larger than the rest. The OWL paper claims to preserve these outlier features, and the recent Quantizable Transformers paper finds that the outliers are related to the softmax function in attention. See the blog for more details.
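As a rough illustration of what "outlier feature" means in practice, here is a minimal sketch that flags hidden dimensions whose activations exceed a magnitude threshold (the cutoff of 6 is a common convention, e.g. in LLM.int8(); the function name and shapes are assumptions for illustration):

```python
import torch

def find_outlier_features(hidden_states: torch.Tensor, threshold: float = 6.0):
    """Return hidden dimensions whose activation magnitude exceeds `threshold`.

    hidden_states: (batch, seq_len, hidden_dim) activations from one layer.
    """
    # Maximum absolute activation observed per hidden dimension
    max_abs = hidden_states.abs().amax(dim=(0, 1))            # (hidden_dim,)
    outlier_dims = torch.nonzero(max_abs > threshold).flatten()
    return outlier_dims, max_abs[outlier_dims]

# Example with random activations; pronounced outliers only appear in trained LLMs.
h = torch.randn(2, 16, 768)
dims, mags = find_outlier_features(h)
```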
Increasing model size or data brings consistent performance improvements, even at very large scale, and this scaling behavior can be predicted by simple power-law curves.
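A minimal sketch of what "predicted by power-law curves" looks like: fit $L(N) = a N^{-\alpha} + c$ to a few (model size, loss) pairs and extrapolate. The numbers below are hypothetical, not results from any paper:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (parameter count, validation loss) pairs from runs of different sizes.
n_params = np.array([1e7, 1e8, 1e9, 1e10])
losses   = np.array([4.2, 3.5, 2.9, 2.5])

def power_law(n, a, alpha, c):
    # Standard scaling-law form: L(N) = a * N^(-alpha) + c
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, n_params, losses, p0=(10.0, 0.1, 1.5))
print(f"Predicted loss at 1e11 params: {power_law(1e11, a, alpha, c):.2f}")
```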
Layer pruning is a technique that removes entire layers from the model. It is a coarse-grained pruning method, which can be very effective in some cases.
- Dimensional Mismatch Problem: When pruning intermediate layers, the input and output dimensions of subsequent layers may no longer match.
- Current LLM Layer Pruning: Transformer blocks have exactly the same input and output dimensions because of the residual connection, so layer pruning is feasible for LLMs (see the sketch after this list). It does not suit cases where the next layer's new input no longer matches the input it originally expected, such as layer pruning in VGG.
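Here is a minimal sketch of layer pruning for a decoder-style LLM, dropping whole transformer blocks by index. The attribute path `model.model.layers` (LLaMA-style layout in Hugging Face Transformers) and the choice of which indices to drop are assumptions for illustration; adjust them for the actual model class:

```python
import torch.nn as nn

def prune_layers(model: nn.Module, drop_indices: set) -> nn.Module:
    """Remove whole transformer blocks by index.

    Because every block maps (batch, seq, hidden) -> (batch, seq, hidden)
    thanks to the residual connection, the remaining blocks still compose.
    """
    kept = nn.ModuleList(
        block for i, block in enumerate(model.model.layers)
        if i not in drop_indices
    )
    model.model.layers = kept
    model.config.num_hidden_layers = len(kept)  # keep the config consistent
    return model

# e.g. drop the last two blocks of a 32-layer model:
# model = prune_layers(model, drop_indices={30, 31})
```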
In contrast, width pruning is a fine-grained pruning method that removes individual channels or neurons within each layer.
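Below is a minimal sketch of width pruning applied to one feed-forward (MLP) pair of linear layers, keeping the intermediate neurons with the largest weight norms. The layer names `up_proj`/`down_proj` and the norm-based importance score are illustrative assumptions, not a specific method from the papers above:

```python
import torch
import torch.nn as nn

def width_prune_mlp(up_proj: nn.Linear, down_proj: nn.Linear, keep_ratio: float = 0.5):
    """Prune intermediate neurons of an MLP: up_proj (d -> m), down_proj (m -> d).

    Importance of neuron j = L2 norm of its row in up_proj plus the norm of
    the matching column in down_proj (a simple magnitude criterion).
    """
    importance = up_proj.weight.norm(dim=1) + down_proj.weight.norm(dim=0)
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.topk(importance, k).indices.sort().values

    new_up = nn.Linear(up_proj.in_features, k, bias=up_proj.bias is not None)
    new_down = nn.Linear(k, down_proj.out_features, bias=down_proj.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up_proj.weight[keep])        # keep selected rows
        new_down.weight.copy_(down_proj.weight[:, keep]) # keep matching columns
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[keep])
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down
```

Unlike layer pruning, this changes the intermediate width, so both sides of the MLP must be rebuilt together to keep the dimensions consistent.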