- Model compression as constrained optimization, with application to neural nets. Part I: general framework
- Model compression as constrained optimization, with application to neural nets. Part II: quantization
- A Survey of Model Compression and Acceleration for Deep Neural Networks
- Dynamic Capacity Networks
- ResNeXt: Aggregated Residual Transformations for Deep Neural Networks
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
- Xception: Deep Learning with Depthwise Separable Convolutions
- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
- ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
- Residual Attention Network for Image Classification
- SEP-Nets: Small and Effective Pattern Networks
- Deep Networks with Stochastic Depth
- Learning Infinite Layer Networks Without the Kernel Trick
- Coordinating Filters for Faster Deep Neural Networks
- ResBinNet: Residual Binary Neural Network
- SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks
- Efficient Sparse-Winograd Convolutional Neural Networks
- DSD: Dense-Sparse-Dense Training for Deep Neural Networks
- Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video
- Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation
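Several of the architectures above (MobileNets, Xception, ShuffleNet) get much of their efficiency from depthwise separable convolutions. As a rough illustration of the parameter savings, not taken from any single paper, a sketch of the two parameter counts:

```python
def conv_params(c_in, c_out, k):
    # Parameter count of a standard k x k convolution (bias ignored).
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise k x k conv (one filter per input channel) followed by
    # a 1 x 1 pointwise conv, as popularized by MobileNets / Xception.
    return c_in * k * k + c_in * c_out

# A typical 3x3 layer mapping 32 -> 64 channels:
standard = conv_params(32, 64, 3)                   # 18432 parameters
separable = depthwise_separable_params(32, 64, 3)   # 2336 parameters
```

For this layer the separable variant uses roughly 8x fewer parameters, and the gap grows with the channel counts.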
- Dark knowledge
- FitNets: Hints for Thin Deep Nets
- Net2Net: Accelerating Learning via Knowledge Transfer
- Distilling the Knowledge in a Neural Network
- MobileID: Face Model Compression by Distilling Knowledge from Neurons
- DarkRank: Accelerating Deep Metric Learning via Cross Sample Similarities Transfer
- Deep Model Compression: Distilling Knowledge from Noisy Teachers
- Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer
- Sequence-Level Knowledge Distillation
- Like What You Like: Knowledge Distill via Neuron Selectivity Transfer
- Learning Efficient Object Detection Models with Knowledge Distillation
- Data-Free Knowledge Distillation For Deep Neural Networks
- Learning Loss for Knowledge Distillation with Conditional Adversarial Networks
- Knowledge Projection for Effective Design of Thinner and Faster Deep Neural Networks
- Moonshine: Distilling with Cheap Convolutions
- Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification
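The distillation papers above share a common core: the student is trained to match the teacher's temperature-softened output distribution. A minimal NumPy sketch of the soft-target loss from "Distilling the Knowledge in a Neural Network" (the temperature value and logits below are illustrative):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; larger T produces softer targets.
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL divergence between softened teacher and student outputs,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * float(np.sum(p * (np.log(p) - np.log(q))))
```

In practice this soft-target term is combined with the usual cross-entropy on the hard labels.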
- Local Binary Convolutional Neural Networks
- Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration
- Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
- XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
- DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients
- Quantizing Weights and Activations in Recurrent Neural Networks
- The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning
- Quantized Convolutional Neural Networks for Mobile Devices
- Compressing Deep Convolutional Networks using Vector Quantization
- Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations
- Fixed-Point Performance Analysis of Recurrent Neural Networks
- Loss-aware Binarization of Deep Networks
- Towards the Limit of Network Quantization
- Deep Learning with Low Precision by Half-wave Gaussian Quantization
- ShiftCNN: Generalized Low-Precision Architecture for Inference of Convolutional Neural Networks
- Trained Ternary Quantization
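A recurring recipe in the binarization papers above (e.g. XNOR-Net) is to approximate a real-valued weight tensor W by alpha * sign(W), with the scale alpha chosen as the mean absolute weight. A hedged NumPy sketch of that step:

```python
import numpy as np

def binarize(W):
    # XNOR-Net-style binarization: W is approximated by alpha * sign(W).
    # alpha = mean(|W|) is the closed-form minimizer of the L2 error
    # ||W - alpha * B||^2 over binary tensors B in {-1, +1}.
    alpha = float(np.abs(W).mean())
    return alpha * np.sign(W), alpha
```

At inference time the {-1, +1} tensor enables XNOR/popcount arithmetic; only the single scale alpha is kept in floating point.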
- Data-Driven Sparse Structure Selection for Deep Neural Networks
- Fine-Pruning: Joint Fine-Tuning and Compression of a Convolutional Network with Bayesian Optimization
- Learning to Prune: Exploring the Frontier of Fast and Accurate Parsing
- Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning
- Pruning Filters for Efficient ConvNets
- Pruning Convolutional Neural Networks for Resource Efficient Inference
- Soft Weight-Sharing for Neural Network Compression
- Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
- Learning both Weights and Connections for Efficient Neural Networks
- Dynamic Network Surgery for Efficient DNNs
- ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA
- Faster CNNs with Direct Sparse Convolutions and Guided Pruning
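Most of the pruning papers above build on the same baseline: magnitude pruning, i.e. zeroing the smallest-magnitude weights and then fine-tuning (as in "Learning both Weights and Connections" and "Deep Compression"). A minimal sketch; the tie-handling at the threshold is illustrative:

```python
import numpy as np

def magnitude_prune(W, sparsity=0.5):
    # Zero out roughly the given fraction of smallest-magnitude weights.
    flat = np.abs(W).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return W.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    # Keep weights strictly above the threshold (ties are pruned).
    return W * (np.abs(W) > threshold)
```

Structured variants (e.g. "Pruning Filters for Efficient ConvNets") apply the same idea at the filter level so that the resulting network stays dense.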
- Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
- Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications
- Efficient and Accurate Approximations of Nonlinear Convolutional Networks
- Accelerating Very Deep Convolutional Networks for Classification and Detection
- Convolutional Neural Networks with Low-Rank Regularization
- Speeding up convolutional neural networks with low rank expansions
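The low-rank papers above all exploit the same fact: a weight matrix that is approximately low rank can be replaced by two thin factors, cutting both parameters and multiply-adds. A truncated-SVD sketch for a fully connected layer (convolutional weights need a reshape into a matrix first, as these papers describe):

```python
import numpy as np

def low_rank_factorize(W, rank):
    # Truncated SVD: W (m x n) ~= A @ B with A (m x rank), B (rank x n).
    # One dense layer becomes two thinner ones; the parameter count
    # drops from m*n to rank*(m + n).
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank]
    return A, B
```

The factorization is usually followed by fine-tuning to recover the small accuracy loss from truncation.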