The official implementation of Arch-Net: Model Distillation for Architecture Agnostic Model Deployment
TL;DR: Arch-Net is a family of neural networks built from simple and efficient operators. When an Arch-Net is produced, less common network constructs, such as Layer Normalization and Embedding layers, are eliminated progressively through label-free Blockwise Model Distillation, while sub-eight-bit quantization is performed at the same time to maximize performance. For the classification task, only 30k unlabeled images randomly sampled from the ImageNet dataset are needed.
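The distillation is label-free: each Arch-Net block is trained to reproduce the output of the corresponding teacher block on unlabeled inputs. Below is a minimal, illustrative PyTorch sketch of that idea; the block definitions, shapes, and hyperparameters are placeholders, not the repo's actual training code.

```python
# Minimal sketch of label-free blockwise distillation (illustrative only).
# A student block is trained to match the corresponding teacher block's
# output on unlabeled images; no ground-truth labels are used.
import torch
import torch.nn as nn

teacher_block = nn.Sequential(            # stands in for one stage of the teacher network
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()
).eval()
student_block = nn.Sequential(            # Arch-Net counterpart built from simple operators
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU()
)

optimizer = torch.optim.Adam(student_block.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for _ in range(10):                       # iterate over unlabeled images in practice
    x = torch.randn(8, 64, 32, 32)        # placeholder for features of unlabeled ImageNet images
    with torch.no_grad():
        target = teacher_block(x)         # teacher output is the distillation target
    loss = criterion(student_block(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```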
ImageNet Classification
Model | Bit Width | Top-1 Acc. (%) | Top-5 Acc. (%) |
---|---|---|---|
Arch-Net_Resnet18 | 32w32a | 69.76 | 89.08 |
Arch-Net_Resnet18 | 2w4a | 68.77 | 88.66 |
Arch-Net_Resnet34 | 32w32a | 73.30 | 91.42 |
Arch-Net_Resnet34 | 2w4a | 72.40 | 91.01 |
Arch-Net_Resnet50 | 32w32a | 76.13 | 92.86 |
Arch-Net_Resnet50 | 2w4a | 74.56 | 92.39 |
Arch-Net_MobilenetV1 | 32w32a | 68.79 | 88.68 |
Arch-Net_MobilenetV1 | 2w4a | 67.29 | 88.07 |
Arch-Net_MobilenetV2 | 32w32a | 71.88 | 90.29 |
Arch-Net_MobilenetV2 | 2w4a | 69.09 | 89.13 |
Multi30k Machine Translation
Model | Translation Direction | Bit Width | BLEU |
---|---|---|---|
Transformer | English to German | 32w32a | 32.44 |
Transformer | English to German | 2w4a | 33.75 |
Transformer | English to German | 4w4a | 34.35 |
Transformer | English to German | 8w8a | 36.44 |
Transformer | German to English | 32w32a | 30.32 |
Transformer | German to English | 2w4a | 32.50 |
Transformer | German to English | 4w4a | 34.34 |
Transformer | German to English | 8w8a | 34.05 |
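In the tables, the bit-width notation XwYa means X-bit weights and Y-bit activations (e.g. 2w4a is 2-bit weights with 4-bit activations; 32w32a is the full-precision baseline). The sketch below is a generic uniform fake-quantization example purely to illustrate the notation; it is not the quantizer used in this repo.

```python
# Generic symmetric uniform fake quantization, illustrating what "2w4a" denotes:
# 2-bit weights and 4-bit activations (not the repo's actual quantizer).
import torch

def fake_quantize(x: torch.Tensor, num_bits: int) -> torch.Tensor:
    # Quantize to 2**num_bits levels, then dequantize back to float.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / qmax if x.abs().max() > 0 else 1.0
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

weights = torch.randn(16, 16)
activations = torch.relu(torch.randn(8, 16))
w_q = fake_quantize(weights, num_bits=2)      # "2w": 2-bit weights
a_q = fake_quantize(activations, num_bits=4)  # "4a": 4-bit activations
```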
Python == 3.6
Refer to requirements.txt for more details.
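For example, the dependencies can be installed with pip (assuming a Python 3.6 environment is already active):

pip3 install -r requirements.txt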
Download the ImageNet and Multi30k data (Google Drive or BaiduYun, code: 8brd) and put them in ./arch-net/data/ as follows:
./data/
├── imagenet
│   ├── train
│   └── val
└── multi30k
Download the teacher models from Google Drive or BaiduYun (code: 57ew) and put them in ./arch-net/models/teacher/pretrained_models/
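Optionally, you can sanity-check that the data and teacher models are in the expected locations before training. This is a small standalone script, not part of the repo; the paths simply mirror the layout described above.

```python
# Check that the downloaded data and teacher models are where the
# training scripts expect them (paths mirror the layout above).
import os

root = "./arch-net"
expected = [
    "data/imagenet/train",
    "data/imagenet/val",
    "data/multi30k",
    "models/teacher/pretrained_models",
]
for rel in expected:
    path = os.path.join(root, rel)
    print(f"{path}: {'OK' if os.path.exists(path) else 'MISSING'}")
```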
Train and evaluate
cd ./train_imagenet
python3 -m torch.distributed.launch --nproc_per_node=8 train_archnet_resnet18.py -j 8 --weight-bit 2 --feature-bit 4 --lr 0.001 --num_gpus 8 --sync-bn
Evaluate if you already have the trained models
python3 -m torch.distributed.launch --nproc_per_node=8 train_archnet_resnet18.py -j 8 --weight-bit 2 --feature-bit 4 --lr 0.001 --num_gpus 8 --sync-bn --evaluate
Train an Arch-Net_Transformer with 2w4a
cd ./train_transformer
python3 train_archnet_transformer.py --translate_direction en2de --teacher_model_path ../models/teacher/pretrained_models/transformer_en_de.chkpt --data_pkl ../data/multi30k/m30k_ende_shr.pkl --batch_size 48 --final_epochs 50 --weight_bit 2 --feature_bit 4 --lr 1e-3 --weight_decay 1e-6 --label_smoothing
- For an 8w8a Arch-Net_Transformer, use a learning rate of 1e-3 and a weight decay of 1e-4
evaluate
cd ./evaluate
python3 translate.py --data_pkl ./data/multi30k/m30k_ende_shr.pkl --model path_to_the_output_directory/model_max_acc.chkpt
- To get the BLEU score of the evaluated results, go to this website, then upload 'predictions.txt' from the output directory together with 'gt_en.txt' or 'gt_de.txt' from ./arch-net/data_gt/multi30k/ (or compute BLEU locally as sketched below)
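Alternatively, BLEU can be computed locally with a third-party tool such as sacrebleu (not part of this repo). A minimal sketch, assuming one sentence per line in both files; the score may differ slightly from the website's due to tokenization:

```python
# Compute corpus BLEU locally with sacrebleu (third-party, not part of the repo).
import sacrebleu

with open("path_to_the_output_directory/predictions.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("./arch-net/data_gt/multi30k/gt_de.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)
```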
If you find this project useful for your research, please consider citing the paper.
@misc{xu2021archnet,
title={Arch-Net: Model Distillation for Architecture Agnostic Model Deployment},
author={Weixin Xu and Zipeng Feng and Shuangkang Fang and Song Yuan and Yi Yang and Shuchang Zhou},
year={2021},
eprint={2111.01135},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
attention-is-all-you-need-pytorch
If you have any questions, feel free to open an issue or contact us at xuweixin02@megvii.com.