This is a PyTorch implementation of our ACM MM 2022 paper. We present a new gating unit, PoSGU, which replaces the FC layer in the SGU of gMLP with relative positional encoding methods (specifically, LRPE and GQPE), and use it as the key building block of a new vision MLP architecture, referred to as PosMLP. We also hope this work will inspire further theoretical study of positional encoding in vision MLPs, so that it can reach the same maturity of application it enjoys in vision Transformers.
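To make the idea concrete, below is a minimal PyTorch sketch, not the exact PoSGU from the paper: it keeps gMLP's channel split and multiplicative gating, but parameterizes the token-mixing matrix with a learnable 1-D relative-position bias (an LRPE-style weight sharing) instead of an N×N FC layer. The class name `PoSGULite` and its arguments are hypothetical; see the paper and code for the actual LRPE/GQPE formulations.

```python
import torch
import torch.nn as nn

class PoSGULite(nn.Module):
    """Hypothetical sketch of a positional spatial gating unit.

    The N x N token-mixing FC of gMLP's SGU is replaced by a learnable
    relative-position bias, shrinking parameters from O(N^2) to O(N).
    """
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        # One learnable weight per relative offset, 2N - 1 in total.
        self.rel_bias = nn.Parameter(torch.zeros(2 * seq_len - 1))
        idx = torch.arange(seq_len)
        # rel_index[i, j] = (i - j) shifted into [0, 2 * seq_len - 2]
        self.register_buffer("rel_index", idx[:, None] - idx[None, :] + seq_len - 1)

    def forward(self, x):            # x: (B, N, C)
        u, v = x.chunk(2, dim=-1)    # channel split, as in gMLP's SGU
        v = self.norm(v)
        w = self.rel_bias[self.rel_index]      # (N, N) positional mixing matrix
        v = torch.einsum("mn,bnc->bmc", w, v)  # mix tokens by relative position
        return u * v                 # gating: (B, N, C/2)
```

For a 7×7 window flattened to 49 tokens, `PoSGULite(64, 49)` maps a `(B, 49, 64)` input to a `(B, 49, 32)` output with only 97 mixing parameters instead of 49² = 2401.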
Our code is based on pytorch-image-models, attention-cnn, swin-transformer, and vision-Permutator.
Model | Parameters | Image resolution | Top-1 Acc. | Download |
---|---|---|---|---|
gMLP-S | 20M | 224 | 79.6% | |
Hire-MLP-S | 33M | 224 | 81.8% | |
ViP-Small/7 | 25M | 224 | 81.5% | |
PosMLP-T | 21M | 224 | 82.1% | Baidu Netdisk / Google Drive |
S2-MLP-deep | 51M | 224 | 80.7% | |
Mixer-B/16 | 59M | 224 | 78.5% | |
ViP-Medium/7 | 55M | 224 | 82.7% | |
AS-MLP-S | 50M | 224 | 83.1% | |
PosMLP-S | 37M | 224 | 83.0% | released soon |
gMLP-B | 73M | 224 | 81.6% | |
ResMLP-B24 | 116M | 224 | 81.0% | |
ViP-Large/7 | 88M | 224 | 83.2% | |
Hire-MLP-L | 96M | 224 | 83.4% | |
PosMLP-B | 82M | 224 | 83.6% | |
The experiments are conducted on 8 RTX 3090 GPUs. The following dependencies are required:
torch>=1.4.0
torchvision>=0.5.0
pyyaml
timm==0.4.5
apex (only if you use 'apex amp')
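Assuming a standard pip setup, the Python dependencies above can be installed as follows (apex must be built from source separately, following NVIDIA's instructions):

```bash
pip install "torch>=1.4.0" "torchvision>=0.5.0" pyyaml timm==0.4.5
```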
Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this script. Please update the data folder path in the config files.
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
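This is the standard torchvision ImageFolder layout. As a quick sanity check of your extraction (the root path `imagenet/` is just an example):

```python
from torchvision import datasets

train_set = datasets.ImageFolder("imagenet/train")
val_set = datasets.ImageFolder("imagenet/val")
# ImageNet-1k should report 1000 classes, ~1.28M train and 50,000 val images
print(len(train_set.classes), len(train_set), len(val_set))
```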
Command line for training PosMLP-T on 4 GPUs (RTX 3090):
bash scripts/distributed_train.sh
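If the script follows the pytorch-image-models convention of `distributed_train.sh <num_gpus> <data_dir> <extra train.py args>`, an invocation might look like the following (the model name `posmlp_t` and the hyperparameters are placeholders; check the script and config files for the actual values):

```bash
bash scripts/distributed_train.sh 4 /path/to/imagenet \
    --model posmlp_t -b 128 --epochs 300
```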
Please download a checkpoint from the table above, specify the data and model paths in the script, and run the test with:
CUDA_VISIBLE_DEVICES=0 bash scripts/test.sh
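If `scripts/test.sh` wraps pytorch-image-models' `validate.py`, the equivalent direct call would look roughly like this (the paths and model name are placeholders):

```bash
CUDA_VISIBLE_DEVICES=0 python validate.py /path/to/imagenet \
    --model posmlp_t --checkpoint /path/to/posmlp_t.pth
```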
This repository is released under the MIT License as found in the LICENSE file.