This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.
The original MoCo v3 was implemented in TensorFlow and run on TPUs. This repo re-implements it in PyTorch on GPUs. Despite the library and numerical differences, this repo reproduces the results and observations in the paper.
The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.
Pre-trained models and configs can be found at CONFIG.md.
pretrain epochs | pretrain crops | linear acc |
---|---|---|
100 | 2x224 | 68.9 |
300 | 2x224 | 72.8 |
1000 | 2x224 | 74.6 |

model | pretrain epochs | pretrain crops | linear acc |
---|---|---|---|
ViT-Small | 300 | 2x224 | 73.2 |
ViT-Base | 300 | 2x224 | 76.7 |

model | pretrain epochs | pretrain crops | e2e acc |
---|---|---|---|
ViT-Small | 300 | 2x224 | 81.4 |
ViT-Base | 300 | 2x224 | 83.2 |
The end-to-end fine-tuning results are obtained with the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs. DeiT-B's 300 epochs, which reaches 81.8% accuracy).
Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo makes minimal modifications to the official PyTorch ImageNet code. We assume the user can successfully run the official PyTorch ImageNet code.
For ViT models, install timm (`timm==0.4.9`).
The code has been tested with CUDA 10.2/cuDNN 7.6.5, PyTorch 1.9.0 and timm 0.4.9.

    conda create -n mocov3 python=3.8
    conda activate mocov3
    pip install -r requirements.txt
Below are three examples for MoCo v3 pre-training.
On the first node, run:

    python main_moco.py \
      --moco-m-cos --crop-min=.2 \
      --dist-url 'tcp://[your first node address]:[specified port]' \
      --multiprocessing-distributed --world-size 2 --rank 0 \
      [your imagenet-folder with train and val folders]
On the second node, run the same command with `--rank 1`.
With a batch size of 4096, the training can fit into 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with single-node training and a batch size of 1024:

    python main_moco.py \
      -a vit_small -b 1024 \
      --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
      --epochs=300 --warmup-epochs=40 \
      --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
      --dist-url 'tcp://localhost:10001' \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      [your imagenet-folder with train and val folders]
With a batch size of 4096, ViT-Base is trained with 8 nodes. On the first node, run:

    python main_moco.py \
      -a vit_base \
      --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
      --epochs=300 --warmup-epochs=40 \
      --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
      --dist-url 'tcp://[your first node address]:[specified port]' \
      --multiprocessing-distributed --world-size 8 --rank 0 \
      [your imagenet-folder with train and val folders]
On other nodes, run the same command with `--rank 1`, ..., `--rank 7` respectively.
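Because the per-node commands differ only in `--rank`, they can be generated instead of typed by hand. The following is a minimal sketch, not part of this repo: the helper name `build_node_commands`, the example address and port, and the dataset path are hypothetical placeholders, while the flags are the ones shown in the commands above.

```python
import shlex


def build_node_commands(base_args, dist_url, num_nodes, data_dir):
    """Return one launch command per node; only --rank differs between them."""
    commands = []
    for rank in range(num_nodes):
        argv = (
            ["python", "main_moco.py"]
            + list(base_args)
            + [
                "--dist-url", dist_url,
                "--multiprocessing-distributed",
                "--world-size", str(num_nodes),
                "--rank", str(rank),
                data_dir,
            ]
        )
        # shlex.join quotes arguments where the shell would need it.
        commands.append(shlex.join(argv))
    return commands


# Flags copied from the ViT-Base example; address and path are placeholders.
vit_base_args = [
    "-a", "vit_base",
    "--optimizer=adamw", "--lr=1.5e-4", "--weight-decay=.1",
    "--epochs=300", "--warmup-epochs=40",
    "--stop-grad-conv1", "--moco-m-cos", "--moco-t=.2",
]
commands = build_node_commands(
    vit_base_args, "tcp://10.0.0.1:10001", 8, "/path/to/imagenet"
)
for cmd in commands:
    print(cmd)
```

Each printed line is meant to be run on the node whose index matches its `--rank` value.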
- The batch size specified by `-b` is the total batch size across all GPUs.
- The learning rate specified by `--lr` is the base lr, and is adjusted by the linear lr scaling rule (lr × batch size / 256).
- Using a smaller batch size gives more stable results (see paper), but trains more slowly. Using a large batch size is critical for good speed on TPUs (as we did in the paper).
- In this repo, only multi-GPU DistributedDataParallel training is supported; single-GPU or DataParallel training is not supported. The code has been improved to better suit the multi-node setting, and by default uses automatic mixed precision for pre-training.
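The first two notes can be made concrete with a short calculation. This is a minimal sketch, assuming the linear scaling rule has the form lr = base lr × batch size / 256 as described in the paper, and that the total batch is split evenly across GPUs. The function names are illustrative and not taken from this repo, and 8 GPUs per node is an assumption.

```python
def scaled_lr(base_lr, total_batch_size, reference_batch_size=256):
    """Linear lr scaling: the base lr is multiplied by batch_size / 256."""
    return base_lr * total_batch_size / reference_batch_size


def per_gpu_batch_size(total_batch_size, num_nodes, gpus_per_node):
    """Split the total batch size (the value passed to -b) evenly across GPUs."""
    world_size = num_nodes * gpus_per_node
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // world_size


# ViT-Small example: --lr=1.5e-4, -b 1024, one node (8 GPUs assumed).
vit_small_lr = scaled_lr(1.5e-4, 1024)
vit_small_per_gpu = per_gpu_batch_size(1024, num_nodes=1, gpus_per_node=8)

# ViT-Base example: --lr=1.5e-4, batch size 4096, eight nodes (8 GPUs each assumed).
vit_base_lr = scaled_lr(1.5e-4, 4096)
vit_base_per_gpu = per_gpu_batch_size(4096, num_nodes=8, gpus_per_node=8)

print(f"ViT-Small: lr={vit_small_lr:.1e}, per-GPU batch={vit_small_per_gpu}")
print(f"ViT-Base:  lr={vit_base_lr:.1e}, per-GPU batch={vit_base_per_gpu}")
```

Under these assumptions, the same base lr of 1.5e-4 yields a four times larger effective lr at batch size 4096 than at batch size 1024.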
By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

    python main_lincls.py \
      -a [architecture] --lr [learning rate] \
      --dist-url 'tcp://localhost:10001' \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
      [your imagenet-folder with train and val folders]
To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DeiT format:

    python convert_to_deit.py \
      --input [your checkpoint path]/[your checkpoint file].pth.tar \
      --output [target checkpoint file].pth
Then run the training (in the DeiT repo) with the converted checkpoint:

    python $DEIT_DIR/main.py \
      --resume [target checkpoint file].pth \
      --epochs 150
This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.
Note:
- We use `--resume` rather than `--finetune` in the DeiT repo, as its `--finetune` option trains under eval mode. When loading the pre-trained model, change `model_without_ddp.load_state_dict(checkpoint['model'])` to pass `strict=False`.
- Our ViT-Small uses `heads=12` in the Transformer block, while the DeiT default is `heads=6`. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.
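The effect of `strict=False` can be illustrated without PyTorch. With strict loading, any mismatch between the keys the model expects and the keys in the checkpoint is an error; with `strict=False`, mismatched keys are skipped and reported instead. The sketch below mimics only that key comparison on plain lists of names. The key names are made up for illustration (a hypothetical checkpoint without a classification head), and `compare_keys` is a hypothetical helper, not a PyTorch or DeiT function.

```python
def compare_keys(model_keys, checkpoint_keys, strict=True):
    """Mimic the key check of a state-dict load.

    Returns (missing, unexpected). Raises RuntimeError on any mismatch
    when strict is True, similar in spirit to a strict load.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # expected by the model, absent from the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # present in the checkpoint, unknown to the model
    if strict and (missing or unexpected):
        raise RuntimeError(
            f"missing keys: {missing}, unexpected keys: {unexpected}"
        )
    return missing, unexpected


# Hypothetical key names: the model has a head that the checkpoint lacks.
model_keys = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "head.weight",
    "head.bias",
]
checkpoint_keys = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
]

missing, unexpected = compare_keys(model_keys, checkpoint_keys, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)
```

In this made-up case a strict comparison would raise, while the non-strict one reports the head parameters as missing and leaves them at their initial values.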
See the commands listed in CONFIG.md for specific model configs, including our recommended hyper-parameters and pre-trained reference models.
For transfer learning, see the instructions in the `transfer` dir.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

    @Article{chen2021mocov3,
      author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
      title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
      journal = {arXiv preprint arXiv:2104.02057},
      year    = {2021},
    }