PyTorch implementation of "Patch-level Representation Learning for Self-supervised Vision Transformers" (accepted as an oral presentation at CVPR 2022)
torch==1.7.0
torchvision==0.8.1
python -m torch.distributed.launch --nproc_per_node=8 main_selfpatch.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir --local_crops_number 8 --patch_size 16 --batch_size_per_gpu 128 --out_dim_selfpatch 4096 --k_num 4
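The launch command above can also be assembled programmatically, which makes it easier to vary flags between runs. The following is a minimal sketch: the flag names and values are copied from the command above, while the helper name `build_launch_command` is hypothetical and not part of this repository. The global batch size it reports assumes a single node where each of the 8 processes consumes `batch_size_per_gpu` samples per step.

```python
import shlex


def build_launch_command(nproc_per_node, script, **flags):
    """Hypothetical helper: build the argv list for torch.distributed.launch."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        script,
    ]
    # Append each flag as "--name value", preserving insertion order.
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Flag values taken verbatim from the training command in this README.
flags = dict(
    arch="vit_small",
    data_path="/path/to/imagenet/train",
    output_dir="/path/to/saving_dir",
    local_crops_number=8,
    patch_size=16,
    batch_size_per_gpu=128,
    out_dim_selfpatch=4096,
    k_num=4,
)
nproc_per_node = 8
cmd = build_launch_command(nproc_per_node, "main_selfpatch.py", **flags)

# Assumption: one node, one process per GPU, so the global batch is the product.
global_batch_size = nproc_per_node * flags["batch_size_per_gpu"]

print(shlex.join(cmd))
print("global batch size:", global_batch_size)
```

If you train on fewer GPUs, the global batch size shrinks proportionally unless `--batch_size_per_gpu` is raised to compensate.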
You can download the weights of the models pretrained on ImageNet. All models are trained with the ViT-S/16 architecture. For the detection and segmentation downstream tasks, please check SelfPatch/detection and SelfPatch/segmentation.
backbone | arch | checkpoint |
---|---|---|
DINO | ViT-S/16 | download (pretrained model from VISSL) |
DINO + SelfPatch | ViT-S/16 | download |
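When loading a downloaded checkpoint into a bare ViT-S/16 backbone, the parameter names often carry wrapper prefixes that have to be removed first. The sketch below is an assumption based on the DINO code base this repository builds on: it supposes that keys may start with `module.` (from distributed training) or `backbone.` (from a wrapper around the backbone), and that the checkpoint may nest its weights under a `teacher` entry. The helper name `strip_prefixes` is hypothetical; check the actual keys of the file you downloaded before relying on it.

```python
def strip_prefixes(state_dict, prefixes=("module.", "backbone.")):
    """Hypothetical helper: return a copy of state_dict with leading prefixes removed.

    Prefixes are checked in order, so "module.backbone.x" becomes "x".
    """
    cleaned = {}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Toy stand-in for a checkpoint; real values would be tensors.
example = {
    "module.backbone.cls_token": 1,
    "backbone.pos_embed": 2,
    "head.weight": 3,
}
cleaned = strip_prefixes(example)
print(sorted(cleaned))

# Assumed usage with PyTorch (not executed here; the "teacher" key is an assumption):
#   import torch
#   ckpt = torch.load("/path/to/model_dir", map_location="cpu")
#   state = strip_prefixes(ckpt.get("teacher", ckpt))
#   missing = model.load_state_dict(state, strict=False)
```

Loading with `strict=False` and inspecting the returned missing and unexpected keys is a cheap way to confirm that the prefixes were guessed correctly.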
Step 1. Prepare DAVIS 2017 data
cd $HOME
git clone https://github.com/davisvideochallenge/davis-2017
cd davis-2017
./data/get_davis.sh
Step 2. Run video object segmentation
python eval_video_segmentation.py --data_path /path/to/davis-2017/DAVIS/ --output_dir /path/to/saving_dir --pretrained_weights /path/to/model_dir --arch vit_small --patch_size 16
Step 3. Evaluate the obtained segmentation
git clone https://github.com/davisvideochallenge/davis2017-evaluation $HOME/davis2017-evaluation
python /path/to/davis2017-evaluation/evaluation_method.py --task semi-supervised --davis_path /path/to/davis-2017/DAVIS --results_path /path/to/saving_dir
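Steps 2 and 3 share two paths: the directory written by `--output_dir` in Step 2 is the one read by `--results_path` in Step 3, and both steps point at the same DAVIS folder. A minimal sketch that builds both commands from one set of paths is shown below. The flags are those used above; the function name `davis_eval_commands` is hypothetical, and the paths are the same placeholders as in the commands.

```python
from pathlib import PurePosixPath


def davis_eval_commands(davis_path, saving_dir, weights, eval_repo):
    """Hypothetical helper: return (segment, evaluate) argv lists for Steps 2 and 3."""
    segment = [
        "python", "eval_video_segmentation.py",
        "--data_path", str(davis_path),
        "--output_dir", str(saving_dir),
        "--pretrained_weights", str(weights),
        "--arch", "vit_small",
        "--patch_size", "16",
    ]
    evaluate = [
        "python", str(PurePosixPath(eval_repo) / "evaluation_method.py"),
        "--task", "semi-supervised",
        "--davis_path", str(davis_path),
        # Must match --output_dir of the segmentation step.
        "--results_path", str(saving_dir),
    ]
    return segment, evaluate


segment, evaluate = davis_eval_commands(
    davis_path="/path/to/davis-2017/DAVIS",
    saving_dir="/path/to/saving_dir",
    weights="/path/to/model_dir",
    eval_repo="/path/to/davis2017-evaluation",
)
print(" ".join(segment))
print(" ".join(evaluate))

# To actually run them in order (not executed here):
#   import subprocess
#   subprocess.run(segment, check=True)
#   subprocess.run(evaluate, check=True)
```

Building both commands from the same variables avoids the common mistake of evaluating a stale results directory from an earlier run.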
Video (left), DINO (middle) and our SelfPatch (right)
Our code base builds partly on the following packages: DINO, mmdetection, mmsegmentation, and XCiT.
If you use this code for your research, please cite our paper.
@InProceedings{Yun_2022_CVPR,
author = {Yun, Sukmin and Lee, Hankook and Kim, Jaehyung and Shin, Jinwoo},
title = {Patch-Level Representation Learning for Self-Supervised Vision Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {8354-8363}
}