Unofficial PyTorch Implementation of Exploring Plain Vision Transformer Backbones for Object Detection

Results | Updates | Usage | Todo | Acknowledge

This branch contains an unofficial PyTorch implementation of Exploring Plain Vision Transformer Backbones for Object Detection. Thanks to the authors for their wonderful work!

Results from this repo on COCO

The models are trained on 4 A100 machines with 2 images per GPU, giving a total batch size of 64 during training (which implies 8 GPUs per machine).
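The batch-size figure follows from simple multiplication. A minimal sketch of that arithmetic, assuming 8 GPUs per machine (the per-machine GPU count is not stated here; 8 is the value that makes the stated numbers consistent) and no gradient accumulation. The helper name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(num_machines: int, gpus_per_machine: int, images_per_gpu: int) -> int:
    """Total images per optimizer step across all GPUs (no gradient accumulation)."""
    return num_machines * gpus_per_machine * images_per_gpu


# 4 machines and 2 images per GPU come from the text above;
# 8 GPUs per machine is an assumption inferred from the stated total of 64.
print(effective_batch_size(num_machines=4, gpus_per_machine=8, images_per_gpu=2))
```

If you train on fewer GPUs, the same function gives the reduced batch size; the learning rate may then need adjusting, which this sketch does not cover.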

Model	Pretrain	Machine	Framework	Box mAP	Mask mAP	config	log	weight
ViT-Base	IN1K+MAE	TPU	Mask RCNN	51.1	45.5	config	log	OneDrive
ViT-Base	IN1K+MAE	GPU	Mask RCNN	51.1	45.4	config	log	OneDrive
ViTAE-Base	IN1K+MAE	GPU	Mask RCNN	51.6	45.8	config	log	OneDrive
ViTAE-Small	IN1K+Sup	GPU	Mask RCNN	45.6	40.1	config	log	OneDrive

Updates

[2022-04-18] Explored using small models (20M parameters) with supervised pretraining on ImageNet-1K for ViTDet (45.6 mAP). With a multi-stage structure, the results are 46.0 mAP for Swin-T and 47.8 mAP for ViTAEv2-S with Mask RCNN on COCO.

[2022-04-17] Released the pretrained weights and logs for ViT-B and ViTAE-B on MS COCO. The models are trained entirely with PyTorch on GPUs.

[2022-04-16] Released the initial unofficial implementation of ViTDet with the ViT-Base model! It obtains 51.1 mAP and 45.5 mAP on detection and segmentation, respectively. The weights and logs will be uploaded soon.

Applications of ViTAE Transformer include: image classification | object detection | semantic segmentation | animal pose segmentation | remote sensing | matting

Usage

We use PyTorch 1.9.0 or NGC docker 21.06, and mmcv 1.3.9 for the experiments.

git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
git checkout v1.3.9
MMCV_WITH_OPS=1 pip install -e .
cd ..
git clone https://github.com/ViTAE-Transformer/ViTDet.git
cd ViTDet
pip install -v -e .

After installing the two repos, install timm and einops:

pip install timm==0.4.9 einops
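Because the experiments depend on specific versions, it can help to confirm what is actually installed before training. The following is a minimal sketch using only the standard library; `check_versions` and `EXPECTED` are hypothetical names, and the distribution name `mmcv-full` is an assumption about how mmcv registers itself when built with `MMCV_WITH_OPS=1` (change it to `mmcv` if that is what your environment reports). Builds such as the NGC container may report a version string that differs from the plain release number.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions named in this README; the "mmcv-full" key is an assumption.
EXPECTED = {"torch": "1.9.0", "mmcv-full": "1.3.9", "timm": "0.4.9"}


def check_versions(expected=EXPECTED, lookup=version):
    """Return a {package: status} report comparing installed and expected versions."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        # Ignore local build suffixes such as "+cu111".
        if found.split("+")[0] == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {found}, expected {wanted}"
    return report


for package, status in check_versions().items():
    print(f"{package}: {status}")
```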

Download the pretrained models from MAE or ViTAE, and then run the experiments with:

# for single machine
bash tools/dist_train.sh <Config PATH> <NUM GPUs> --cfg-options model.pretrained=<Pretrained PATH>

# for multiple machines
python -m torch.distributed.launch --nnodes <Num Machines> --node_rank <Rank of Machine> --nproc_per_node <GPUs Per Machine> --master_addr <Master Addr> --master_port <Master Port> tools/train.py <Config PATH> --cfg-options model.pretrained=<Pretrained PATH> --launcher pytorch
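For multi-machine runs, the same command has to be issued on every machine with only the node rank changed. The sketch below assembles that command string from its parts so the per-machine variants stay consistent. It uses only the flags shown above; the function name `build_launch_command` and all example values (config path, checkpoint path, address, port) are hypothetical placeholders, not files or hosts from this repo.

```python
import shlex


def build_launch_command(config, pretrained, nnodes, node_rank,
                         nproc_per_node, master_addr, master_port):
    """Assemble the multi-machine training command as a shell-safe string."""
    args = [
        "python", "-m", "torch.distributed.launch",
        "--nnodes", str(nnodes),
        "--node_rank", str(node_rank),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", str(master_addr),
        "--master_port", str(master_port),
        "tools/train.py", config,
        "--cfg-options", f"model.pretrained={pretrained}",
        "--launcher", "pytorch",
    ]
    return shlex.join(args)


# Print one command per machine; every value here is a placeholder.
for rank in range(2):
    print(build_launch_command(
        config="configs/example_config.py",
        pretrained="/path/to/pretrained.pth",
        nnodes=2,
        node_rank=rank,
        nproc_per_node=8,
        master_addr="10.0.0.1",
        master_port=29500,
    ))
```

Each printed line is meant to be run on the machine whose rank it names; the machine with rank 0 is the one whose address is passed as the master address.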

Todo

This repo currently contains modifications including:

using LN for the convolutions in RPN and heads
using large-scale jitter for augmentation
using RPE from MViT
using longer training epochs and 1024 test size
using global attention layers
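To illustrate the large-scale jitter item in the list above: the augmentation, as it is commonly described, rescales the image by a random factor and then crops or pads it to a fixed square. The sketch below computes only the resulting geometry in plain Python and touches no pixels. The scale range (0.1, 2.0) and the 1024 target are assumptions based on the usual description of this augmentation, not values read from this repo's configs, and `lsj_geometry` is a hypothetical helper.

```python
import random


def lsj_geometry(width, height, target=1024, scale_range=(0.1, 2.0), rng=random):
    """Return resize, crop, and pad sizes for one large-scale jitter sample."""
    # Sample the jitter factor (assumed range, see the note above).
    scale = rng.uniform(*scale_range)
    # Resize so the image fits inside a (target * scale) square, keeping aspect ratio.
    ratio = min(target * scale / width, target * scale / height)
    new_w = max(1, round(width * ratio))
    new_h = max(1, round(height * ratio))
    # Crop whatever exceeds the target, then pad the remainder up to the target.
    crop_w, crop_h = min(new_w, target), min(new_h, target)
    pad_w, pad_h = target - crop_w, target - crop_h
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        "crop": (crop_w, crop_h),
        "pad": (pad_w, pad_h),
    }


# Example: one random sample for a 640x480 image.
print(lsj_geometry(640, 480, rng=random.Random(0)))
```

Whatever scale is drawn, the crop and pad sizes always add up to the target, so every sample ends as a fixed-size square input.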

There are other things to do:

Implement the conv blocks for global information communication
Tune the models for Cascade RCNN
Train ViT models for the LVIS dataset
Train ViTAE model with the ViTDet framework

Acknowledge

We acknowledge the excellent implementations from mmdetection, MAE, MViT, and BEiT.

Citing ViTDet

@article{Li2022ExploringPV,
  title={Exploring Plain Vision Transformer Backbones for Object Detection},
  author={Yanghao Li and Hanzi Mao and Ross B. Girshick and Kaiming He},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.16527}
}

For ViTAE and ViTAEv2, please refer to:

@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}

@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
