Applying PVT to Object Detection

Our detection code is developed on top of MMDetection v2.13.0.

For details, see Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.

If you use this code for a paper, please cite:


      title={Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions}, 
      author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},


      title={PVTv2: Improved Baselines with Pyramid Vision Transformer}, 
      author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},


Install MMDetection v2.13.0.


pip install mmdet==2.13.0 --user
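To confirm that the installed version matches the one this code was developed against, a small check like the following may help. It is a sketch that assumes the package is registered under the distribution name `mmdet`, as in the pip command above; the helper names `matches_required` and `installed_mmdet_version` are hypothetical and not part of MMDetection.

```python
from importlib import metadata

REQUIRED = "2.13.0"  # version stated in this README


def matches_required(installed, required=REQUIRED):
    """Return True when the installed version string equals the required one."""
    return installed is not None and installed.strip() == required


def installed_mmdet_version():
    """Return the installed mmdet version, or None if it is not installed."""
    try:
        return metadata.version("mmdet")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_mmdet_version()
    if matches_required(version):
        print(f"mmdet {version} matches the required version")
    else:
        print(f"expected mmdet {REQUIRED}, found {version}")
```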

Apex (optional):

git clone
cd apex
python install --cpp_ext --cuda_ext --user

To disable apex, set the runner type to EpochBasedRunner and comment out the following code block in the configuration files:

fp16 = None
optimizer_config = dict(
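As an illustration of the runner change described above, the runner entry in a config file could look like the sketch below. The value `max_epochs=12` is an assumption based on the 12-epoch (1x) schedule mentioned in this README; configs with other schedules would need a different value.

```python
# Sketch of a runner entry for training without apex.
# EpochBasedRunner is the runner type named in this README; max_epochs=12
# is an assumption matching the 12-epoch (1x) schedule.
runner = dict(type="EpochBasedRunner", max_epochs=12)
```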

Data preparation

Prepare COCO according to the guidelines in MMDetection v2.13.0.

Results and models

  • PVTv2 on COCO

| Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
|---|---|---|---|---|---|---|---|---|
| RetinaNet | PVTv2-b0 | ImageNet-1K | 1x | No | 37.2 | - | config | log & model |
| RetinaNet | PVTv2-b1 | ImageNet-1K | 1x | No | 41.2 | - | config | log & model |
| RetinaNet | PVTv2-b2-li | ImageNet-1K | 1x | No | 43.6 | - | config | log & model |
| RetinaNet | PVTv2-b2 | ImageNet-1K | 1x | No | 44.6 | - | config | log & model |
| RetinaNet | PVTv2-b3 | ImageNet-1K | 1x | No | 45.9 | - | config | log & model |
| RetinaNet | PVTv2-b4 | ImageNet-1K | 1x | No | 46.1 | - | config | log & model |
| RetinaNet | PVTv2-b5 | ImageNet-1K | 1x | No | 46.2 | - | config | log & model |
| Mask R-CNN | PVTv2-b0 | ImageNet-1K | 1x | No | 38.2 | 36.2 | config | log & model |
| Mask R-CNN | PVTv2-b1 | ImageNet-1K | 1x | No | 41.8 | 38.8 | config | log & model |
| Mask R-CNN | PVTv2-b2-li | ImageNet-1K | 1x | No | 44.1 | 40.5 | config | log & model |
| Mask R-CNN | PVTv2-b2 | ImageNet-1K | 1x | No | 45.3 | 41.2 | config | log & model |
| Mask R-CNN | PVTv2-b3 | ImageNet-1K | 1x | No | 47.0 | 42.5 | config | log & model |
| Mask R-CNN | PVTv2-b4 | ImageNet-1K | 1x | No | 47.5 | 42.7 | config | log & model |
| Mask R-CNN | PVTv2-b5 | ImageNet-1K | 1x | No | 47.4 | 42.5 | config | log & model |


| Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
|---|---|---|---|---|---|---|---|---|
| Mask R-CNN | PVTv2-b0 | ImageNet-1K | 3x | Yes | 41.6 | 38.2 | config | log & model |
| Mask R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 47.8 | 43.1 | config | log & model |


| Method | Backbone | Pretrain | Lr schd | Aug | box AP | mask AP | Config | Download |
|---|---|---|---|---|---|---|---|---|
| Cascade Mask R-CNN | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 50.9 | 44.0 | config | log & model |
| Cascade Mask R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 51.1 | 44.4 | config | log & model |
| ATSS | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 48.9 | - | config | log & model |
| ATSS | PVTv2-b2 | ImageNet-1K | 3x | Yes | 49.9 | - | config | log & model |
| GFL | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 49.2 | - | config | log & model |
| GFL | PVTv2-b2 | ImageNet-1K | 3x | Yes | 50.2 | - | config | log & model |
| Sparse R-CNN | PVTv2-b2-Linear | ImageNet-1K | 3x | Yes | 48.9 | - | config | log & model |
| Sparse R-CNN | PVTv2-b2 | ImageNet-1K | 3x | Yes | 50.1 | - | config | log & model |

  • PVTv1 on COCO

| Method | Backbone | Pretrain | Lr schd | box AP | mask AP | Config | Download |
|---|---|---|---|---|---|---|---|
| RetinaNet | PVT-Tiny | ImageNet-1K | 1x | 36.7 | - | config | log & model |
| RetinaNet (640x) | PVT-Small | ImageNet-1K | 1x | 38.7 | - | config | log & model |
| RetinaNet (800x) | PVT-Small | ImageNet-1K | 1x | 40.4 | - | config | log & model |
| RetinaNet | PVT-Medium | ImageNet-1K | 1x | 41.9 | - | config | log & model |
| RetinaNet | PVT-Large | ImageNet-1K | 1x | 42.6 | - | config | log & model |
| Mask R-CNN | PVT-Tiny | ImageNet-1K | 1x | 36.7 | 35.1 | config | log & model |
| Mask R-CNN | PVT-Small | ImageNet-1K | 1x | 40.4 | 37.8 | config | log & model |
| Mask R-CNN | PVT-Medium | ImageNet-1K | 1x | 42.0 | 39.0 | config | log & model |
| Mask R-CNN | PVT-Large | ImageNet-1K | 1x | 42.9 | 39.5 | config | log & model |
| DETR | PVT-Small | ImageNet-1K | 50ep | 34.7 | - | config | log & model |


To evaluate PVT-Small + RetinaNet (640x) on COCO val2017 on a single node with 8 GPUs, run:

configs/ /path/to/checkpoint_file 8 --out results.pkl --eval bbox

This should give

Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.387
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=1000 ] = 0.593
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=1000 ] = 0.408
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.212
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.416
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.544
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.545
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=300 ] = 0.545
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=1000 ] = 0.545
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=1000 ] = 0.329
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.583
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=1000 ] = 0.721
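The `--out results.pkl` option in the evaluation command also saves the raw detections, which can be inspected afterwards. The sketch below assumes the layout commonly produced by MMDetection 2.x for box-only detectors: a list with one entry per image, each a list with one array per class, whose rows are `[x1, y1, x2, y2, score]`. That layout is an assumption to verify against your own file, and the helper names `load_results` and `count_confident_boxes` are hypothetical.

```python
import pickle


def load_results(path):
    """Load the pickle file written by --out."""
    with open(path, "rb") as f:
        return pickle.load(f)


def count_confident_boxes(results, score_thr=0.5):
    """Count boxes whose score (fifth column) is at least score_thr.

    Assumes `results` is a list over images, each a list over classes of
    rows [x1, y1, x2, y2, score]. This layout is an assumption about
    box-only detectors and does not cover results that include masks.
    """
    total = 0
    for per_image in results:
        for per_class in per_image:
            for row in per_class:
                if row[4] >= score_thr:
                    total += 1
    return total


if __name__ == "__main__":
    # Tiny synthetic example: one image, two classes.
    demo = [[[[0, 0, 10, 10, 0.9], [5, 5, 20, 20, 0.3]], [[1, 1, 4, 4, 0.6]]]]
    print(count_confident_boxes(demo))
```

For a real file, `count_confident_boxes(load_results("results.pkl"))` would be the corresponding call; unpickling it requires NumPy to be installed, since the saved arrays are typically NumPy arrays.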


To train PVT-Small + RetinaNet (640x) on COCO train2017 on a single node with 8 GPUs for 12 epochs, run:

configs/ 8


python demo.jpg /path/to/config_file /path/to/checkpoint_file

Calculating FLOPs & Params

python configs/

This should give

Input shape: (3, 1280, 800)
Flops: 260.65 GFLOPs
Params: 33.11 M


This repository is released under the Apache 2.0 license as found in the LICENSE file.