🏠[Project page] 📄[arXiv] 📄[PDF] 🔥[New Dataset Download]
This repository contains code for CVPR2023 paper:
GRES: Generalized Referring Expression Segmentation
Chang Liu, Henghui Ding, Xudong Jiang
CVPR 2023 Highlight, Acceptance Rate 2.5%
- (2023/08/29) We have updated and reorganized the dataset file. Please download the latest version for train/val/testA/testB! (Note: training expressions are unchanged so the this does not influence training. But some
ref_id
andsent_id
are re-numbered for better organization.) - (2023/08/16) A new large-scale referring video segmentation dataset MeViS is released.
The code is tested under CUDA 11.8, Pytorch 1.11.0 and Detectron2 0.6.
- Install Detectron2 following the manual
- Run
sh make.sh
undergres_model/modeling/pixel_decoder/ops
- Install other required packages:
pip -r requirements.txt
- Prepare the dataset following
datasets/DATASET.md
python train_net.py \
--config-file configs/referring_swin_base.yaml \
--num-gpus 8 --dist-url auto --eval-only \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [output_dir]
Firstly, download the backbone weights (swin_base_patch4_window12_384_22k.pkl
) and convert it into detectron2 format using the script:
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth
python tools/convert-pretrained-swin-model-to-d2.py swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
Then start training:
python train_net.py \
--config-file configs/referring_swin_base.yaml \
--num-gpus 8 --dist-url auto \
MODEL.WEIGHTS [path_to_weights] \
OUTPUT_DIR [path_to_weights]
Note: You can add your own configurations subsequently to the training command for customized options. For example:
SOLVER.IMS_PER_BATCH 48
SOLVER.BASE_LR 0.00001
For the full list of base configs, see configs/referring_R50.yaml
and configs/Base-COCO-InstanceSegmentation.yaml
Update: We have added supports for ResNet-50 and Swin-Tiny backbones! Feel free to use and report these resource-friendly models in your work.
Backbone | cIoU | gIoU |
---|---|---|
Resnet-50 | 39.53 | 38.62 |
Swin-Tiny | 57.73 | 56.86 |
Swin-Base | 62.42 | 63.60 |
All models can be downloaded from:
This project is based on refer, Mask2Former, Detectron2, VLT. Many thanks to the authors for their great works!
Please consider to cite GRES if it helps your research.
@inproceedings{GRES,
title={{GRES}: Generalized Referring Expression Segmentation},
author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
booktitle={CVPR},
year={2023}
}
@article{VLT,
title={{VLT}: Vision-language transformer and query generation for referring segmentation},
author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2023},
publisher={IEEE}
}
@inproceedings{MeViS,
title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
booktitle={ICCV},
year={2023}
}