1. Pretrained Models from Mask2Former
We use the pretrained models of Mask2Former trained on the COCO panoptic segmentation dataset. If you want to train UniVS, please download them into the pretrained/m2f_panseg/ dir.
Backbone | Model id | Model |
---|---|---|
ResNet-50 | 47430278_4 | model |
Swin-Tiny | 48558700_1 | model |
Swin-Base | 48558700_7 | model |
Swin-Large | 47429163_0 | model |
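If you prefer scripting the download, a minimal Python sketch is given below; the URL and filename are placeholders, so substitute the actual "model" link from the table above.

```python
import os
import urllib.request

# Expected directory for the Mask2Former COCO panoptic checkpoints (see above).
target_dir = "pretrained/m2f_panseg"
os.makedirs(target_dir, exist_ok=True)

# Placeholder URL: replace it with the actual "model" link from the table.
url = "https://example.com/mask2former_coco_panoptic_r50.pkl"
dst = os.path.join(target_dir, os.path.basename(url))
if not os.path.exists(dst):
    urllib.request.urlretrieve(url, dst)
print("checkpoint saved to", dst)
```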
2.1 Download the pretrained model of the CLIP Text Encoder from Google Drive and put it in the dir 'pretrained/regionclip/regionclip/'.
a) Extract all category names that exist in the supported datasets; please refer to datasets/concept_emb/combined_datasets.txt and datasets/concept_emb/combined_datasets_category_info.py.
b) Extract concept embeddings of the category names from the CLIP Text Encoder. For your convenience, you can directly download the converted file from Google Drive and put it in the dir 'datasets/concept_emb/'.
c) Alternatively, you can generate the concept embeddings on your own server:

```
$ cd UniVS
$ sh tools/clip_concept_extraction/extract_concept_emb.sh
```
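For reference, the sketch below shows the general idea behind the extraction script: encode each category name with a CLIP text encoder and save the resulting embeddings. It is only a sketch, assuming the open-source `clip` package with a standard RN50x4 model and a hypothetical output filename; the actual script uses the RegionCLIP weights from 'pretrained/regionclip/regionclip/' and the category list in datasets/concept_emb/combined_datasets.txt.

```python
import os
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Sketch only: plain CLIP RN50x4 as a stand-in for the RegionCLIP text encoder used by UniVS.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50x4", device=device)

# In the real script, the names come from datasets/concept_emb/combined_datasets.txt.
category_names = ["person", "car", "dog"]
prompts = [f"a photo of a {name}" for name in category_names]

with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    emb = model.encode_text(tokens)             # (num_categories, embed_dim)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize the text embeddings

os.makedirs("datasets/concept_emb", exist_ok=True)
torch.save(emb.cpu(), "datasets/concept_emb/demo_concept_emb.pth")  # hypothetical filename
```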
UniVS achieves superior performance on 10 benchmarks using a single model with the same set of parameters. UniVS has three training stages: image-level joint training, video-level joint training, and long video-level joint training. We provide checkpoints of all stages for models with different backbones. If you want to evaluate UniVS at different stages, please download them into the corresponding 'output/stage{1,2,3}/' dirs.
Stage 2 (video-level joint training): note that the input images for the Swin-Tiny/Base/Large backbones must have a shape of 1024 x 1024.
Backbone | YAML | Input | Model |
---|---|---|---|
ResNet-50 | univs_r50_stage2 | Any resolution | model |
Swin-Tiny | univs_swint_stage2 | 1024 x 1024 | model |
Swin-Base | univs_swinb_stage2 | 1024 x 1024 | model |
Swin-Large | univs_swinl_stage2 | 1024 x 1024 | model |
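As a quick illustration of this constraint, the hedged snippet below resizes a frame to 1024 x 1024 before inference with a Swin-backbone stage-2 checkpoint (OpenCV and the example filename are assumptions; the provided configs handle this preprocessing for you).

```python
import cv2

# Stage-2 Swin-Tiny/Base/Large checkpoints expect 1024 x 1024 inputs (see the table above).
frame = cv2.imread("example_frame.jpg")  # hypothetical input frame
resized = cv2.resize(frame, (1024, 1024), interpolation=cv2.INTER_LINEAR)
print(resized.shape)  # -> (1024, 1024, 3)
```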
Stage 3 (long video-level joint training): all numbers reported in the paper use the following models. This stage supports input images of any aspect ratio, and the results are better when the shorter side is between 512 and 720 pixels.
Backbone | YAML | Input | Model |
---|---|---|---|
ResNet-50 | univs_r50_stage3 | Any resolution | model |
Swin-Tiny | univs_swint_stage3 | Any resolution | model |
Swin-Base | univs_swinb_stage3 | Any resolution | model |
Swin-Large | univs_swinl_stage3 | Any resolution | model |
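If you preprocess frames yourself, a minimal aspect-preserving resize that keeps the shorter side in the recommended 512-720 range could look like the sketch below (the 640-pixel target and OpenCV usage are assumptions, not part of UniVS).

```python
import cv2

def resize_short_side(frame, target_short=640):
    """Resize so the shorter side is ~640 px (inside the recommended 512-720 range),
    keeping the aspect ratio. The exact target value is a free choice."""
    h, w = frame.shape[:2]
    scale = target_short / min(h, w)
    return cv2.resize(frame, (round(w * scale), round(h * scale)))

frame = cv2.imread("example_frame.jpg")  # hypothetical input frame
print(resize_short_side(frame).shape)
```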
If the ground truth for the validation set has already been released, as in the DAVIS and VSPW benchmarks, we use the first 20% of the validation videos as the development set to enable rapid inference. If the ground truth for the validation set has not been released, as in the YouTube-VIS benchmark, we divide the training set into two parts: a training split (train_sub.json) and a development split (valid_sub.json). In this case, the videos in the development set are not seen during training.
For your convenience in evaluation, we provide the converted development annotation files, which you can download from here. After downloading, please unzip the file and store it according to the following structure:
```
datasets/
|---ytvis_2021/
    |---train.json
    |---train_sub.json (90% videos in training set)
    |---valid_sub.json (10% videos in training set)
    |---train/
        |---JPEGImages/
    |---valid/
        |---JPEGImages/
|---ovis/
    |---train.json
    |---train_sub.json (90% videos in training set)
    |---valid_sub.json (10% videos in training set)
    |---train/
        |---JPEGImages/
    |---valid/
        |---JPEGImages/
|---VSPW_480p/
    |---val_cocovid.json
    |---dev_cocovid.json (first 50 videos in val set, only for debug)
    |---data/
|---vipseg/
    |---VIPSeg_720P/
        |---panoptic_gt_VIPSeg_val_cocovid.json
        |---panoptic_gt_VIPSeg_val_sub_cocovid.json
        |---imgs/
        |---panomasksRGB/
|---DAVIS/
    |---2017_val.json
    |---JPEGImages/
        |---Full-Resolution/
|---viposeg/
    |---valid/
        |---valid_cocovid.json
        |---dev_cocovid.json
        |---JPEGImages/
|---ref-davis/
    |---valid_0.json
    |---valid_1.json
    |---valid_2.json
    |---valid_3.json
    |---valid/
        |---JPEGImages/
```
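The provided train_sub.json / valid_sub.json files should be used directly; purely for illustration, a 90%/10% split of a COCO-style video annotation file could be produced as in the sketch below (it assumes the standard YouTube-VIS layout with top-level "videos", "annotations", and "categories" fields, and the output filenames are hypothetical).

```python
import json
import random

# Illustration only: UniVS already ships converted train_sub.json / valid_sub.json files.
with open("datasets/ytvis_2021/train.json") as f:
    data = json.load(f)

videos = list(data["videos"])
random.seed(0)
random.shuffle(videos)
n_train = int(0.9 * len(videos))
splits = {"train_sub_demo.json": videos[:n_train], "valid_sub_demo.json": videos[n_train:]}

for out_name, video_list in splits.items():
    ids = {v["id"] for v in video_list}
    subset = {
        "videos": video_list,
        "annotations": [a for a in data["annotations"] if a["video_id"] in ids],
        "categories": data["categories"],
    }
    with open(out_name, "w") as f:
        json.dump(subset, f)
```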
The results evaluated on the development sets are presented below. You can obtain these results by running the test scripts located in the tools/test/ directory.
Note that the results from this section are solely applicable for code debugging and should not be used for performance comparison with other methods in the paper.