
# UniVS MODEL ZOO

## 1. Pretrained Models from Mask2Former

We use the pretrained Mask2Former models trained on the COCO panoptic segmentation dataset.

If you want to train UniVS, please download them into the `pretrained/m2f_panseg/` directory.

| Backbone | Model id | Model |
| --- | --- | --- |
| ResNet-50 | 47430278_4 | model |
| Swin-Tiny | 48558700_1 | model |
| Swin-Base | 48558700_7 | model |
| Swin-Large | 47429163_0 | model |

## 2. Prepare CLIP Text Encoder & Category Embeddings

2.1 Download the pretrained CLIP Text Encoder model from Google Drive and put it in the `pretrained/regionclip/regionclip/` directory.

2.2 Prepare Category Embeddings for UniVS

a) Extract all category names that exist in the datasets used by UniVS; please refer to `datasets/concept_emb/combined_datasets.txt` and `datasets/concept_emb/combined_datasets_category_info.py`.

b) Extract concept embeddings of the category names with the CLIP Text Encoder. For your convenience, you can directly download the converted file from Google Drive and put it in the `datasets/concept_emb/` directory.

c) Alternatively, you can run the following script to generate the embeddings on your own server:

```bash
$ cd UniVS
$ sh tools/clip_concept_extraction/extract_concept_emb.sh
```
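For reference, the sketch below shows what step b) does at a high level: encoding category names with a CLIP text encoder and caching the resulting embeddings. It uses the standard OpenAI CLIP weights from Hugging Face rather than the RegionCLIP checkpoint used by UniVS, and the prompt template and output file name are assumptions; the official script above is the authoritative version.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumption: standard OpenAI CLIP, not the RegionCLIP text encoder used by UniVS.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical category list; the real names come from combined_datasets.txt.
names = ["person", "dog", "car"]
inputs = tokenizer([f"a photo of a {n}" for n in names],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)           # (num_names, 512)
emb = emb / emb.norm(dim=-1, keepdim=True)            # L2-normalize, as CLIP does
torch.save(emb, "datasets/concept_emb/example_concept_emb.pth")  # placeholder file name
```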

## 3. UniVS Models

UniVS achieves superior performance on 10 benchmarks using a single model with a single set of parameters. UniVS has three training stages: image-level joint training, video-level joint training, and long-video-level joint training. We provide checkpoints from all stages for models with different backbones.

If you want to evaluate UniVS at a given stage, please download the corresponding checkpoints into the `output/stage{1,2,3}/` directories.

### Stage 2: Video-level Joint Training

Note that the input images for the Swin-Tiny/Base/Large backbones must have a shape of 1024 × 1024 (a preprocessing sketch follows the table).

| Backbone | YAML | Input | Model |
| --- | --- | --- | --- |
| ResNet-50 | univs_r50_stage2 | Any resolution | model |
| Swin-Tiny | univs_swint_stage2 | 1024 × 1024 | model |
| Swin-Base | univs_swinb_stage2 | 1024 × 1024 | model |
| Swin-Large | univs_swinl_stage2 | 1024 × 1024 | model |
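A minimal preprocessing sketch for the 1024 × 1024 requirement is shown below. It assumes a resize-longest-side-then-pad scheme; the exact resizing, padding value, and normalization are defined by the repo's configs and dataset mapper, so treat this only as an illustration.

```python
import torch
import torch.nn.functional as F

def to_square_1024(frame: torch.Tensor) -> torch.Tensor:
    """Resize so the longer side is 1024, then zero-pad to 1024 x 1024.

    `frame` is a (C, H, W) float tensor. This is an assumed preprocessing;
    check the repo's dataset mapper / config for the exact scheme used by
    the Swin checkpoints.
    """
    c, h, w = frame.shape
    scale = 1024.0 / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    frame = F.interpolate(frame[None], size=(new_h, new_w),
                          mode="bilinear", align_corners=False)[0]
    # Pad on the right/bottom so the result is exactly 1024 x 1024.
    return F.pad(frame, (0, 1024 - new_w, 0, 1024 - new_h))
```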

### Stage 3: Long Video-level Joint Training

All numbers reported in the paper use the following models. This stage supports input images of any aspect ratio, and results are better when the short side is between 512 and 720 pixels (see the resize sketch after the table).

| Backbone | YAML | Input | Model |
| --- | --- | --- | --- |
| ResNet-50 | univs_r50_stage3 | Any resolution | model |
| Swin-Tiny | univs_swint_stage3 | Any resolution | model |
| Swin-Base | univs_swinb_stage3 | Any resolution | model |
| Swin-Large | univs_swinl_stage3 | Any resolution | model |
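If you want to follow the 512-720 short-side recommendation, a simple clamping rule such as the one below can be used to pick a test resolution. The rule itself is an assumption for illustration, not part of the official inference pipeline.

```python
def clamp_short_side(h: int, w: int, lo: int = 512, hi: int = 720) -> tuple[int, int]:
    """Return (new_h, new_w) whose short side lies in [lo, hi].

    The 512-720 range comes from the note above; the clamping rule is only
    an assumed way to choose a test resolution.
    """
    short = min(h, w)
    target = min(max(short, lo), hi)   # clamp the short side into [lo, hi]
    scale = target / short
    return round(h * scale), round(w * scale)
```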

## 4. Inference of UniVS on the Development Sets

If the ground truth for the validation set has already been released, as is the case for the DAVIS and VSPW benchmarks, we use the first 20% of the validation videos as the development set to enable rapid inference. If the ground truth for the validation set has not been released, as for the YouTube-VIS benchmark, we divide the training set into two parts: a training split (train_sub.json) and a development split (valid_sub.json). In this scenario, the data in the development set is never seen during training.
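The released train_sub.json / valid_sub.json files already implement this 90%/10% split (see the directory layout below). Purely as an illustration of how such a split can be produced from a COCO-style video annotation file, a sketch is shown here; the output file names are placeholders and a random split like this will not match the official one.

```python
import json
import random

def split_cocovid(ann_file: str, dev_ratio: float = 0.1, seed: int = 0) -> None:
    """Split a COCO-style video annotation file into train/dev subsets.

    Illustration only: the official train_sub.json / valid_sub.json were
    generated by the UniVS authors and will not match this random split.
    """
    with open(ann_file) as f:
        coco = json.load(f)

    videos = list(coco["videos"])
    random.Random(seed).shuffle(videos)
    n_dev = max(1, int(len(videos) * dev_ratio))
    dev_ids = {v["id"] for v in videos[:n_dev]}

    def subset(keep_dev: bool) -> dict:
        vids = [v for v in coco["videos"] if (v["id"] in dev_ids) == keep_dev]
        ids = {v["id"] for v in vids}
        anns = [a for a in coco.get("annotations", []) if a["video_id"] in ids]
        return {**coco, "videos": vids, "annotations": anns}

    with open("train_sub_example.json", "w") as f:   # placeholder name
        json.dump(subset(False), f)
    with open("valid_sub_example.json", "w") as f:   # placeholder name
        json.dump(subset(True), f)
```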

For your convenience in evaluation, we provide the converted development annotation files, which you can download from here. After downloading, please unzip the file and organize it according to the following structure:

```
datasets/
  |---ytvis_2021/
    |---train.json
    |---train_sub.json  (90% videos in training set)
    |---valid_sub.json  (10% videos in training set)
    |---train/
      |---JPEGImages/
    |---valid/
      |---JPEGImages/

  |---ovis/
    |---train.json
    |---train_sub.json  (90% videos in training set)
    |---valid_sub.json  (10% videos in training set)
    |---train/
      |---JPEGImages/
    |---valid/
      |---JPEGImages/

  |---VSPW_480p/
    |---val_cocovid.json
    |---dev_cocovid.json  (first 50 videos in val set, only for debug)
    |---data/

  |---vipseg/
    |---VIPSeg_720P/
      |---panoptic_gt_VIPSeg_val_cocovid.json
      |---panoptic_gt_VIPSeg_val_sub_cocovid.json
      |---imgs/
      |---panomasksRGB/

  |---DAVIS/
    |---2017_val.json
    |---JPEGImages/
      |---Full-Resolution/

  |---viposeg/
    |---valid/
      |---valid_cocovid.json
      |---dev_cocovid.json
      |---JPEGImages/

  |---ref-davis/
    |---valid_0.json
    |---valid_1.json
    |---valid_2.json
    |---valid_3.json
    |---valid/
      |---JPEGImages/
```

The results evaluated on the development sets are presented below. You can obtain these results by running the test scripts located in the `tools/test/` directory.

Note that the results in this section are intended only for code debugging and should not be used for performance comparisons with other methods in the paper.