
# UniVS MODEL ZOO

## 1. Pretrained Models from Mask2Former

We use the pretrained Mask2Former models trained on the COCO panoptic segmentation dataset.

If you want to train UniVS, please download them into the `pretrained/m2f_panseg/` directory.

| Backbone | Model id | Model |
| --- | --- | --- |
| ResNet-50 | 47430278_4 | model |
| Swin-Tiny | 48558700_1 | model |
| Swin-Base | 48558700_7 | model |
| Swin-Large | 47429163_0 | model |

## 2. Prepare CLIP Text Encoder & Category Embeddings

2.1 Download the pretrained CLIP Text Encoder model from Google Drive and put it in the `pretrained/regionclip/regionclip/` directory.

2.2 Prepare Category Embeddings for UniVS

a) Extract all category names that exist in the datasets used by UniVS; please refer to `datasets/concept_emb/combined_datasets.txt` and `datasets/concept_emb/combined_datasets_category_info.py`.

b) Extract concept embeddings of the category names with the CLIP Text Encoder. For your convenience, you can directly download the converted file from Google Drive and put it in the `datasets/concept_emb/` directory.

c) Alternatively, you can run the following script to generate the embeddings on your own server:

```bash
$ cd UniVS
$ sh tools/clip_concept_extraction/extract_concept_emb.sh
```
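For reference, the sketch below shows what step b) does at a high level: encoding category names with a CLIP text encoder and caching the resulting embeddings. It uses the standard OpenAI CLIP weights from Hugging Face rather than the RegionCLIP checkpoint used by UniVS, and the prompt template and output file name are assumptions; the official script above is the authoritative version.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumption: standard OpenAI CLIP, not the RegionCLIP text encoder used by UniVS.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical category list; the real names come from combined_datasets.txt.
names = ["person", "dog", "car"]
inputs = tokenizer([f"a photo of a {n}" for n in names],
                   padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model.get_text_features(**inputs)           # (num_names, 512)
emb = emb / emb.norm(dim=-1, keepdim=True)            # L2-normalize, as CLIP does
torch.save(emb, "datasets/concept_emb/example_concept_emb.pth")  # placeholder file name
```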

## 3. UniVS Models

UniVS achieves superior performance on 10 benchmarks using a single model with a single set of parameters. UniVS has three training stages: image-level joint training, video-level joint training, and long-video-level joint training. We provide checkpoints from all stages for models with different backbones.

If you want to evaluate UniVS at a given stage, please download the corresponding checkpoints into the `output/stage{1,2,3}/` directories.

### Stage 2: Video-level Joint Training

Note that the input images for the Swin-Tiny/Base/Large backbones must have a shape of 1024 × 1024 (a preprocessing sketch follows the table).

| Backbone | YAML | Input | Model |
| --- | --- | --- | --- |
| ResNet-50 | univs_r50_stage2 | Any resolution | model |
| Swin-Tiny | univs_swint_stage2 | 1024 × 1024 | model |
| Swin-Base | univs_swinb_stage2 | 1024 × 1024 | model |
| Swin-Large | univs_swinl_stage2 | 1024 × 1024 | model |
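A minimal preprocessing sketch for the 1024 × 1024 requirement is shown below. It assumes a resize-longest-side-then-pad scheme; the exact resizing, padding value, and normalization are defined by the repo's configs and dataset mapper, so treat this only as an illustration.

```python
import torch
import torch.nn.functional as F

def to_square_1024(frame: torch.Tensor) -> torch.Tensor:
    """Resize so the longer side is 1024, then zero-pad to 1024 x 1024.

    `frame` is a (C, H, W) float tensor. This is an assumed preprocessing;
    check the repo's dataset mapper / config for the exact scheme used by
    the Swin checkpoints.
    """
    c, h, w = frame.shape
    scale = 1024.0 / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    frame = F.interpolate(frame[None], size=(new_h, new_w),
                          mode="bilinear", align_corners=False)[0]
    # Pad on the right/bottom so the result is exactly 1024 x 1024.
    return F.pad(frame, (0, 1024 - new_w, 0, 1024 - new_h))
```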

### Stage 3: Long Video-level Joint Training

All numbers reported in the paper use the following models. This stage supports input images of any aspect ratio, and results are better when the short side is between 512 and 720 pixels (see the resize sketch after the table).

| Backbone | YAML | Input | Model |
| --- | --- | --- | --- |
| ResNet-50 | univs_r50_stage3 | Any resolution | model |
| Swin-Tiny | univs_swint_stage3 | Any resolution | model |
| Swin-Base | univs_swinb_stage3 | Any resolution | model |
| Swin-Large | univs_swinl_stage3 | Any resolution | model |
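If you want to follow the 512-720 short-side recommendation, a simple clamping rule such as the one below can be used to pick a test resolution. The rule itself is an assumption for illustration, not part of the official inference pipeline.

```python
def clamp_short_side(h: int, w: int, lo: int = 512, hi: int = 720) -> tuple[int, int]:
    """Return (new_h, new_w) whose short side lies in [lo, hi].

    The 512-720 range comes from the note above; the clamping rule is only
    an assumed way to choose a test resolution.
    """
    short = min(h, w)
    target = min(max(short, lo), hi)   # clamp the short side into [lo, hi]
    scale = target / short
    return round(h * scale), round(w * scale)
```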

## 4. Inference of UniVS on the Development Sets

If the ground truth for the validation set has already been released, as is the case for the DAVIS and VSPW benchmarks, we use the first 20% of the validation videos as the development set to enable rapid inference. If the ground truth for the validation set has not been released, as for the YouTube-VIS benchmark, we divide the training set into two parts: a training split (train_sub.json) and a development split (valid_sub.json). In this scenario, the data in the development set is never seen during training.
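The released train_sub.json / valid_sub.json files already implement this 90%/10% split (see the directory layout below). Purely as an illustration of how such a split can be produced from a COCO-style video annotation file, a sketch is shown here; the output file names are placeholders and a random split like this will not match the official one.

```python
import json
import random

def split_cocovid(ann_file: str, dev_ratio: float = 0.1, seed: int = 0) -> None:
    """Split a COCO-style video annotation file into train/dev subsets.

    Illustration only: the official train_sub.json / valid_sub.json were
    generated by the UniVS authors and will not match this random split.
    """
    with open(ann_file) as f:
        coco = json.load(f)

    videos = list(coco["videos"])
    random.Random(seed).shuffle(videos)
    n_dev = max(1, int(len(videos) * dev_ratio))
    dev_ids = {v["id"] for v in videos[:n_dev]}

    def subset(keep_dev: bool) -> dict:
        vids = [v for v in coco["videos"] if (v["id"] in dev_ids) == keep_dev]
        ids = {v["id"] for v in vids}
        anns = [a for a in coco.get("annotations", []) if a["video_id"] in ids]
        return {**coco, "videos": vids, "annotations": anns}

    with open("train_sub_example.json", "w") as f:   # placeholder name
        json.dump(subset(False), f)
    with open("valid_sub_example.json", "w") as f:   # placeholder name
        json.dump(subset(True), f)
```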

For your convenience in evaluation, we provide the converted development annotation files, which you can download from here. After downloading, please unzip the file and organize it according to the following structure:

```
datasets/
  |---ytvis_2021/
    |---train.json
    |---train_sub.json  (90% videos in training set)
    |---valid_sub.json  (10% videos in training set)
    |---train/
      |---JPEGImages/
    |---valid/
      |---JPEGImages/

  |---ovis/
    |---train.json
    |---train_sub.json  (90% videos in training set)
    |---valid_sub.json  (10% videos in training set)
    |---train/
      |---JPEGImages/
    |---valid/
      |---JPEGImages/

  |---VSPW_480p/
    |---val_cocovid.json
    |---dev_cocovid.json  (first 50 videos in val set, only for debug)
    |---data/

  |---vipseg/
    |---VIPSeg_720P/
      |---panoptic_gt_VIPSeg_val_cocovid.json
      |---panoptic_gt_VIPSeg_val_sub_cocovid.json
      |---imgs/
      |---panomasksRGB/

  |---DAVIS/
    |---2017_val.json
    |---JPEGImages/
      |---Full-Resolution/

  |---viposeg/
    |---valid/
      |---valid_cocovid.json
      |---dev_cocovid.json
      |---JPEGImages/

  |---ref-davis/
    |---valid_0.json
    |---valid_1.json
    |---valid_2.json
    |---valid_3.json
    |---valid/
      |---JPEGImages/
```

The results evaluated on the development sets are presented below. You can obtain these results by running the test scripts located in the `tools/test/` directory.

Note that the results in this section are intended only for code debugging and should not be used for performance comparisons with other methods in the paper.