Model Zoo

All model checkpoints are saved together with the clip_teacher weights, which are loaded from the CLIP vision encoder.
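
Since the checkpoints carry the clip_teacher weights, it may be convenient to drop them when only the student model is needed. The following is a minimal sketch, not the repository's own loading code. It assumes the teacher parameters are stored under keys that start with `clip_teacher.`; that prefix and the helper name `drop_teacher_weights` are illustrative assumptions. A plain dict stands in for a loaded state dict.

```python
def drop_teacher_weights(state_dict, prefix="clip_teacher."):
    """Return a copy of state_dict without entries whose key starts with prefix.

    The default prefix is an assumption about how the teacher weights are
    named inside the checkpoint; inspect the keys of a real checkpoint first.
    """
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}


# A toy stand-in for a loaded checkpoint (hypothetical key names).
toy_state_dict = {
    "clip_teacher.visual.proj": [0.1, 0.2],
    "vision_encoder.patch_embed.weight": [0.3],
    "text_encoder.embeddings.weight": [0.4],
}

student_only = drop_teacher_weights(toy_state_dict)
print(sorted(student_only))
```

With a real checkpoint, the dict would typically come from `torch.load(path, map_location="cpu")`, possibly nested under a key such as `"model"`; check the actual file layout before relying on this.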

Pretraining

We initialize the models from K710 pretraining (Stage 1) and further pretrain them on multimodal data (Stage 2). The settings below name the pretraining corpora:

  • 5M: CC3M + WebVid2M
  • 17M: CC3M + CC12M + COCO + VG + SBU + WebVid2M
  • 25M: CC3M + CC12M + COCO + VG + SBU + WebVid10M

| Model | Setting | Model | Script |
| --- | --- | --- | --- |
| UMT-B/16 | 5M | ckpt | script |
| UMT-B/16 | 17M | ckpt | script |
| UMT-B/16 | 25M | ckpt | script |
| UMT-L/16 | 5M | ckpt | script |
| UMT-L/16 | 17M | ckpt | script |
| UMT-L/16 | 25M | ckpt | script |

Zero-shot Evaluation

Each cell lists R@1 / R@5 / R@10. T2V is text-to-video retrieval and V2T is video-to-text retrieval.

| Dataset | Retrieval | UMT-B/16 5M | UMT-B/16 17M | UMT-B/16 25M | UMT-L/16 5M | UMT-L/16 17M | UMT-L/16 25M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSRVTT | T2V | 29.6 / 52.8 / 61.9 | 35.5 / 59.3 / 68.6 | 35.2 / 57.8 / 66.0 | 33.3 / 58.1 / 66.7 | 42.6 / 64.4 / 73.1 | 40.7 / 63.4 / 71.8 |
| | V2T | 26.2 / 46.7 / 54.9 | 31.6 / 53.5 / 64.1 | 30.3 / 50.7 / 61.4 | 30.2 / 51.3 / 61.6 | 38.6 / 59.8 / 69.6 | 37.1 / 58.7 / 68.9 |
| | Material | script | script | script | script | script | script |
| DiDeMo | T2V | 33.4 / 58.3 / 67.0 | 41.9 / 66.7 / 75.0 | 41.2 / 65.4 / 74.9 | 34.0 / 60.4 / 68.7 | 46.4 / 70.0 / 78.8 | 48.6 / 72.9 / 79.0 |
| | V2T | 32.0 / 58.7 / 68.2 | 40.3 / 66.6 / 75.8 | 40.8 / 67.7 / 76.7 | 36.2 / 60.0 / 68.6 | 46.5 / 72.2 / 79.5 | 49.9 / 74.8 / 81.4 |
| | Material | script | script | script | script | script | script |
| ActivityNet | T2V | 28.3 / 53.0 / 64.2 | 33.8 / 59.1 / 70.4 | 35.5 / 60.6 / 71.8 | 31.9 / 60.2 / 72.0 | 42.8 / 69.6 / 79.8 | 41.9 / 68.9 / 80.3 |
| | V2T | 25.9 / 50.2 / 61.7 | 31.6 / 56.2 / 67.9 | 32.8 / 57.6 / 69.2 | 30.0 / 59.1 / 71.3 | 40.7 / 67.6 / 78.6 | 39.4 / 66.8 / 78.3 |
| | Material | script | script | script | script | script | script |
| LSMDC | T2V | 16.8 / 30.5 / 37.6 | 18.1 / 33.1 / 40.0 | 19.1 / 33.4 / 42.2 | 20.0 / 37.2 / 43.7 | 25.2 / 43.0 / 50.5 | 24.9 / 41.7 / 51.8 |
| | V2T | 12.9 / 27.4 / 33.6 | 16.0 / 29.9 / 35.7 | 15.7 / 30.6 / 37.4 | 16.1 / 32.0 / 39.2 | 23.2 / 37.7 / 44.2 | 21.9 / 37.8 / 45.7 |
| | Material | script | script | script | script | script | script |
| MSVD | T2V | 36.2 / 65.7 / 76.1 | 41.4 / 70.6 / 80.1 | 42.3 / 71.7 / 80.8 | 44.4 / 73.3 / 82.4 | 49.9 / 77.7 / 85.3 | 49.0 / 76.9 / 84.7 |
| | V2T | 58.5 / 78.7 / 84.3 | 62.5 / 80.8 / 87.0 | 61.9 / 82.5 / 88.5 | 66.1 / 85.5 / 89.4 | 75.4 / 89.6 / 94.0 | 74.5 / 89.7 / 92.8 |
| | Material | script | script | script | script | script | script |
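
For readers unfamiliar with the R@K (Recall at K) metric reported here, the sketch below shows one common way to compute it from a query-by-candidate similarity matrix. It is an illustration under stated assumptions, not the evaluation code behind these numbers: it assumes exactly one ground-truth candidate per query, located on the diagonal, and the function name `recall_at_k` is hypothetical. Datasets with several captions per video need a ground-truth mapping instead of the diagonal.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Compute Recall@K in percent.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    n = len(sim)
    ranks = []
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher than the match.
        ranks.append(sum(1 for s in row if s > row[i]))
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


# Toy text-to-video similarities: rows are text queries, columns are videos.
t2v_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]

t2v = recall_at_k(t2v_sim, ks=(1, 2, 3))

# Video-to-text uses the transposed matrix: rows become video queries.
v2t_sim = [list(col) for col in zip(*t2v_sim)]
v2t = recall_at_k(v2t_sim, ks=(1, 2, 3))

print(t2v, v2t)
```

In the toy matrix, the second text query ranks its video third, so T2V R@1 is two out of three queries, while every video ranks its own text first in the transposed direction.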

Finetuning

Video-Text Retrieval

Each cell lists R@1 / R@5 / R@10. T2V is text-to-video retrieval and V2T is video-to-text retrieval.

| Dataset | Retrieval | UMT-B/16 5M | UMT-B/16 17M | UMT-B/16 25M | UMT-L/16 5M | UMT-L/16 17M | UMT-L/16 25M |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MSRVTT | T2V | 46.3 / 72.7 / 82.0 | 50.6 / 75.4 / 83.5 | 51.0 / 76.5 / 84.2 | 53.3 / 76.6 / 83.9 | 56.5 / 80.1 / 87.4 | 58.8 / 81.0 / 87.1 |
| | V2T | 44.4 / 72.8 / 80.7 | 49.4 / 76.7 / 83.5 | 49.0 / 77.0 / 84.7 | 51.4 / 76.3 / 82.8 | 56.7 / 79.6 / 86.7 | 58.6 / 81.6 / 86.5 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| DiDeMo | T2V | 54.8 / 83.0 / 89.0 | 60.8 / 85.1 / 91.0 | 61.6 / 86.8 / 91.5 | 59.7 / 84.9 / 90.8 | 66.6 / 89.9 / 93.7 | 70.4 / 90.1 / 93.5 |
| | V2T | 52.9 / 80.2 / 85.8 | 59.5 / 83.8 / 90.7 | 59.5 / 84.9 / 90.5 | 59.5 / 84.5 / 90.7 | 66.4 / 87.5 / 92.9 | 65.7 / 89.6 / 93.3 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| ActivityNet | T2V | 52.1 / 80.5 / 89.6 | 56.1 / 82.5 / 91.2 | 58.3 / 83.9 / 91.5 | 58.1 / 85.5 / 92.9 | 66.6 / 88.6 / 94.7 | 66.8 / 89.1 / 94.9 |
| | V2T | 50.0 / 79.8 / 88.2 | 54.6 / 82.1 / 91.1 | 56.0 / 83.5 / 91.7 | 55.4 / 84.4 / 92.9 | 64.3 / 87.8 / 94.8 | 64.4 / 89.1 / 94.8 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| LSMDC | T2V | 30.3 / 51.8 / 61.4 | 32.3 / 54.5 / 61.9 | 32.7 / 54.7 / 63.4 | 37.7 / 60.6 / 67.3 | 41.4 / 63.8 / 72.3 | 43.0 / 65.5 / 73.0 |
| | V2T | 29.8 / 52.2 / 60.5 | 31.5 / 53.6 / 61.9 | 32.7 / 53.5 / 63.2 | 36.2 / 58.9 / 65.7 | 40.3 / 63.1 / 71.1 | 41.4 / 64.3 / 71.5 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| MSVD | T2V | 47.4 / 76.8 / 84.0 | 49.6 / 78.5 / 85.7 | 50.8 / 79.7 / 86.2 | 53.7 / 80.5 / 86.8 | 57.4 / 83.0 / 88.5 | 58.2 / 83.9 / 89.6 |
| | V2T | 69.1 / 85.8 / 92.1 | 71.6 / 88.8 / 92.7 | 73.3 / 89.6 / 93.7 | 77.2 / 91.6 / 94.8 | 82.4 / 93.6 / 96.0 | 82.4 / 94.6 / 96.7 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| SSV2-label | T2V | 63.1 / 87.1 / 92.3 | 63.4 / 88.0 / 92.9 | 64.2 / 88.2 / 92.7 | 70.5 / 92.4 / 95.5 | 73.1 / 93.2 / 96.4 | 73.3 / 92.7 / 96.9 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |
| SSV2-template | T2V | 87.3 / 100 / 100 | 86.8 / 99.4 / 100 | 87.9 / 99.4 / 100 | 90.2 / 99.4 / 100 | 90.8 / 100 / 100 | 90.8 / 99.4 / 100 |
| | Material | script | script | script [ckpt] | script | script | script [ckpt] |

Video Question Answering

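| Dataset | UMT-B/16 5M | UMT-B/16 17M | UMT-B/16 25M | UMT-L/16 5M | UMT-L/16 17M | UMT-L/16 25M |
| --- | --- | --- | --- | --- | --- | --- |
| ActivityNet-QA | 43.5 | 44.9 | 44.8 | 45.1 | 47.3 | 47.9 |
| | script | script | script [ckpt] | script | script | script [ckpt] |
| MSRVTT-QA | 44.3 | 44.9 | 44.9 | 45.5 | 46.4 | 47.1 |
| | script | script | script [ckpt] | script | script | script [ckpt] |
| MSRVTT-MC | 95.9 | 96.3 | 96.3 | 96.8 | 97.7 | 97.3 |
| | script | script | script | script | script | script |
| MSVD-QA | 49.1 | 48.9 | 49.5 | 51.3 | 53.4 | 55.2 |
| | script | script | script [ckpt] | script | script | script [ckpt] |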