Model Zoo

All model weights are saved together with the clip_teacher weights, which are loaded from the CLIP vision encoder.
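Since the checkpoints bundle the CLIP teacher with the model, it can be handy to drop the teacher parameters when only the pretrained model itself is needed. The following is a minimal sketch, not the repository's loading code. It assumes the teacher parameters sit under keys that begin with `clip_teacher.` (an assumption about the key layout); `strip_teacher` is a hypothetical helper, the key names in the toy checkpoint are made up, and a plain dict stands in for a loaded state dict.

```python
def strip_teacher(state_dict, prefix="clip_teacher."):
    """Return a copy of state_dict without entries whose key starts with prefix.

    The default prefix is an assumption about how the teacher is stored.
    """
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}


# Toy stand-in for a loaded checkpoint. The key names are hypothetical and
# the values would be tensors in a real checkpoint.
checkpoint = {
    "clip_teacher.proj": [0.1, 0.2],
    "vision_encoder.patch_embed.weight": [0.3],
    "text_encoder.embeddings.weight": [0.4],
}

student_only = strip_teacher(checkpoint)
print(sorted(student_only))
```

With a real checkpoint, inspect the actual key names first and adjust `prefix` accordingly.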

Pretraining

We initialize these models from K400 masked pretraining and further pretrain them on multimodal data.

  • 5M: CC3M + WebVid2M
  • 17M: CC3M + CC12M + COCO + VG + SBU + WebVid2M
  • 25M: CC3M + CC12M + COCO + VG + SBU + WebVid10M

| Model | Setting | Checkpoint | Script |
| --- | --- | --- | --- |
| VideoMamba-M | 5M | aliyun, 🤗HF | script |
| VideoMamba-M | 17M | aliyun, 🤗HF | script |
| VideoMamba-M | 25M | aliyun, 🤗HF | script |

Zero-shot Evaluation

Results are for VideoMamba-M. Each cell lists R@1 / R@5 / R@10.

| Dataset | Retrieval | 5M | 17M | 25M |
| --- | --- | --- | --- | --- |
| MSRVTT | T2V | 32.0 / 53.1 / 63.6 | 34.7 / 58.9 / 68.0 | 35.6 / 58.1 / 69.5 |
| MSRVTT | V2T | 28.2 / 47.6 / 56.5 | 29.5 / 49.9 / 60.1 | 29.1 / 51.6 / 62.2 |
| DiDeMo | T2V | 36.6 / 61.7 / 70.3 | 42.0 / 67.3 / 76.8 | 43.1 / 68.1 / 77.7 |
| DiDeMo | V2T | 38.3 / 64.7 / 73.3 | 42.3 / 68.2 / 76.9 | 43.8 / 69.7 / 77.8 |
| ActivityNet | T2V | 35.9 / 61.1 / 72.3 | 40.1 / 65.7 / 76.1 | 41.0 / 67.5 / 77.8 |
| ActivityNet | V2T | 32.8 / 58.8 / 69.9 | 34.2 / 61.8 / 73.2 | 37.1 / 65.0 / 75.1 |
| LSMDC | T2V | 18.0 / 36.1 / 43.4 | 18.4 / 35.3 / 43.0 | 20.4 / 37.1 / 45.7 |
| LSMDC | V2T | 15.9 / 31.0 / 39.2 | 16.5 / 32.1 / 40.0 | 17.9 / 34.6 / 42.1 |
| MSVD | T2V | 38.0 / 68.6 / 79.0 | 40.3 / 70.0 / 79.7 | 42.0 / 71.6 / 81.2 |
| MSVD | V2T | 57.5 / 79.9 / 85.4 | 61.8 / 81.0 / 87.0 | 62.7 / 82.8 / 87.6 |
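
The R@1, R@5, and R@10 figures are Recall@K values: the percentage of queries whose correct match appears among the top K retrieved candidates. T2V ranks videos for each text query, and V2T ranks texts for each video query. The sketch below shows one common way to compute the metric from a similarity matrix. It is an illustration, not the evaluation script used for these results, and it assumes a one-to-one pairing in which query i matches candidate i (datasets with several captions per video need a different ground-truth mapping). `recall_at_k` and `transpose` are hypothetical helper names.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K in percent for a square similarity matrix.

    sim[i][j] is the similarity of query i to candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        ranks.append(sum(1 for s in row if s > row[i]))
    n = len(ranks)
    return {k: 100.0 * sum(r < k for r in ranks) / n for k in ks}


def transpose(sim):
    """Swap queries and candidates (text-to-video becomes video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy text-to-video similarity matrix: rows are texts, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.5],
]

t2v = recall_at_k(sim, ks=(1, 2))
v2t = recall_at_k(transpose(sim), ks=(1, 2))
print("T2V:", t2v)
print("V2T:", v2t)
```

In this toy matrix only the first query ranks its match first, so R@1 is one third of the queries in both directions, while every match lands in the top two.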