[NeurIPS 2024] MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
Official implementation of MoTE, an effective Visual-Language to video knowledge transfer framework that achieves both strong generalization and specialization, striking a favorable trade-off between close-set and zero-shot performance within a single unified model.
- PyTorch (2.0.1 recommended)
- RandAugment
- pprint
- tqdm
- dotmap
- yaml
- csv
- decord
- To download the Kinetics-400 and Kinetics-600 datasets, you can refer to mmaction2 or CVDF.
- For UCF-101 and HMDB-51, you can download them from their official websites.
- We rescale all videos to a height of 256 pixels. This is not required, but it saves a lot of storage space and speeds up IO.
By default, we decode the videos on the fly using decord.
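As a rough illustration of on-the-fly decoding (not code from this repo), the sketch below uniformly samples 8 frames from a video with decord; the helper name `sample_frames` and the example path are hypothetical.

```python
# Minimal sketch: uniformly sample frames from a video with decord.
# `sample_frames` and the example path are illustrative only.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, num_frames=8):
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8

frames = sample_frames("abseiling/aaa.mp4")
```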
Example of an annotation file:
abseiling/aaa.mp4 0
abseiling/bbb.mp4 0
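For reference, a minimal (hypothetical) way to parse such an annotation file, where each line is `relative/path.mp4 label_id`:

```python
# Hypothetical helper for reading "path label" annotation lines; the actual
# dataloader in this repo may parse the list differently.
def load_annotations(list_file):
    samples = []
    with open(list_file) as f:
        for line in f:
            path, label = line.strip().rsplit(" ", 1)
            samples.append((path, int(label)))
    return samples
```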
(Optional) We can also extract videos into frames for fast reading. Please refer to the BIKE repo for the detailed steps.
We train one unified model for both close-set and zero-shot video recognition tasks. The corresponding results, checkpoints, and configs are listed in the tables below.
Close-set results on Kinetics-400:
Architecture | Input | Views | Top-1 (%) | Top-5 (%) | Checkpoint | Config |
---|---|---|---|---|---|---|
ViT-B/16 | 8x224 | 1x1 | 81.8 | 95.9 | MoTE_B16 | config |
ViT-B/16 | 8x224 | 4x3 | 83.0 | 96.3 | MoTE_B16 | config |
ViT-L/14 | 8x224 | 4x3 | 86.8 | 97.5 | MoTE_L14 | config |
ViT-L/14 | 16x224 | 4x3 | 87.2 | 97.7 | - | - |
Zero-shot results (top-1 accuracy, %):
Architecture | Input | Views | UCF-101 | HMDB-51 | Kinetics-600 | Config |
---|---|---|---|---|---|---|
ViT-B/16 | 8x224 | 3x1 | 83.4 | 55.8 | 70.2 | UCF/HMDB/K600 |
ViT-L/14 | 8x224 | 3x1 | 88.7 | 61.4 | 78.4 | UCF/HMDB/K600 |
By default, we train our model on Kinetics-400 on a single machine.
# We train the ViT-L/14 model using 6 layers of MoTE, with 4 temporal experts per layer.
bash scripts/run_train.sh configs/k400/k400_train_video_vitl-14-f8.yaml
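The comment above refers to 6 MoTE layers with 4 temporal experts each. For intuition only, the toy module below shows the general shape of a layer with several parallel temporal experts whose outputs are fused by simple averaging; it is not the repository's MoTE implementation, and the class name, fusion rule, and hyper-parameters are assumptions.

```python
import torch
import torch.nn as nn

class ToyTemporalExperts(nn.Module):
    """Toy stand-in for a layer with multiple temporal experts (illustrative
    only; MoTE's actual routing/merging differs)."""

    def __init__(self, dim=1024, hidden=4096, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, frames, dim) token features
        # Every expert sees the same tokens; here we simply average their outputs.
        return torch.stack([e(x) for e in self.experts]).mean(dim=0)

out = ToyTemporalExperts()(torch.randn(2, 8, 1024))  # (2, 8, 1024)
```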
Close-set evaluation. We adopt a single-view (1 clip x 1 crop) or multi-view (4 clips x 3 crops) evaluation protocol with 8 frames per view.
# Single-view evaluation
bash scripts/run_test.sh configs/k400/k400_train_video_vitl-14-f8.yaml MoTE_L14.pt
# Multi-view evaluation
bash scripts/run_test.sh configs/k400/k400_train_video_vitl-14-f8.yaml MoTE_L14.pt --test_clips 4 --test_crops 3
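The `--test_clips` / `--test_crops` flags control multi-view testing. Conceptually, multi-view evaluation simply averages the prediction scores over all views of a video, roughly as in this sketch (the function name and shapes are illustrative, not the repo's testing code):

```python
import torch

def multi_view_predict(model, views):
    """Fuse predictions over all views of one video (illustrative only).
    views: (num_views, C, T, H, W); with 4 clips x 3 crops, num_views = 12."""
    with torch.no_grad():
        probs = model(views).softmax(dim=-1)      # (num_views, num_classes)
    return probs.mean(dim=0).argmax().item()      # average scores, pick top class
```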
Zero-shot evaluation. We use 3 x 1 views (3 clips x 1 crop) with 8 frames per view.
# UCF-101
bash scripts/run_test_zeroshot.sh configs/ucf101/ucf_split1.yaml MoTE_L14.pt --test_clips 3
# HMDB-51
bash scripts/run_test_zeroshot.sh configs/hmdb51/hmdb_split1.yaml MoTE_L14.pt --test_clips 3
# Kinetics-600
bash scripts/run_test_zeroshot.sh configs/k600/k600_zs_test_split1.yaml MoTE_L14.pt --test_clips 3
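For background, zero-shot recognition in CLIP-based video models follows the standard recipe of matching a video embedding against text embeddings of the class names. The sketch below illustrates that recipe with the vanilla CLIP package and simple mean pooling over frames; the prompt template, pooling, and class subset are assumptions, and the actual evaluation is performed by the scripts above with the MoTE checkpoint.

```python
# Illustrative CLIP-style zero-shot recipe (not the repo's evaluation code).
import numpy as np
import torch
import clip
from PIL import Image
from decord import VideoReader, cpu

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Sample 8 frames from one example video.
vr = VideoReader("abseiling/aaa.mp4", ctx=cpu(0))
idx = np.linspace(0, len(vr) - 1, 8).astype(int)
frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

class_names = ["abseiling", "archery", "arm wrestling"]  # example class subset
text = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text).float()
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

    images = torch.stack([preprocess(f) for f in frames]).to(device)
    video_feat = model.encode_image(images).float().mean(dim=0, keepdim=True)
    video_feat /= video_feat.norm(dim=-1, keepdim=True)

print(class_names[(video_feat @ text_feat.t()).argmax().item()])
```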
If you find our work useful, please consider citing our paper with the following BibTeX entry.
@Article{MoTE,
title={MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer},
author={Zhu, Minghao and Wang, Zhengpu and Hu, Mengxian and Dang, Ronghao and Lin, Xiao and Zhou, Xun and Liu, Chengju and Chen, Qijun},
journal={arXiv preprint arXiv:2410.10589},
year={2024}
}
Our code builds on BIKE and CLIP. Thanks to them for their excellent work!