[NeurIPS' 2024] MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer


Overview

Official implementation of MoTE, an effective visual-language-to-video knowledge transfer framework that enjoys both superior generalization and specialization, striking an optimal trade-off between close-set and zero-shot performance in one unified model.

📬 Requirements

  • PyTorch (2.0.1 recommended)
  • RandAugment
  • pprint
  • tqdm
  • dotmap
  • yaml
  • csv
  • decord
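
As a reference, the dependencies can be installed with pip. The commands below are only a sketch: the exact package names and versions are assumptions (RandAugment is often bundled with the code rather than installed from PyPI, and pprint/csv ship with Python), so follow the repository's own setup if it differs.

# Minimal environment sketch (package names/versions are assumptions).
conda create -n mote python=3.10 -y
conda activate mote
pip install torch==2.0.1 torchvision
pip install tqdm dotmap pyyaml decord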

🔗 Data Preparation

Dataset

  • To download the Kinetics-400 and Kinetics-600 datasets, you can refer to mmaction2 or CVDF.
  • UCF-101 and HMDB-51 can be obtained from their official websites.
  • We rescale all videos to a height of 256 pixels. This is not required, but it saves considerable storage space and speeds up I/O (see the example command below).
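
As an illustration (not the repository's official preprocessing script), a video can be rescaled to a height of 256 pixels with ffmpeg while preserving the aspect ratio; the file names below are placeholders.

# Hedged example: rescale one video to height 256, keeping the aspect ratio (width rounded to an even value).
ffmpeg -i input.mp4 -vf "scale=-2:256" -c:a copy output_256.mp4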

Video Loader

By default, we decode the videos on the fly using decord.

Example of an annotation file (each line lists a video's relative path followed by its numeric class label):
  abseiling/aaa.mp4 0
  abseiling/bbb.mp4 0
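
For reference, one way to generate such a file is the shell sketch below; it assumes a directory layout with one sub-folder per class (e.g., <class_name>/<video>.mp4) and a classes.txt listing the class folders in label order. Both the layout and the file names are assumptions, not part of the official pipeline.

# Hypothetical helper: emit "relative_path label" lines for every .mp4 under each class folder.
DATA_ROOT=/path/to/videos
label=0
while read -r cls; do
  for v in "$DATA_ROOT/$cls"/*.mp4; do
    echo "$cls/$(basename "$v") $label"
  done
  label=$((label + 1))
done < classes.txt > train_annotation.txt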

(Optional) We can also extract videos into frames for faster reading. Please refer to the BIKE repo for the detailed steps.

🐱 Model Zoo

We train one unified model for both close-set and zero-shot video recognition tasks. The corresponding results, checkpoints, and configs are listed in the tables below.

Close-set performance (Kinetics-400)

Architecture | Input  | Views | Top-1 (%) | Top-5 (%) | Checkpoint | Config
ViT-B/16     | 8x224  | 1x1   | 81.8      | 95.9      | MoTE_B16   | config
ViT-B/16     | 8x224  | 4x3   | 83.0      | 96.3      | MoTE_B16   | config
ViT-L/14     | 8x224  | 4x3   | 86.8      | 97.5      | MoTE_L14   | config
ViT-L/14     | 16x224 | 4x3   | 87.2      | 97.7      | -          | -

Zero-shot performance (UCF-101, HMDB-51 & Kinetics-600)

Architecture | Input | Views | UCF-101 (%) | HMDB-51 (%) | Kinetics-600 (%) | Config
ViT-B/16     | 8x224 | 3x1   | 83.4        | 55.8        | 70.2             | UCF/HMDB/K600
ViT-L/14     | 8x224 | 3x1   | 88.7        | 61.4        | 78.4             | UCF/HMDB/K600

🚤 Training

By default, we train our model on Kinetics-400 on a single machine.

# We train the ViT-L/14 model using 6 layers of MoTE, with 4 temporal experts per layer.
bash scripts/run_train.sh configs/k400/k400_train_video_vitl-14-f8.yaml

🌊 Testing

Close-set evaluation. We adopt a single-view (1 x 1) or multi-view (4 x 3, i.e., 4 temporal clips x 3 spatial crops) evaluation protocol, with 8 frames per view.

# Single-view evaluation
bash scripts/run_test.sh k400_train_video_vitl-14-f8.yaml MoTE_L14.pt

# Multi-view evaluation
bash scripts/run_test.sh k400_train_video_vitl-14-f8.yaml MoTE_L14.pt --test_clips 4 --test_crops 3  

Zero-shot evaluation. We use 3 x 1 views (3 temporal clips x 1 spatial crop), with 8 frames per view.

# UCF-101
bash scripts/run_test_zeroshot.sh configs/ucf101/ucf_split1.yaml MoTE_L14.pt --test_clips 3

# HMDB-51
bash scripts/run_test_zeroshot.sh configs/hmdb51/hmdb_split1.yaml MoTE_L14.pt --test_clips 3

# Kinetics-600
bash scripts/run_test_zeroshot.sh configs/k600/k600_zs_test_split1.yaml MoTE_L14.pt --test_clips 3

📌 BibTeX & Citation

If our work is useful to you, please consider citing our paper using the following BibTeX entry.

@Article{MoTE,
  title={MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer},
  author={Zhu, Minghao and Wang, Zhengpu and Hu, Mengxian and Dang, Ronghao and Lin, Xiao and Zhou, Xun and Liu, Chengju and Chen, Qijun},
  journal={arXiv preprint arXiv:2410.10589},
  year={2024}
}

📝 Acknowledgement

Our code builds on BIKE and CLIP. We thank them for their excellent work!
