[NeurIPS' 2024] MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer


Overview

Official implementation of MoTE, an effective visual-language-to-video knowledge transfer framework that enjoys both superior generalization and specialization, striking an optimal trade-off between close-set and zero-shot performance in one unified model.

📬 Requirements

  • PyTorch (2.0.1 recommended)
  • RandAugment
  • pprint
  • tqdm
  • dotmap
  • yaml
  • csv
  • decord
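
As a reference, the dependencies can be installed with pip. The commands below are only a sketch: the exact package names and versions are assumptions (RandAugment is often bundled with the code rather than installed from PyPI, and pprint/csv ship with Python), so follow the repository's own setup if it differs.

# Minimal environment sketch (package names/versions are assumptions).
conda create -n mote python=3.10 -y
conda activate mote
pip install torch==2.0.1 torchvision
pip install tqdm dotmap pyyaml decord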

🔗 Data Preparation

Dataset

  • To download the Kinetics-400 and Kinetics-600 datasets, you can refer to mmaction2 or CVDF.
  • UCF-101 and HMDB-51 can be obtained from their official websites.
  • We rescale all videos to a height of 256 pixels. This is not required, but it saves considerable storage space and speeds up I/O (see the example command below).
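
As an illustration (not the repository's official preprocessing script), a video can be rescaled to a height of 256 pixels with ffmpeg while preserving the aspect ratio; the file names below are placeholders.

# Hedged example: rescale one video to height 256, keeping the aspect ratio (width rounded to an even value).
ffmpeg -i input.mp4 -vf "scale=-2:256" -c:a copy output_256.mp4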

Video Loader

By default, we decode the videos on the fly using decord.

Example of an annotation file (each line lists a video's relative path followed by its numeric class label):
  abseiling/aaa.mp4 0
  abseiling/bbb.mp4 0
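
For reference, one way to generate such a file is the shell sketch below; it assumes a directory layout with one sub-folder per class (e.g., <class_name>/<video>.mp4) and a classes.txt listing the class folders in label order. Both the layout and the file names are assumptions, not part of the official pipeline.

# Hypothetical helper: emit "relative_path label" lines for every .mp4 under each class folder.
DATA_ROOT=/path/to/videos
label=0
while read -r cls; do
  for v in "$DATA_ROOT/$cls"/*.mp4; do
    echo "$cls/$(basename "$v") $label"
  done
  label=$((label + 1))
done < classes.txt > train_annotation.txt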

(Optional) We can also extract videos into frames for faster reading. Please refer to the BIKE repo for the detailed steps.

🐱 Model Zoo

We train one unified model for both close-set and zero-shot video recognition tasks. The corresponding results, checkpoints, and configs are listed in the tables below.

Close-set performance (Kinetics-400)

Architecture | Input  | Views | Top-1 (%) | Top-5 (%) | Checkpoint | Config
ViT-B/16     | 8x224  | 1x1   | 81.8      | 95.9      | MoTE_B16   | config
ViT-B/16     | 8x224  | 4x3   | 83.0      | 96.3      | MoTE_B16   | config
ViT-L/14     | 8x224  | 4x3   | 86.8      | 97.5      | MoTE_L14   | config
ViT-L/14     | 16x224 | 4x3   | 87.2      | 97.7      | -          | -

Zero-shot performance (UCF-101, HMDB-51 & Kinetics-600)

Architecture | Input | Views | UCF-101 (%) | HMDB-51 (%) | Kinetics-600 (%) | Config
ViT-B/16     | 8x224 | 3x1   | 83.4        | 55.8        | 70.2             | UCF/HMDB/K600
ViT-L/14     | 8x224 | 3x1   | 88.7        | 61.4        | 78.4             | UCF/HMDB/K600

🚤 Training

By default, we train our model on Kinetics-400 on a single machine.

# We train the ViT-L/14 model using 6 layers of MoTE, with 4 temporal experts per layer.
bash scripts/run_train.sh configs/k400/k400_train_video_vitl-14-f8.yaml

🌊 Testing

Close-set evaluation. We adopt a single-view (1 x 1) or multi-view (4 x 3, i.e., 4 temporal clips x 3 spatial crops) evaluation protocol, with 8 frames per view.

# Single-view evaluation
bash scripts/run_test.sh k400_train_video_vitl-14-f8.yaml MoTE_L14.pt

# Multi-view evaluation
bash scripts/run_test.sh k400_train_video_vitl-14-f8.yaml MoTE_L14.pt --test_clips 4 --test_crops 3  

Zero-shot evaluation. We use 3 x 1 views (3 temporal clips x 1 spatial crop), with 8 frames per view.

# UCF-101
bash scripts/run_test_zeroshot.sh configs/ucf101/ucf_split1.yaml MoTE_L14.pt --test_clips 3

# HMDB-51
bash scripts/run_test_zeroshot.sh configs/hmdb51/hmdb_split1.yaml MoTE_L14.pt --test_clips 3

# Kinetics-600
bash scripts/run_test_zeroshot.sh configs/k600/k600_zs_test_split1.yaml MoTE_L14.pt --test_clips 3

📌 BibTeX & Citation

If our work is useful to you, please consider citing our paper using the following BibTeX entry.

@Article{MoTE,
  title={MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer},
  author={Zhu, Minghao and Wang, Zhengpu and Hu, Mengxian and Dang, Ronghao and Lin, Xiao and Zhou, Xun and Liu, Chengju and Chen, Qijun},
  journal={arXiv preprint arXiv:2410.10589},
  year={2024}
}

📝 Acknowledgement

Our code builds on BIKE and CLIP. We thank them for their excellent work!
