Single-modality Video Understanding

We currently release the code and models for:

Masked Pretraining
Short-term Video Understanding
- K400 and SthSthV2
Long-term Video Understanding
- Breakfast, COIN and LVU

Update

🔥 03/12/2024: Pretrained models on ImageNet-1K are released.

Datasets

You can find the dataset instructions in DATASET.

Model ZOO

You can find all the models and the scripts in MODEL_ZOO.

Usage

Masked Pretraining

We use CLIP pretrained models as the unmasked teachers by default:

Follow extract.ipynb to extract the visual encoder from CLIP.
Change MODEL_PATH in clip.py.

For training, you can simply run the pretraining scripts as follows:

bash ./exp/k400/videomamba_middle_mask/run_mask_pretrain.sh

Notes:

Change DATA_PATH to your data path before running the scripts.

--sampling_rate is set to 1 for sparse sampling.

The latest checkpoint is saved automatically during training, so we use a large --save_ckpt_freq.

For VideoMamba-M, we use CLIP-B-ViT as the teacher.
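
The sparse sampling mentioned in the notes above can be illustrated with a small sketch. It assumes that "sparse sampling" means the common segment-based scheme: the video is split into equal segments and one frame is taken from each. The function name sparse_sample_indices and its parameters are hypothetical and are not taken from this repository's dataset code.

```python
import random


def sparse_sample_indices(num_video_frames, num_segments, train=False, seed=None):
    """Return one frame index per segment, so the indices span the whole video.

    Hypothetical helper for illustration only; the repository's own dataset
    code may pick indices differently.
    """
    rng = random.Random(seed)
    seg_len = num_video_frames / num_segments
    indices = []
    for i in range(num_segments):
        start = int(seg_len * i)
        # Guard against empty segments when the video has few frames.
        end = max(start, int(seg_len * (i + 1)) - 1)
        if train:
            idx = rng.randint(start, end)  # random frame inside the segment
        else:
            idx = (start + end) // 2       # center frame of the segment
        indices.append(min(idx, num_video_frames - 1))
    return indices


# 8 frames from an 80-frame video, deterministic (evaluation-style) sampling.
print(sparse_sample_indices(80, 8))  # [4, 14, 24, 34, 44, 54, 64, 74]
```

With train=True, each index is drawn at random from its own segment, so the clip still covers the full video while varying between epochs.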

Short-term Video Understanding

For finetuning, you can simply run the fine-tuning scripts as follows:

bash ./exp/k400/videomamba_middle_mask/run_f8x224.sh

Notes:

Change DATA_PATH and PREFIX to your data path before running the scripts.

Set --finetune when using a masked pretrained model.

The best checkpoint will be automatically evaluated with --test_best.

Set --test_num_segment and --test_num_crop for different evaluation strategies.

To run evaluation only, set --eval.
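
As a rough illustration of how multi-view evaluation usually works, the sketch below assumes that --test_num_segment is the number of temporal clips and --test_num_crop the number of spatial crops per clip, and that per-view scores are averaged into one video-level prediction. That reading follows common practice in video recognition codebases and is an assumption here. The helper average_views and the toy scores are hypothetical.

```python
def average_views(view_scores):
    """Average per-class scores over all views of a single video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(v[c] for v in view_scores) / num_views for c in range(num_classes)]


test_num_segment = 2   # assumed: temporal clips per video
test_num_crop = 3      # assumed: spatial crops per clip
num_views = test_num_segment * test_num_crop  # 6 forward passes per video

# Toy two-class scores for the 6 views (made-up numbers, not real results).
view_scores = [[0.7, 0.3]] * 4 + [[0.4, 0.6]] * 2
assert len(view_scores) == num_views

video_scores = average_views(view_scores)
predicted = max(range(len(video_scores)), key=video_scores.__getitem__)
print(num_views, predicted)  # 6 0
```

Under this assumption, more segments or crops increase evaluation cost linearly, since each view needs its own forward pass.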

Long-term Video Understanding

For Breakfast and COIN, you can simply run the fine-tuning scripts as follows:

bash ./exp/breakfast/videomamba_middle_mask/run_f32x224.sh

LVU includes both classification and regression tasks. You can run the corresponding fine-tuning scripts as follows:

# classification
bash ./exp/lvu/run_class.sh
# regression
bash ./exp/lvu/run_regression.sh

Notes: For regression tasks, the data should be preprocessed with normalization as in ViS4mer.
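
The exact normalization is defined by ViS4mer's preprocessing and is not reproduced here. As a hedged illustration only, the sketch below assumes a standard z-score normalization of the regression targets, with statistics computed on the training split. All function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_fit(train_targets):
    """Compute mean and standard deviation on the training targets only."""
    mu = mean(train_targets)
    sigma = pstdev(train_targets) or 1.0  # avoid division by zero
    return mu, sigma


def zscore_apply(values, mu, sigma):
    """Map raw targets to normalized targets."""
    return [(v - mu) / sigma for v in values]


def zscore_invert(values, mu, sigma):
    """Map normalized predictions back to the original scale."""
    return [v * sigma + mu for v in values]


# Made-up targets, for illustration only.
train_targets = [2.0, 4.0, 6.0]
mu, sigma = zscore_fit(train_targets)
normalized = zscore_apply(train_targets, mu, sigma)
restored = zscore_invert(normalized, mu, sigma)
print(mu, [round(v, 3) for v in normalized])
```

If a scheme like this is used, predictions should be mapped back with the same training statistics before reporting errors on the original scale.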

⚠️ Using Trimmed Video

By default, we use the Kinetics_sparse dataset for all of these datasets. However, the ViS4mer authors use trimmed clips with a sliding window, which may improve the results. We also provide a dataset with a sliding window, used as follows:

# classification
bash ./exp/lvu/run_class_trim.sh
# regression
bash ./exp/lvu/run_regression_trim.sh

Notes:

Set trimmed to the length of the trimmed videos.

Set time_stride to the length of the sliding window.
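
To make the sliding-window idea concrete, here is a minimal sketch of how a long video could be cut into overlapping trimmed clips. The parameters window and stride are hypothetical: they may not map one-to-one onto trimmed and time_stride, and the unit (frames or seconds) is left open. The scripts in this repository remain the reference for the actual behavior.

```python
def sliding_windows(total_length, window, stride):
    """Return (start, end) pairs of trimmed clips covering a video.

    Hypothetical helper: `total_length`, `window` and `stride` share one unit
    (for example seconds). A final window is added so the tail is covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if total_length <= window:
        return [(0, total_length)]  # video is shorter than one window
    starts = list(range(0, total_length - window + 1, stride))
    if starts[-1] + window < total_length:
        starts.append(total_length - window)  # cover the remaining tail
    return [(s, s + window) for s in starts]


print(sliding_windows(100, 60, 20))  # [(0, 60), (20, 80), (40, 100)]
print(sliding_windows(100, 60, 30))  # [(0, 60), (30, 90), (40, 100)]
```

Each window can then be treated as its own training sample, which is one plausible reason trimmed clips may help on long videos.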
