Single-modality Video Understanding

We currently release the code and models for:

  • Masked Pretraining

  • Short-term Video Understanding

    • K400 and SthSthV2
  • Long-term Video Understanding

    • Breakfast, COIN and LVU


  • 🔥 03/12/2024: Pretrained models on ImageNet-1K are released.


You can find the dataset instructions in DATASET.

Model ZOO

You can find all the models and the scripts in MODEL_ZOO.


Masked Pretraining

We use CLIP pretrained models as the unmasked teachers by default.

For training, you can simply run the pretraining scripts as follows:

bash ./exp/k400/videomamba_middle_mask/


  1. Change DATA_PATH to your data path before running the scripts.
  2. --sampling_rate is set to 1 for sparse sampling.
  3. The latest checkpoint is saved automatically during training, so we use a large --save_ckpt_freq.
  4. For VideoMamba-M, we use CLIP-B-ViT as the teacher.
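To make the sparse sampling mentioned in note 2 concrete, here is a minimal sketch of segment-based sparse frame sampling. It assumes the common scheme of splitting a video into equal segments and taking one frame from each; the function name and arguments are hypothetical and this is not the repository's data loader.

```python
import random


def sparse_sample_indices(total_frames, num_frames, train=False, seed=None):
    """Pick one frame index from each of `num_frames` equal segments.

    Hypothetical helper for illustration only. In evaluation mode the
    centre of each segment is used; in training mode a random position
    inside each segment is drawn.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    rng = random.Random(seed)
    seg_len = total_frames / num_frames  # may be fractional
    indices = []
    for i in range(num_frames):
        start, end = seg_len * i, seg_len * (i + 1)
        pos = rng.uniform(start, end) if train else (start + end) / 2
        # Clamp so a draw at the very end of the video stays in range.
        indices.append(min(int(pos), total_frames - 1))
    return indices


if __name__ == "__main__":
    print(sparse_sample_indices(80, 8))              # deterministic centres
    print(sparse_sample_indices(80, 8, train=True))  # random per segment
```

Because every segment contributes one frame, the sampled clip covers the whole video regardless of its length, which is the usual motivation for sparse over dense sampling.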

Short-term Video Understanding

For finetuning, you can simply run the fine-tuning scripts as follows:

bash ./exp/k400/videomamba_middle_mask/


  1. Change DATA_PATH and PREFIX to your data path before running the scripts.
  2. Set --finetune when using a masked pretrained model.
  3. The best checkpoint will be automatically evaluated with --test_best.
  4. Set --test_num_segment and --test_num_crop for different evaluation strategies.
  5. To only run evaluation, just set --eval.
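As an illustration of what the --test_num_segment and --test_num_crop settings in note 4 control, the sketch below enumerates the temporal-segment and spatial-crop views of one video and averages the per-view predictions. Averaging softmax probabilities is an assumption here (some codebases average logits instead), and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def enumerate_views(num_segment, num_crop):
    """List every (segment, crop) pair evaluated for one video."""
    return [(s, c) for s in range(num_segment) for c in range(num_crop)]


def aggregate_views(view_logits):
    """Average class probabilities across all views of one video.

    `view_logits` holds one list of class logits per view.
    """
    if not view_logits:
        raise ValueError("at least one view is required")
    probs = [softmax(v) for v in view_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


if __name__ == "__main__":
    views = enumerate_views(num_segment=4, num_crop=3)
    print(len(views), "views per video")
    scores = aggregate_views([[3.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 1.0]])
    print("predicted class:", scores.index(max(scores)))
```

More segments and crops mean more forward passes per video, so the evaluation cost grows with the product of the two settings.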

Long-term Video Understanding

For Breakfast and COIN, you can simply run the fine-tuning scripts as follows:

bash ./exp/breakfast/videomamba_middle_mask/

LVU has both classification and regression tasks. You can simply run the fine-tuning scripts as follows:

# classification
bash ./exp/lvu/
# regression
bash ./exp/lvu/

Note: For regression tasks, the data should be preprocessed with normalization as in ViS4mer.
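For readers unfamiliar with target normalization, the sketch below shows one common form, z-score standardization of regression targets. This is an assumption for illustration: the exact recipe is the one defined by ViS4mer and is not reproduced here, and the helper names are hypothetical.

```python
import statistics


def fit_zscore(train_targets):
    """Compute mean and population std from the training targets only."""
    mean = statistics.fmean(train_targets)
    std = statistics.pstdev(train_targets)
    if std == 0:
        std = 1.0  # avoid division by zero for constant targets
    return mean, std


def normalize(values, mean, std):
    """Map raw targets to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Map model outputs back to the original target scale."""
    return [v * std + mean for v in values]


if __name__ == "__main__":
    train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    mean, std = fit_zscore(train)
    print(normalize(train, mean, std))
```

The statistics are fitted on the training split and reused for validation and test targets, so that no information from held-out data leaks into preprocessing.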

⚠️ Using Trimmed Video

By default, we use the Kinetics_sparse dataset for all of the datasets above. However, the ViS4mer authors use trimmed clips with a sliding window, which may improve the results. We also provide a dataset with a sliding window, used as follows:

# classification
bash ./exp/lvu/
# regression
bash ./exp/lvu/


  1. Set trimmed to the length of the trimmed videos.
  2. Set time_stride to the length of the sliding window.
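The two settings above can be pictured with a minimal sliding-window sketch. It assumes that trimmed is the length of each clip and that time_stride is the step between the starts of consecutive clips, both in the same unit (seconds or frames); the function is hypothetical and is not the repository's dataset code.

```python
def sliding_windows(duration, trimmed, time_stride):
    """Return (start, end) pairs of trimmed clips covering one video.

    All arguments share one unit (for example seconds). A video shorter
    than `trimmed` yields a single clip spanning the whole video.
    """
    if trimmed <= 0 or time_stride <= 0:
        raise ValueError("trimmed and time_stride must be positive")
    if duration <= trimmed:
        return [(0, duration)]
    windows = []
    start = 0
    # Keep only windows that fit entirely inside the video.
    while start + trimmed <= duration:
        windows.append((start, start + trimmed))
        start += time_stride
    return windows


if __name__ == "__main__":
    for window in sliding_windows(duration=100, trimmed=60, time_stride=10):
        print(window)
```

A smaller time_stride produces more, heavily overlapping clips per video, which increases the number of training samples at the cost of longer epochs.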