We use multi-node training on a SLURM cluster with submitit to produce the results and models in the paper. Please install submitit in your conda environment:

```bash
pip install submitit
```
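To quickly confirm the installation (a minimal check, assuming a standard pip setup):

```bash
# Verify that submitit is importable and print its version
python -c "import submitit; print(submitit.__version__)"
```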
Please refer to PRETRAIN.md.
| Visual Encoder | Text Decoder | METEOR | ROUGE-L | CIDEr | Pre-trained Vis. Encoder (md5) | Checkpoint (md5) |
|---|---|---|---|---|---|---|
| TSF-B | GPT-2 | 0.282 | 0.517 | 0.833 | download (dbcc4d) | download (68a71f) |
| TSF-L@HR | GPT-2 XL | 0.298 | 0.539 | 0.977 | download (5c69b8) | download (443263) |
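The hex characters in parentheses are (per the table header) md5 checksums, presumably the leading characters of each file's digest. A quick integrity check after downloading (the file name below is hypothetical):

```bash
# Print the first 6 characters of the md5 digest and compare them with the
# value in the table, e.g. dbcc4d for the TSF-B visual encoder
md5sum downloaded_checkpoint.pth | cut -c1-6
```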
To evaluate the narrator on the Ego4D val split:

```bash
torchrun --nproc_per_node=1 \
    eval_narrator.py \
    --caption-top-p 0.95 --caption-temperature 0.7 \
    --eval-freq 10000 \
    --resume $CHECKPOINT
```
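Here `--caption-top-p 0.95 --caption-temperature 0.7` control the nucleus sampling used to generate narrations, and `$CHECKPOINT` points to a downloaded narrator checkpoint, e.g. (the local file name below is hypothetical):

```bash
# Hypothetical local path to the TSF-B + GPT-2 narrator checkpoint from the table above
CHECKPOINT=checkpoints/narrator_tsf_b_gpt2.pth
```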
| Method | Backbone | EK-100 MIR avg. mAP | EK-100 MIR avg. nDCG | Charades-Ego mAP^ | EGTEA mean acc. | EgoMCQ intra-video acc. | Checkpoint |
|---|---|---|---|---|---|---|---|
| Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 | Epoch 1, best epoch |
| LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 | Epoch 1^, Epoch 5 |
| LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 | Epoch 1^, Epoch 3 |
^ Note that the pre-trained checkpoint used to evaluate CharadesEgo is different from the one used to evaluate the other datasets. Specifically, we use the checkpoint at epoch 1 to zero-shot evaluate CharadesEgo and the checkpoint that achieves the best average mAP on EK-100 MIR to evaluate the other datasets, as is done in EgoVLP. Our guess is that since CharadesEgo videos (captured by head-mounted mobile cameras) are visually different from Ego4D/EPIC-Kitchens videos (captured by professional action cameras, e.g. GoPro), pre-training on Ego4D videos for longer may introduce a domain discrepancy.
^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported ones, especially on EK-100 MIR, since we evaluate on raw videos directly (for more details, see Appendix F & Table 10 in our paper).
1. EK-100 MIR

```bash
python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length 4 --resume $PATH
```

Increasing the number of frames per clip, e.g. `--clip-length 16`, is expected to give better performance, as shown below.
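For example, the same command with 16 frames per clip:

```bash
# Same EK-100 MIR evaluation with denser temporal sampling (16 frames per
# clip); slower, but expected to yield better retrieval performance
python eval_zeroshot.py --dataset ek100_mir --root datasets/EK100/video_ht256px/ --clip-length 16 --resume $PATH
```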
2. EK-100 CLS

```bash
python eval_zeroshot.py --dataset ek100_cls --metadata-val datasets/EK100/epic-kitchens-100-annotations/EPIC_100_validation.csv --resume $PATH
```

3. Charades-Ego

```bash
python eval_zeroshot.py --dataset charades_ego --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv --root datasets/CharadesEgo/CharadesEgo_v1_480/ --clip-length 16 --sparse-sample --resume $PATH
```

4. EGTEA

```bash
python eval_zeroshot.py --dataset egtea --metadata-val datasets/EGTEA/test_split1.txt --root datasets/EGTEA/cropped_clips/ --clip-length 16 --clip-stride 2 --num-crops 3 --num-clips 10 --resume $PATH
```

5. EgoMCQ

```bash
python eval_zeroshot.py --dataset ego4d_mcq --metadata-val datasets/Ego4D/egomcq.json --root datasets/Ego4D/video_5min_chunks_288px/ --clip-length 4 --resume $PATH --use-half -j 4
```
| Method | Backbone | avg mAP | avg nDCG | Pretrain (md5) | Fine-tuned checkpoint | Training log |
|---|---|---|---|---|---|---|
| LAVILA | TSF-B | 50.5 | 65.0 | download (d73a9c) | download | download |
| LAVILA | TSF-L | 50.9 | 66.5 | download (c89337) | download | download |
Training and evaluating scripts
```bash
# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
    --pretrain-model $PATH \
    --use-checkpoint --nodes 4

# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
    --pretrain-model $PATH \
    --batch-size 4 \
    --use-checkpoint --nodes 4
```
Alternatively, you can fine-tune on a single node (8 GPUs) with torchrun:

```bash
torchrun --nproc_per_node=8 \
    main_finetune_retrieval.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --use-checkpoint
```
Note that you might see a slight performance drop when training on a single node compared to multiple nodes (everything else being equal) because of the smaller total batch size.
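If GPU memory allows, you can partially compensate by raising the per-GPU `--batch-size` so that the total batch size approaches the multi-node recipe; a minimal sketch (the value shown is illustrative, not a tuned setting):

```bash
# Single-node run with a larger per-GPU batch size to partially offset the
# 4x smaller total batch size; adjust --batch-size to fit your GPU memory
torchrun --nproc_per_node=8 \
    main_finetune_retrieval.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --batch-size 16 \
    --use-checkpoint
```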
Evaluation is done every `--eval-freq 5` epochs by default during fine-tuning. If you want to evaluate a checkpoint after fine-tuning, switch to `--evaluate` mode and specify the checkpoint path with `--resume $FINETUNED_CHECKPOINT`:
```bash
torchrun --nproc_per_node=1 \
    main_finetune_retrieval.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --use-checkpoint \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT
```
| Method | Backbone | video mAP | Pretrain^ (md5) | Fine-tuned checkpoint | Training log |
|---|---|---|---|---|---|
| LAVILA | TSF-B | 33.7 | download (02dbb9) | download | download |
| LAVILA | TSF-L | 36.1 | download (9a25de) | download | download |
^ Note that the pre-trained checkpoint for fine-tuning CharadesEgo is different from the one for fine-tuning EK-100 or EGTEA, for the same reason stated above.
Training and evaluating scripts
```bash
# TimeSformer-Base
python run_with_submitit_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --epochs 10 \
    --save-freq 1 --eval-freq 1 \
    --sparse-sample \
    --pretrain-model $PATH \
    --use-checkpoint --nodes 4

# TimeSformer-Large
python run_with_submitit_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --epochs 10 \
    --save-freq 1 --eval-freq 1 \
    --sparse-sample \
    --pretrain-model $PATH \
    --batch-size 4 \
    --use-checkpoint --nodes 4
```
To evaluate a fine-tuned CharadesEgo checkpoint:

```bash
torchrun --nproc_per_node=1 \
    main_finetune_retrieval.py \
    --dataset charades_ego \
    --metadata datasets/CharadesEgo/CharadesEgo/metadata_filtered_train.pkl \
    --metadata-val datasets/CharadesEgo/CharadesEgo/CharadesEgo_v1_test_only1st.csv \
    --root datasets/CharadesEgo/CharadesEgo_v1_480/ \
    --output-dir $OUT_DIR \
    --sparse-sample \
    --pretrain-model $PATH \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT
```
| Method | Backbone | V+N+A multi-head | Verb top-1 | Noun top-1 | Action top-1 | Pretrain (md5) | Fine-tuned checkpoint | Training log |
|---|---|---|---|---|---|---|---|---|
| LAVILA | TSF-B | no | 67.7 | 56.7 | 46.2 | download (d73a9c) | download | download |
| LAVILA | TSF-B | yes | 69.0 | 58.4 | 46.9 | download (d73a9c) | download | download |
| LAVILA | TSF-L | yes | 72.0 | 62.9 | 51.0 | download (c89337) | download | download |
Training and evaluating scripts
```bash
# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
    --pretrain-model $PATH \
    --use-vn-classifier --num-classes 97 300 3806 \
    --use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
    --use-checkpoint --nodes 1

# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
    --pretrain-model $PATH \
    --use-vn-classifier --num-classes 97 300 3806 \
    --use-sgd --wd 4e-5 --lr-multiplier-on-backbone 0.1 \
    --use-checkpoint --nodes 4
```
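Here `--use-vn-classifier --num-classes 97 300 3806` attaches three classification heads for the 97 verbs, 300 nouns, and 3806 actions, matching the Verb/Noun/Action columns above. No evaluation command is listed for EK-100 CLS in this section; by analogy with the EGTEA evaluation further below, a sketch might look like the following (the flags are assumed to carry over to this dataset, not verified):

```bash
# Hypothetical single-process evaluation for EK-100 CLS, mirroring the EGTEA
# evaluation command below; assumes main_finetune_classification.py accepts
# the same --evaluate/--resume flags for this dataset
torchrun --nproc_per_node=1 \
    main_finetune_classification.py \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --use-vn-classifier --num-classes 97 300 3806 \
    --use-sgd --wd 4e-5 \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT \
    --use-half
```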
| Method | Backbone | mean Acc. | Pretrain (md5) | Fine-tuned checkpoint | Training log |
|---|---|---|---|---|---|
| LAVILA | TSF-B | 70.12 | download (d73a9c) | download | download |
| LAVILA | TSF-L | 76.00 | download (c89337) | download | download |
Training and evaluating scripts
```bash
# TimeSformer-Base
python run_with_submitit_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --use-checkpoint --nodes 1

# TimeSformer-Large
python run_with_submitit_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --batch-size 4 \
    --use-checkpoint --nodes 4
```
To evaluate a fine-tuned EGTEA checkpoint (3 spatial crops × 10 temporal clips, i.e. 30 views per video):

```bash
torchrun --nproc_per_node=1 \
    main_finetune_classification.py \
    --dataset egtea \
    --metadata-train datasets/EGTEA/train_split1.txt \
    --metadata-val datasets/EGTEA/test_split1.txt \
    --root datasets/EGTEA/cropped_clips/ \
    --output-dir $OUT_DIR \
    --pretrain-model $PATH \
    --num-classes 106 \
    --use-sgd --wd 4e-5 \
    --evaluate \
    --resume $FINETUNED_CHECKPOINT \
    --num-crops 3 --num-clips 10 \
    --use-half
```