# AdaTAD

**End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames**
Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem

## Abstract

Recently, temporal action detection (TAD) has seen significant performance improvements with end-to-end training. However, due to the memory bottleneck, only models of limited scale trained on limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption of end-to-end training and thereby scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant gains in detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from having to adapt to the TAD task by updating only the parameters in TIA. TIA also yields better TAD representations by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, making ours the first end-to-end model to outperform the best feature-based methods.
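
For intuition about the mechanism, the sketch below shows one plausible shape for a temporal-informative adapter: a low-rank bottleneck whose middle step is a depthwise convolution over the time axis, added residually to the output of a frozen backbone block. This is a hedged illustration based only on the abstract; the class name, sizes, and placement are assumptions, not the repository's actual TIA module.

```python
# Illustrative adapter sketch (NOT the repository's TIA implementation):
# a bottleneck adapter whose middle step mixes information across
# neighboring frames via a depthwise temporal convolution.
import torch
import torch.nn as nn

class TemporalAdapterSketch(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)        # low-rank down-projection
        self.temporal = nn.Conv1d(                    # depthwise conv over time
            bottleneck, bottleneck, kernel_size,
            padding=kernel_size // 2, groups=bottleneck,
        )
        self.up = nn.Linear(bottleneck, dim)          # project back to model width
        nn.init.zeros_(self.up.weight)                # zero-init so the adapter
        nn.init.zeros_(self.up.bias)                  # starts as an identity residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame tokens from a frozen backbone block
        h = self.down(x)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)
        return x + self.up(h)

# Only the adapter would receive gradients; the backbone stays frozen, e.g.:
#   for p in backbone.parameters():
#       p.requires_grad_(False)
```

Because the up-projection is zero-initialized, the adapter acts as an identity at the start of training, so only its small parameter set needs gradients and optimizer state, which is where the memory savings of adapter-style tuning come from.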

## Prepare the pretrained VideoMAE checkpoints

Before running the experiments, please download the pretrained VideoMAE model weights (converted from the original repo) and put them under the path `./pretrained/`.

| Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
| :---: | :---: | :---: | :---: | :---: |
| VideoMAE-S | K400 | K400 | Url | Google Drive |
| VideoMAE-B | K400 | K400 | Url | mmaction2 |
| VideoMAE-L | K400 | K400 | Url | mmaction2 |
| VideoMAE-H | K400 | K400 | Url | Google Drive |
| VideoMAEv2-g | Hybrid | K710 | Url | Not Provided |
- Note that we are not allowed to redistribute VideoMAEv2's checkpoints. You can fill out the official request form, then convert the checkpoint with the following command.

```shell
python tools/model_converters/convert_videomaev2.py \
    vit_g_hybrid_pt_1200e_k710_ft.pth pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth
```
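
To sanity-check the conversion, a quick load test like the one below can catch a corrupted download or a wrong output path. This is a hypothetical snippet, not a tool shipped with the repository.

```python
# Hypothetical sanity check: load the converted checkpoint on CPU and peek
# at a few parameter names. Any failure here points at the conversion step.
import torch

state = torch.load(
    "pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth",
    map_location="cpu",
)
# Checkpoints are often wrapped in a dict under "state_dict"; fall back to
# the object itself otherwise (an assumption about the file layout).
weights = state.get("state_dict", state) if isinstance(state, dict) else state
print(list(weights)[:5])
```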

## ActivityNet Results

Please refer to README.md to prepare the raw videos of ActivityNet.

| Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-S | 4 | AdaTAD | 768 | 160 | CUHK | 56.23 | 38.90 | 8.88 | 37.81 | config | model \| log |
| VideoMAE-B | 4 | AdaTAD | 768 | 160 | CUHK | 56.72 | 39.44 | 9.54 | 38.35 | config | model \| log |
| VideoMAE-L | 4 | AdaTAD | 768 | 160 | CUHK | 57.73 | 40.53 | 9.96 | 39.21 | config | model \| log |
| VideoMAE-H | 4 | AdaTAD | 768 | 160 | CUHK | 57.77 | 40.60 | 9.78 | 39.31 | config | model \| log |
| VideoMAEV2-g | 4 | AdaTAD | 768 | 160 | CUHK | 58.42 | 40.89 | 10.01 | 39.77 | config | model \| log |
| VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | CUHK | 58.57 | 41.19 | 10.27 | 39.86 | config | model \| log |
| VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo | 61.74 | 43.17 | 10.68 | 41.85 | config | log |
| VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo2 | 63.59 | 44.31 | 10.66 | 42.90 | config | log |
- To train the model on ActivityNet, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/anet/e2e_anet_videomae_s_192x4_160_adapter.py
```
- To use the same checkpoint but test with another classifier, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/adatad/anet/e2e_anet_videomaev2_g_192x4_224_adapter_internvideo2.py --checkpoint epoch_10_cba1017a.pth
```

[NEW] We provide the following checkpoint, which does not require an external classifier but directly trains a 200-class classification head, for the convenience of zero-shot inference.

| Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-L | 4 | AdaTAD | 768 | 224 | x | 59.00 | 39.96 | 9.15 | 39.15 | config | model \| log |

## THUMOS-14 Results

Please refer to README.md to prepare the raw videos of THUMOS.

| Backbone | GPUs | Setting | Frames | Img Size | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | ave. mAP | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-S | 2 | AdaTAD | 768 | 160 | 83.90 | 79.01 | 72.38 | 61.57 | 48.27 | 69.03 | config | model \| log |
| VideoMAE-B | 2 | AdaTAD | 768 | 160 | 85.95 | 81.86 | 75.02 | 63.29 | 49.56 | 71.14 | config | model \| log |
| VideoMAE-L | 2 | AdaTAD | 768 | 160 | 87.17 | 83.58 | 76.88 | 66.81 | 53.13 | 73.51 | config | model \| log |
| VideoMAE-H | 2 | AdaTAD | 768 | 160 | 88.42 | 84.63 | 78.72 | 69.04 | 53.95 | 74.95 | config | model \| log |
| VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 88.63 | 85.39 | 79.17 | 68.34 | 53.79 | 75.06 | config | model \| log |
| VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 89.93 | 86.83 | 81.24 | 69.97 | 57.36 | 77.07 | config | model \| log |
| VideoMAEV2-g | 2 | AdaTAD† | 1536 | 224 | 88.43 | 84.72 | 77.88 | 68.51 | 53.72 | 74.65 | config | model \| log |
- To train the model on THUMOS, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py
```
- To search the adapter's learning rate, or to change other hyper-parameters, you can run the following command; a sketch that wraps this into a small sweep follows below.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py \
    --cfg-options optimizer.backbone.custom.0.lr=1e-4 --id 1
```
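
If you want to try several learning rates in one go, a throwaway driver like the one below works; it is not a script from this repository, and the candidate values are assumptions. The `--id` flag keeps each run's outputs separate.

```python
# Hypothetical sweep driver: launch one THUMOS training run per candidate
# adapter learning rate, tagging each run with a distinct --id.
import subprocess

CONFIG = "configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py"

for run_id, lr in enumerate([5e-5, 1e-4, 2e-4], start=1):
    subprocess.run(
        [
            "torchrun", "--nnodes=1", "--nproc_per_node=2",
            "--rdzv_backend=c10d", "--rdzv_endpoint=localhost:0",
            "tools/train.py", CONFIG,
            "--cfg-options", f"optimizer.backbone.custom.0.lr={lr}",
            "--id", str(run_id),
        ],
        check=True,  # abort the sweep if a run fails
    )
```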

## EPIC-KITCHENS Results

Before running the experiments, please download the EPIC-pretrained VideoMAE weights and put them under the path `./pretrained/`.

| Model | Pretrain Dataset | Finetune Dataset | Checkpoints |
| :---: | :---: | :---: | :---: |
| VideoMAE-L | InternVideo1 | EPIC-Noun | Google Drive |
| VideoMAE-L | InternVideo1 | EPIC-Verb | Google Drive |

Please refer to README.md to prepare the raw videos of EPIC-Kitchens.

| Subset | Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Noun | VideoMAE-Noun | 2 | AdaTAD | 768x8 | 160 | 33.88 | 32.41 | 30.58 | 27.66 | 22.67 | 29.44 | config | model \| log |
| Verb | VideoMAE-Verb | 2 | AdaTAD | 768x8 | 160 | 33.02 | 32.43 | 30.51 | 27.80 | 24.69 | 29.69 | config | model \| log |
- To train the model on EPIC-Kitchens, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/epic/e2e_epic_videomae_l_ft_768x8_160_adapter_noun.py
```

## Ego4D-MQ Results

Before running the experiments, please download the InternVideo1-MQ weights and put them under the path `./pretrained/`.

| Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
| :---: | :---: | :---: | :---: | :---: |
| InternVideo1-MQ | InternVideo1-K700 | Ego4D-Verb + Ego4D-MQ | Url | Google Drive |

Please refer to README.md to prepare the raw videos of Ego4D-MQ.

| Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| InternVideo1-MQ | 2 | AdaTAD | 1800x4 | 192 | 33.69 | 31.19 | 28.37 | 26.12 | 22.67 | 28.41 | config | model \| log |
- To train the model on Ego4D-MQ, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/ego4d/e2e_ego4d_internvideo_1800x4_192_adapter_lr4e-4.py
```

## Multi-THUMOS Results

Please refer to README.md to prepare the raw videos of Multi-THUMOS.

| Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-S | 2 | AdaTAD | 768 | 160 | 61.34 | 46.74 | 26.88 | 40.77 | config | model \| log |
| VideoMAE-B | 2 | AdaTAD | 768 | 160 | 63.90 | 48.74 | 28.72 | 42.76 | config | model \| log |
| VideoMAE-L | 2 | AdaTAD | 768 | 160 | 66.06 | 51.80 | 31.73 | 45.15 | config | model \| log |
| VideoMAE-H | 2 | AdaTAD | 768 | 160 | 67.20 | 52.99 | 32.70 | 46.02 | config | model \| log |
| VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 68.23 | 53.87 | 33.03 | 46.74 | config | model \| log |
| VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 71.11 | 55.83 | 34.86 | 48.73 | config | model \| log |
- To train the model on Multi-THUMOS, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/multi_thumos/e2e_multithumos_videomae_s_768x1_160_adapter.py
```

## Charades Results

Please refer to README.md to prepare the raw videos of Charades.

| Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-S | 2 | AdaTAD | 512 | 160 | 35.89 | 27.43 | 16.35 | 24.14 | config | model \| log |
| VideoMAE-B | 2 | AdaTAD | 512 | 160 | 40.84 | 31.75 | 20.05 | 27.99 | config | model \| log |
| VideoMAE-L | 2 | AdaTAD | 512 | 160 | 47.00 | 37.01 | 23.05 | 32.31 | config | model \| log |
| VideoMAE-H | 2 | AdaTAD | 512 | 160 | 48.76 | 38.80 | 24.85 | 33.94 | config | model \| log |
| VideoMAEV2-g | 4 | AdaTAD | 512 | 160 | 53.72 | 42.91 | 27.69 | 37.56 | config | model \| log |
- To train the model on Charades, you can run the following command.

```shell
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/charades/e2e_charades_videomae_s_512x1_160_adapter.py
```

## Citation

```BibTeX
@InProceedings{Liu_2024_CVPR,
    author    = {Liu, Shuming and Zhang, Chen-Lin and Zhao, Chen and Ghanem, Bernard},
    title     = {End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18591-18601}
}
```