End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem
Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scale and limited data volume can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption of end-to-end training and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significantly improved detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, making ours the first end-to-end model to outperform the best feature-based methods.
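To make the adapter idea concrete, below is a minimal PyTorch sketch of an adapter-style temporal module: it down-projects the channel dimension, aggregates context from adjacent frames with a depthwise temporal convolution, and adds the result back residually, so only the adapter parameters require gradients while the backbone stays frozen. The class name, bottleneck width, and kernel size are illustrative assumptions, not the exact TIA implementation in this repo.

```python
# Illustrative sketch of an adapter-style temporal module (not the actual TIA code).
import torch
import torch.nn as nn

class TemporalAdapterSketch(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(
            bottleneck, bottleneck, kernel_size,
            padding=kernel_size // 2, groups=bottleneck,  # depthwise over time
        )
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) features from a frozen backbone block
        h = self.act(self.down(x))            # (B, T, bottleneck)
        h = h.transpose(1, 2)                 # (B, bottleneck, T) for Conv1d
        h = self.temporal(h).transpose(1, 2)  # aggregate adjacent-frame context
        return x + self.up(h)                 # residual: backbone output is preserved

# Usage: insert after each (frozen) backbone block and train only the adapter parameters.
x = torch.randn(2, 16, 384)                  # e.g. 16 temporal tokens with dim 384
print(TemporalAdapterSketch(384)(x).shape)   # torch.Size([2, 16, 384])
```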
Before running the experiments, please download the pretrained VideoMAE model weights (converted from the original repo) and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
---|---|---|---|---|
VideoMAE-S | K400 | K400 | Url | Google Drive |
VideoMAE-B | K400 | K400 | Url | mmaction2 |
VideoMAE-L | K400 | K400 | Url | mmaction2 |
VideoMAE-H | K400 | K400 | Url | Google Drive |
VideoMAEv2-g | Hybrid | K710 | Url | Not Provided |
- Note that we are not allowed to redistribute VideoMAEv2's checkpoints. You can fill out the official request form and then convert the downloaded checkpoint with the following command; a generic sketch of the conversion step follows.
python tools/model_converters/convert_videomaev2.py \
vit_g_hybrid_pt_1200e_k710_ft.pth pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth
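For reference, the sketch below shows what such a conversion generally looks like: load the released checkpoint, unwrap and rename the state-dict keys, and save the result under the filename expected by the configs. The key renaming here is only an illustrative assumption; the authoritative logic is in tools/model_converters/convert_videomaev2.py.

```python
# Generic illustration of a checkpoint conversion step (assumed key mapping, for
# illustration only; use tools/model_converters/convert_videomaev2.py in practice).
import torch

src = "vit_g_hybrid_pt_1200e_k710_ft.pth"
dst = "pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth"

ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt.get("module", ckpt.get("model", ckpt))  # unwrap common containers

converted = {}
for name, tensor in state_dict.items():
    # Illustrative rename only: drop a leading "encoder." prefix if present.
    converted[name.removeprefix("encoder.")] = tensor

torch.save({"state_dict": converted}, dst)
```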
Please refer to README.md to prepare the raw videos of ActivityNet.
Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 4 | AdaTAD | 768 | 160 | CUHK | 56.23 | 38.90 | 8.88 | 37.81 | config | model | log |
VideoMAE-B | 4 | AdaTAD | 768 | 160 | CUHK | 56.72 | 39.44 | 9.54 | 38.35 | config | model | log |
VideoMAE-L | 4 | AdaTAD | 768 | 160 | CUHK | 57.73 | 40.53 | 9.96 | 39.21 | config | model | log |
VideoMAE-H | 4 | AdaTAD | 768 | 160 | CUHK | 57.77 | 40.60 | 9.78 | 39.31 | config | model | log |
VideoMAEV2-g | 4 | AdaTAD | 768 | 160 | CUHK | 58.42 | 40.89 | 10.01 | 39.77 | config | model | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | CUHK | 58.57 | 41.19 | 10.27 | 39.86 | config | model | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo | 61.74 | 43.17 | 10.68 | 41.85 | config | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo2 | 63.59 | 44.31 | 10.66 | 42.90 | config | log |
- To train the model on ActivityNet, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/anet/e2e_anet_videomae_s_192x4_160_adapter.py
- To use the same checkpoint but test with another external classifier, you can run the following command; a sketch of the usual score-fusion step follows.
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/adatad/anet/e2e_anet_videomaev2_g_192x4_224_adapter_internvideo2.py --checkpoint epoch_10_cba1017a.pth
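As background, ActivityNet-style evaluation commonly combines class-agnostic proposal scores with video-level scores from an external classifier (e.g., CUHK or InternVideo). The snippet below is a sketch of that standard fusion recipe, not necessarily this repo's exact post-processing: each proposal keeps its boundaries, and its confidence is multiplied by the top video-level class scores to produce labeled detections.

```python
# Common score-fusion recipe for ActivityNet-style evaluation (sketch of the usual
# practice; the repo's post-processing may differ in details).
import numpy as np

def fuse_with_external_classifier(proposals, class_scores, top_k=2):
    """proposals: list of (start, end, confidence); class_scores: (num_classes,) array."""
    top_classes = np.argsort(class_scores)[::-1][:top_k]
    detections = []
    for start, end, conf in proposals:
        for cls in top_classes:
            # Label each proposal with a top class; combine the two confidences.
            detections.append((start, end, int(cls), conf * float(class_scores[cls])))
    return detections

# Toy usage with a hypothetical 200-class score vector for one video.
scores = np.random.rand(200)
scores /= scores.sum()
print(fuse_with_external_classifier([(3.2, 10.5, 0.9)], scores))
```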
[NEW] We provide the following checkpoints, which do not require an external classifier but instead train a 200-class classification head directly, for convenient zero-shot inference.
Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-L | 4 | AdaTAD | 768 | 224 | x | 59.00 | 39.96 | 9.15 | 39.15 | config | model | log |
Please refer to README.md to prepare the raw videos of THUMOS.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 768 | 160 | 83.90 | 79.01 | 72.38 | 61.57 | 48.27 | 69.03 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 768 | 160 | 85.95 | 81.86 | 75.02 | 63.29 | 49.56 | 71.14 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 768 | 160 | 87.17 | 83.58 | 76.88 | 66.81 | 53.13 | 73.51 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 768 | 160 | 88.42 | 84.63 | 78.72 | 69.04 | 53.95 | 74.95 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 88.63 | 85.39 | 79.17 | 68.34 | 53.79 | 75.06 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 89.93 | 86.83 | 81.24 | 69.97 | 57.36 | 77.07 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD† | 1536 | 224 | 88.43 | 84.72 | 77.88 | 68.51 | 53.72 | 74.65 | config | model | log |
- To train the model on THUMOS, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py
- To search the adapter's learning rate or change other hyper-parameters, you can run the following command; a minimal sweep launcher is sketched after it.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py \
--cfg-options optimizer.backbone.custom.0.lr=1e-4 --id 1
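If you want to sweep several adapter learning rates automatically, a small launcher like the hypothetical script below (not shipped with this repo) can reuse the same entry point, overriding the learning rate via --cfg-options and tagging each run with a distinct --id.

```python
# Hypothetical sweep script: launches one training run per candidate adapter
# learning rate through the same torchrun command shown above.
import subprocess

CONFIG = "configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py"

for run_id, lr in enumerate([1e-4, 2e-4, 4e-4]):
    subprocess.run(
        [
            "torchrun", "--nnodes=1", "--nproc_per_node=2",
            "--rdzv_backend=c10d", "--rdzv_endpoint=localhost:0",
            "tools/train.py", CONFIG,
            "--cfg-options", f"optimizer.backbone.custom.0.lr={lr}",
            "--id", str(run_id),
        ],
        check=True,  # stop the sweep if a run fails
    )
```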
Before running the experiments, please download the EPIC-pretrained VideoMAE weights and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Checkpoints |
---|---|---|---|
VideoMAE-L | InternVideo1 | EPIC-Noun | Google Drive |
VideoMAE-L | InternVideo1 | EPIC-Verb | Google Drive |
Please refer to README.md to prepare the raw videos of EPIC-Kitchens.
Subset | Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noun | VideoMAE-Noun | 2 | AdaTAD | 768x8 | 160 | 33.88 | 32.41 | 30.58 | 27.66 | 22.67 | 29.44 | config | model | log |
Verb | VideoMAE-Verb | 2 | AdaTAD | 768x8 | 160 | 33.02 | 32.43 | 30.51 | 27.80 | 24.69 | 29.69 | config | model | log |
- To train the model on EPIC-Kitchens, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/epic/e2e_epic_videomae_l_ft_768x8_160_adapter_noun.py
Before running the experiments, please download the InternVideo1-MQ weights and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
---|---|---|---|---|
InternVideo1-MQ | InternVideo1-K700 | Ego4D-Verb + Ego4D-MQ | Url | Google Drive |
Please refer to README.md to prepare the raw videos of Ego4D-MQ.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|
InternVideo1-MQ | 2 | AdaTAD | 1800x4 | 192 | 33.69 | 31.19 | 28.37 | 26.12 | 22.67 | 28.41 | config | model | log |
- To train the model on Ego4D-MQ, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/ego4d/e2e_ego4d_internvideo_1800x4_192_adapter_lr4e-4.py
Please refer to README.md to prepare the raw videos of Multi-THUMOS.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 768 | 160 | 61.34 | 46.74 | 26.88 | 40.77 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 768 | 160 | 63.90 | 48.74 | 28.72 | 42.76 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 768 | 160 | 66.06 | 51.80 | 31.73 | 45.15 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 768 | 160 | 67.20 | 52.99 | 32.70 | 46.02 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 68.23 | 53.87 | 33.03 | 46.74 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 71.11 | 55.83 | 34.86 | 48.73 | config | model | log |
- To train the model on Multi-THUMOS, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/multi_thumos/e2e_multithumos_videomae_s_768x1_160_adapter.py
Please refer to README.md to prepare the raw videos of Charades.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 512 | 160 | 35.89 | 27.43 | 16.35 | 24.14 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 512 | 160 | 40.84 | 31.75 | 20.05 | 27.99 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 512 | 160 | 47.00 | 37.01 | 23.05 | 32.31 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 512 | 160 | 48.76 | 38.80 | 24.85 | 33.94 | config | model | log |
VideoMAEV2-g | 4 | AdaTAD | 512 | 160 | 53.72 | 42.91 | 27.69 | 37.56 | config | model | log |
- To train the model on Charades, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/charades/e2e_charades_videomae_s_512x1_160_adapter.py
@InProceedings{Liu_2024_CVPR,
author = {Liu, Shuming and Zhang, Chen-Lin and Zhao, Chen and Ghanem, Bernard},
title = {End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {18591-18601}
}