End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem
Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scale and limited data volume can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption of end-to-end training and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significantly improved detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, making ours the first end-to-end model to outperform the best feature-based methods.
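To make the adapter idea concrete, below is a minimal PyTorch sketch of an adapter-style temporal module: it down-projects the channel dimension, aggregates context from adjacent frames with a depthwise temporal convolution, and adds the result back residually, so only the adapter parameters require gradients while the backbone stays frozen. The class name, bottleneck width, and kernel size are illustrative assumptions, not the exact TIA implementation in this repo.

```python
# Illustrative sketch of an adapter-style temporal module (not the actual TIA code).
import torch
import torch.nn as nn

class TemporalAdapterSketch(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(
            bottleneck, bottleneck, kernel_size,
            padding=kernel_size // 2, groups=bottleneck,  # depthwise over time
        )
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) features from a frozen backbone block
        h = self.act(self.down(x))            # (B, T, bottleneck)
        h = h.transpose(1, 2)                 # (B, bottleneck, T) for Conv1d
        h = self.temporal(h).transpose(1, 2)  # aggregate adjacent-frame context
        return x + self.up(h)                 # residual: backbone output is preserved

# Usage: insert after each (frozen) backbone block and train only the adapter parameters.
x = torch.randn(2, 16, 384)                  # e.g. 16 temporal tokens with dim 384
print(TemporalAdapterSketch(384)(x).shape)   # torch.Size([2, 16, 384])
```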
Before running the experiments, please download the pretrained VideoMAE model weights (converted from the original repo) and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
---|---|---|---|---|
VideoMAE-S | K400 | K400 | Url | Google Drive |
VideoMAE-B | K400 | K400 | Url | mmaction2 |
VideoMAE-L | K400 | K400 | Url | mmaction2 |
VideoMAE-H | K400 | K400 | Url | Google Drive |
VideoMAEv2-g | Hybrid | K710 | Url | Not Provided |
- Note that we are not allowed to redistribute VideoMAEv2's checkpoints. You can fill out the official request form and then convert the downloaded checkpoint with the following command; a generic sketch of the conversion step follows.
python tools/model_converters/convert_videomaev2.py \
vit_g_hybrid_pt_1200e_k710_ft.pth pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth
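For reference, the sketch below shows what such a conversion generally looks like: load the released checkpoint, unwrap and rename the state-dict keys, and save the result under the filename expected by the configs. The key renaming here is only an illustrative assumption; the authoritative logic is in tools/model_converters/convert_videomaev2.py.

```python
# Generic illustration of a checkpoint conversion step (assumed key mapping, for
# illustration only; use tools/model_converters/convert_videomaev2.py in practice).
import torch

src = "vit_g_hybrid_pt_1200e_k710_ft.pth"
dst = "pretrained/vit-giant-p14_videomaev2-hybrid_pt_1200e_k710_ft_my.pth"

ckpt = torch.load(src, map_location="cpu")
state_dict = ckpt.get("module", ckpt.get("model", ckpt))  # unwrap common containers

converted = {}
for name, tensor in state_dict.items():
    # Illustrative rename only: drop a leading "encoder." prefix if present.
    converted[name.removeprefix("encoder.")] = tensor

torch.save({"state_dict": converted}, dst)
```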
Please refer to README.md to prepare the raw videos of ActivityNet.
Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 4 | AdaTAD | 768 | 160 | CUHK | 56.23 | 38.90 | 8.88 | 37.81 | config | model | log |
VideoMAE-B | 4 | AdaTAD | 768 | 160 | CUHK | 56.72 | 39.44 | 9.54 | 38.35 | config | model | log |
VideoMAE-L | 4 | AdaTAD | 768 | 160 | CUHK | 57.73 | 40.53 | 9.96 | 39.21 | config | model | log |
VideoMAE-H | 4 | AdaTAD | 768 | 160 | CUHK | 57.77 | 40.60 | 9.78 | 39.31 | config | model | log |
VideoMAEV2-g | 4 | AdaTAD | 768 | 160 | CUHK | 58.42 | 40.89 | 10.01 | 39.77 | config | model | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | CUHK | 58.57 | 41.19 | 10.27 | 39.86 | config | model | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo | 61.74 | 43.17 | 10.68 | 41.85 | config | log |
VideoMAEV2-g | 8 | AdaTAD | 768 | 224 | InternVideo2 | 63.59 | 44.31 | 10.66 | 42.90 | config | log |
- To train the model on ActivityNet, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/anet/e2e_anet_videomae_s_192x4_160_adapter.py
- To use the same checkpoint but test with another external classifier, you can run the following command; a sketch of the usual score-fusion step follows.
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/adatad/anet/e2e_anet_videomaev2_g_192x4_224_adapter_internvideo2.py --checkpoint epoch_10_cba1017a.pth
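As background, ActivityNet-style evaluation commonly combines class-agnostic proposal scores with video-level scores from an external classifier (e.g., CUHK or InternVideo). The snippet below is a sketch of that standard fusion recipe, not necessarily this repo's exact post-processing: each proposal keeps its boundaries, and its confidence is multiplied by the top video-level class scores to produce labeled detections.

```python
# Common score-fusion recipe for ActivityNet-style evaluation (sketch of the usual
# practice; the repo's post-processing may differ in details).
import numpy as np

def fuse_with_external_classifier(proposals, class_scores, top_k=2):
    """proposals: list of (start, end, confidence); class_scores: (num_classes,) array."""
    top_classes = np.argsort(class_scores)[::-1][:top_k]
    detections = []
    for start, end, conf in proposals:
        for cls in top_classes:
            # Label each proposal with a top class; combine the two confidences.
            detections.append((start, end, int(cls), conf * float(class_scores[cls])))
    return detections

# Toy usage with a hypothetical 200-class score vector for one video.
scores = np.random.rand(200)
scores /= scores.sum()
print(fuse_with_external_classifier([(3.2, 10.5, 0.9)], scores))
```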
[NEW] We provide the following checkpoints, which do not require an external classifier but instead train a 200-class classification head directly, for convenient zero-shot inference.
Backbone | GPUs | Setting | Frames | Img Size | Classifier | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-L | 4 | AdaTAD | 768 | 224 | x | 59.00 | 39.96 | 9.15 | 39.15 | config | model | log |
Please refer to README.md to prepare the raw videos of THUMOS.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 768 | 160 | 83.90 | 79.01 | 72.38 | 61.57 | 48.27 | 69.03 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 768 | 160 | 85.95 | 81.86 | 75.02 | 63.29 | 49.56 | 71.14 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 768 | 160 | 87.17 | 83.58 | 76.88 | 66.81 | 53.13 | 73.51 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 768 | 160 | 88.42 | 84.63 | 78.72 | 69.04 | 53.95 | 74.95 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 88.63 | 85.39 | 79.17 | 68.34 | 53.79 | 75.06 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 89.93 | 86.83 | 81.24 | 69.97 | 57.36 | 77.07 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD† | 1536 | 224 | 88.43 | 84.72 | 77.88 | 68.51 | 53.72 | 74.65 | config | model | log |
- To train the model on THUMOS, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py
- To search the adapter's learning rate or change other hyper-parameters, you can run the following command; a minimal sweep launcher is sketched after it.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py \
--cfg-options optimizer.backbone.custom.0.lr=1e-4 --id 1
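If you want to sweep several adapter learning rates automatically, a small launcher like the hypothetical script below (not shipped with this repo) can reuse the same entry point, overriding the learning rate via --cfg-options and tagging each run with a distinct --id.

```python
# Hypothetical sweep script: launches one training run per candidate adapter
# learning rate through the same torchrun command shown above.
import subprocess

CONFIG = "configs/adatad/thumos/e2e_thumos_videomae_s_768x1_160_adapter.py"

for run_id, lr in enumerate([1e-4, 2e-4, 4e-4]):
    subprocess.run(
        [
            "torchrun", "--nnodes=1", "--nproc_per_node=2",
            "--rdzv_backend=c10d", "--rdzv_endpoint=localhost:0",
            "tools/train.py", CONFIG,
            "--cfg-options", f"optimizer.backbone.custom.0.lr={lr}",
            "--id", str(run_id),
        ],
        check=True,  # stop the sweep if a run fails
    )
```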
Before running the experiments, please download the EPIC-pretrained VideoMAE weights and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Checkpoints |
---|---|---|---|
VideoMAE-L | InternVideo1 | EPIC-Noun | Google Drive |
VideoMAE-L | InternVideo1 | EPIC-Verb | Google Drive |
Please refer to README.md to prepare the raw videos of EPIC-Kitchens.
Subset | Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noun | VideoMAE-Noun | 2 | AdaTAD | 768x8 | 160 | 33.88 | 32.41 | 30.58 | 27.66 | 22.67 | 29.44 | config | model | log |
Verb | VideoMAE-Verb | 2 | AdaTAD | 768x8 | 160 | 33.02 | 32.43 | 30.51 | 27.80 | 24.69 | 29.69 | config | model | log |
- To train the model on EPIC-Kitchens, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/epic/e2e_epic_videomae_l_ft_768x8_160_adapter_noun.py
Before running the experiments, please download the InternVideo1-MQ weights and put them under the path ./pretrained/.
Model | Pretrain Dataset | Finetune Dataset | Original Link | Converted Checkpoints |
---|---|---|---|---|
InternVideo1-MQ | InternVideo1-K700 | Ego4D-Verb + Ego4D-MQ | Url | Google Drive |
Please refer to README.md to prepare the raw videos of Ego4D-MQ.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | ave. mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|---|---|
InternVideo1-MQ | 2 | AdaTAD | 1800x4 | 192 | 33.69 | 31.19 | 28.37 | 26.12 | 22.67 | 28.41 | config | model | log |
- To train the model on Ego4D-MQ, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/ego4d/e2e_ego4d_internvideo_1800x4_192_adapter_lr4e-4.py
Please refer to README.md to prepare the raw videos of Multi-THUMOS.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 768 | 160 | 61.34 | 46.74 | 26.88 | 40.77 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 768 | 160 | 63.90 | 48.74 | 28.72 | 42.76 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 768 | 160 | 66.06 | 51.80 | 31.73 | 45.15 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 768 | 160 | 67.20 | 52.99 | 32.70 | 46.02 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 768 | 160 | 68.23 | 53.87 | 33.03 | 46.74 | config | model | log |
VideoMAEV2-g | 2 | AdaTAD | 1536 | 224 | 71.11 | 55.83 | 34.86 | 48.73 | config | model | log |
- To train the model on Multi-THUMOS, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/multi_thumos/e2e_multithumos_videomae_s_768x1_160_adapter.py
Please refer to README.md to prepare the raw videos of Charades.
Backbone | GPUs | Setting | Frames | Img Size | mAP@0.2 | mAP@0.5 | mAP@0.7 | ave. mAP (0.1:0.9:0.1) | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
VideoMAE-S | 2 | AdaTAD | 512 | 160 | 35.89 | 27.43 | 16.35 | 24.14 | config | model | log |
VideoMAE-B | 2 | AdaTAD | 512 | 160 | 40.84 | 31.75 | 20.05 | 27.99 | config | model | log |
VideoMAE-L | 2 | AdaTAD | 512 | 160 | 47.00 | 37.01 | 23.05 | 32.31 | config | model | log |
VideoMAE-H | 2 | AdaTAD | 512 | 160 | 48.76 | 38.80 | 24.85 | 33.94 | config | model | log |
VideoMAEV2-g | 4 | AdaTAD | 512 | 160 | 53.72 | 42.91 | 27.69 | 37.56 | config | model | log |
- To train the model on Charades, you can run the following command.
torchrun --nnodes=1 --nproc_per_node=2 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/adatad/charades/e2e_charades_videomae_s_512x1_160_adapter.py
@InProceedings{Liu_2024_CVPR,
author = {Liu, Shuming and Zhang, Chen-Lin and Zhao, Chen and Ghanem, Bernard},
title = {End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {18591-18601}
}