Learning Salient Boundary Feature for Anchor-free Temporal Action Localization
Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu
Temporal action localization is an important yet challenging task in video understanding. Typically, such a task aims at inferring both the action category and the start and end frames of each action instance in a long, untrimmed video. While most current models achieve good results by using pre-defined anchors and numerous actionness scores, such methods are burdened with a large number of outputs and heavy tuning of the locations and sizes of the anchors. In contrast, anchor-free methods are lighter, getting rid of redundant hyper-parameters, but have received little attention. In this paper, we propose the first purely anchor-free temporal localization method, which is both efficient and effective. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module that gathers more valuable boundary features for each proposal with a novel boundary pooling, and (iii) several consistency constraints that ensure our model can find the accurate boundary given arbitrary proposals. Extensive experiments show that our method beats all anchor-based and actionness-guided methods by a remarkable margin on THUMOS14, achieving state-of-the-art results, and obtains comparable ones on ActivityNet v1.3.
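The abstract only summarizes the saliency-based refinement idea. As a rough, hypothetical illustration (not the authors' implementation), boundary pooling can be thought of as max-pooling the temporal features inside short windows around a proposal's start and end; the function name, window ratio, and shapes below are illustrative assumptions only.

```python
import torch

def boundary_pool(feats: torch.Tensor, start: int, end: int, ratio: float = 0.1) -> torch.Tensor:
    """Hypothetical boundary max-pooling sketch (illustrative, not the official AFSD code).

    feats: (C, T) temporal feature sequence; start/end: proposal boundaries in snippet indices.
    Returns a (2C,) vector concatenating salient features around the start and end boundaries.
    """
    T = feats.size(1)
    win = max(1, int((end - start) * ratio))  # short window around each boundary
    s_lo, s_hi = max(0, start - win), min(T, start + win)
    e_lo, e_hi = max(0, end - win), min(T, end + win)
    start_feat = feats[:, s_lo:s_hi].max(dim=1).values  # most salient feature near the start
    end_feat = feats[:, e_lo:e_hi].max(dim=1).values    # most salient feature near the end
    return torch.cat([start_feat, end_feat], dim=0)

# Example: 512-channel features over 96 snippets, proposal spanning snippets [20, 60)
refined = boundary_pool(torch.randn(512, 96), start=20, end=60)  # shape: (1024,)
```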
ActivityNet-1.3 with CUHK classifier.
E2E | Setting | GPUs | mAP@0.5 | mAP@0.75 | mAP@0.95 | Average mAP | Config | Download |
---|---|---|---|---|---|---|---|---|
False | Feature-TSP | 1 | 54.44 | 36.72 | 8.69 | 36.10 | config | model \| log |
True | I3D-R50-768x96x96 | 4 | 52.77 | 35.01 | 7.74 | 34.57 | config | model \| log |
THUMOS14
E2E | Setting | GPUs | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | Average mAP | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
False | Feature-I3D | 1 | 73.20 | 68.45 | 60.16 | 46.74 | 31.24 | 55.96 | config | model \| log |
True | I3D-R50-256x96x96 | 1 | 53.88 | 48.81 | 41.36 | 31.70 | 21.03 | 39.36 | config | model \| log |
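On THUMOS14, the Average mAP column is the mean over the five tIoU thresholds 0.3:0.1:0.7 listed above; for example, for the Feature-I3D row, (73.20 + 68.45 + 60.16 + 46.74 + 31.24) / 5 = 55.96. On ActivityNet-1.3, the average is taken over tIoU thresholds 0.5:0.05:0.95, so it cannot be recomputed from the three thresholds shown in that table.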
You can use the following command to train a model.
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train AFSD on the ActivityNet dataset.
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/afsd/anet_tsp.py
For more details, refer to the Training section of the Usage documentation.
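The --nproc_per_node flag controls how many training processes torchrun launches (one per GPU in the usual distributed setup), and should match the GPUs column of the result tables above. For example, assuming the training script follows that one-process-per-GPU convention, a 4-GPU run looks like:
torchrun --nnodes=1 --nproc_per_node=4 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py ${CONFIG_FILE}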
You can use the following command to test a model.
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py ${CONFIG_FILE} --checkpoint ${CHECKPOINT_FILE} [optional arguments]
Example: test AFSD on the ActivityNet dataset.
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/afsd/anet_tsp.py --checkpoint exps/anet/afsd_tsp_96/gpu1_id0/checkpoint/epoch_9.pth
For more details, refer to the Test section of the Usage documentation.
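The same command applies to THUMOS14 by pointing to the corresponding config and checkpoint; the config path below is a hypothetical placeholder, so substitute the actual THUMOS14 config from your checkout.
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/afsd/thumos_i3d.py --checkpoint ${CHECKPOINT_FILE}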
@InProceedings{Lin_2021_CVPR,
author = {Lin, Chuming and Xu, Chengming and Luo, Donghao and Wang, Yabiao and Tai, Ying and Wang, Chengjie and Li, Jilin and Huang, Feiyue and Fu, Yanwei},
title = {Learning Salient Boundary Feature for Anchor-free Temporal Action Localization},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021},
pages = {3320-3329}
}