G-TAD: Sub-Graph Localization for Temporal Action Detection
Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, Bernard Ghanem
Temporal action detection is a fundamental yet challenging task in video understanding. Video context is a critical cue to effectively detect actions, but current works mainly focus on temporal context, while neglecting semantic context as well as other important context properties. In this work, we propose a graph convolutional network (GCN) model to adaptively incorporate multi-level semantic context into video features and cast temporal action detection as a sub-graph localization problem. Specifically, we formulate video snippets as graph nodes, snippet-snippet correlations as edges, and actions associated with context as target sub-graphs. With graph convolution as the basic operation, we design a GCN block called GCNeXt, which learns the features of each node by aggregating its context and dynamically updates the edges in the graph. To localize each sub-graph, we also design an SGAlign layer to embed each sub-graph into the Euclidean space. Extensive experiments show that G-TAD is capable of finding effective video context without extra supervision and achieves state-of-the-art performance on two detection benchmarks. On ActivityNet-1.3, it obtains an average mAP of 34.09%; on THUMOS14, it reaches 51.6% at IoU@0.5 when combined with a proposal processing method.
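To give a rough feel for the GCNeXt idea described above (snippets as nodes, edges rebuilt dynamically from feature similarity, features updated by aggregating neighbors), here is a minimal PyTorch sketch. It is not the authors' implementation, which additionally uses separate temporal and semantic graph streams with grouped convolutions; all names below (`GCNeXtSketch`, `k`, etc.) are hypothetical.

```python
import torch
import torch.nn as nn


class GCNeXtSketch(nn.Module):
    """Toy edge-conv block: nodes = video snippets, edges = feature-space k-NN."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        # 1x1 conv over (center, neighbor - center) edge features
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, num_snippets)
        b, c, t = x.shape
        # Dynamically rebuild edges: k nearest neighbors in feature space.
        dist = torch.cdist(x.transpose(1, 2), x.transpose(1, 2))        # (b, t, t)
        idx = dist.topk(self.k + 1, largest=False).indices[..., 1:]     # drop self-edge
        # Gather neighbor features -> (b, c, t, k).
        src = x.unsqueeze(2).expand(b, c, t, t)  # src[..., i, j] = x[..., j]
        neighbors = torch.gather(src, 3, idx.unsqueeze(1).expand(b, c, t, self.k))
        # Aggregate (center, neighbor - center) edge features, max over neighbors.
        center = x.unsqueeze(3).expand_as(neighbors)
        edge_feat = torch.cat([center, neighbors - center], dim=1)      # (b, 2c, t, k)
        return x + self.mlp(edge_feat).max(dim=3).values                # residual output


# Usage: 2 videos, 256-d snippet features, 100 snippets each.
block = GCNeXtSketch(channels=256)
out = block(torch.randn(2, 256, 100))
print(out.shape)  # torch.Size([2, 256, 100])
```

Rebuilding the k-NN edges from the current features on every forward pass is what lets such a block aggregate semantic context from temporally distant but feature-similar snippets, rather than from temporal neighbors only.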
Results on ActivityNet-1.3 with the CUHK classifier.
| Features | mAP@0.5 | mAP@0.75 | mAP@0.95 | ave. mAP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- |
| TSN | 50.19 | 35.04 | 8.11 | 34.18 | config | model \| log |
| TSP | 52.33 | 37.58 | 8.42 | 36.20 | config | model \| log |
Results on THUMOS-14 with the UntrimmedNet classifier.
| Features | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | ave. mAP | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TSN | 60.52 | 55.08 | 48.50 | 39.60 | 28.74 | 46.49 | config | model \| log |
| I3D | 63.35 | 59.07 | 51.76 | 42.65 | 31.66 | 49.70 | config | model \| log |
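For reference, the ave. mAP column in the THUMOS-14 table is the arithmetic mean of the five per-threshold values in each row, which a few lines of Python confirm:

```python
# ave. mAP on THUMOS-14 = mean of mAP@{0.3, 0.4, 0.5, 0.6, 0.7}
tsn = [60.52, 55.08, 48.50, 39.60, 28.74]
i3d = [63.35, 59.07, 51.76, 42.65, 31.66]
print(round(sum(tsn) / len(tsn), 2))  # 46.49, matches the TSN row
print(round(sum(i3d) / len(i3d), 2))  # 49.70, matches the I3D row
```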
You can use the following command to train a model.
```shell
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py ${CONFIG_FILE} [optional arguments]
```
Example: train G-TAD on the ActivityNet dataset.

```shell
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/train.py configs/gtad/anet_tsp.py
```
For more details, refer to the Training section of the Usage documentation.
You can use the following command to test a model.
```shell
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py ${CONFIG_FILE} --checkpoint ${CHECKPOINT_FILE} [optional arguments]
```
Example: test G-TAD on the ActivityNet dataset.

```shell
torchrun --nnodes=1 --nproc_per_node=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:0 tools/test.py configs/gtad/anet_tsp.py --checkpoint exps/anet/gtad_tsp_100x100/gpu1_id0/checkpoint/epoch_7.pth
```
For more details, refer to the Test section of the Usage documentation.
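If you want to verify that a checkpoint file is intact before testing, the following is a minimal sketch, assuming the `.pth` file is a standard PyTorch checkpoint dict (the exact keys depend on this codebase and may differ):

```python
import torch

# Load on CPU so no GPU is needed just to inspect the file.
ckpt = torch.load(
    "exps/anet/gtad_tsp_100x100/gpu1_id0/checkpoint/epoch_7.pth",
    map_location="cpu")
print(list(ckpt.keys()))  # top-level keys, e.g. a state dict plus metadata
```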
```BibTeX
@InProceedings{xu2020gtad,
  author    = {Xu, Mengmeng and Zhao, Chen and Rojas, David S. and Thabet, Ali and Ghanem, Bernard},
  title     = {G-TAD: Sub-Graph Localization for Temporal Action Detection},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2020}
}
```