Official PyTorch implementation of the paper "Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection", accepted to the International Workshop on Deep Learning for Human-Centric Activity Understanding (DL-HAU2020) at ICPR 2020.

The paper is available on arXiv and here.
`acgcn` -- Actor-Context/Centric Graph Convolutional Networks.
- python 3.7
- pytorch 1.2.0 (CUDA 9.2, cuDNN v7.6.2)
- torchvision 0.4.0
- numpy >= 1.17
- pillow

Optionally, for visualizations:

- matplotlib
- scipy >= 1.3.0
- scikit-learn >= 0.22.2
Create a conda environment:

```
conda create -n myenv python=3.7
source activate myenv
```

For newer versions of conda (4.4 and later), use `conda` instead of `source` to activate the environment:

```
conda activate myenv
```
Clone this repository:

```
git clone https://github.com/micts/acgcn.git
```
Add the repository's directory to `$PYTHONPATH` by adding the following line to your `.bashrc` file:

```
export PYTHONPATH=/path/to/this/repo/lib:$PYTHONPATH
```

Finally, apply the changes:

```
source ~/.bashrc
```
Install requirements:

```
conda install --file requirements.txt
```

Finally, install PyTorch and torchvision:

```
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=9.2 -c pytorch
```
We use the Daily Action Localization in YouTube (DALY) dataset: http://thoth.inrialpes.fr/daly/index.php. See the instructions on how to download DALY.
**Important Note**

Several DALY videos are no longer available on YouTube, and others have a different resolution than the original videos. It is therefore recommended to download the original videos from Inria's cache, which can be accessed at http://thoth.inrialpes.fr/daly/requestaccess.php. It is also recommended to check that all videos are available in the cache by matching them against those contained in the annotations (`data/DALY/annotations/daly1.1.0.pkl`).
In order to download the videos from Inria's cache, first modify `utils/download_videos.sh` to provide the username and password required by Inria's cache. Then download the videos using:

```
cd utils/
./download_videos.sh
```

The videos will be saved in `data/DALY/DALY_videos`.
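As a sanity check after downloading, the following sketch compares the downloaded video files against the videos listed in the annotations. It assumes `daly1.1.0.pkl` is a Python 2 pickle whose per-video annotations live under an `annot` key; adjust the key and encoding if your copy differs.

```python
import os
import pickle

# Load the DALY annotations (assumed: a Python 2 pickle with an 'annot' dict
# keyed by video file name, e.g. 'SomeVideo.mp4').
with open('data/DALY/annotations/daly1.1.0.pkl', 'rb') as f:
    daly = pickle.load(f, encoding='latin1')

annotated_videos = set(daly['annot'].keys())
downloaded_videos = set(os.listdir('data/DALY/DALY_videos'))

missing = sorted(annotated_videos - downloaded_videos)
print(len(missing), 'annotated videos are missing')
for name in missing:
    print(' ', name)
```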
In case there are missing videos in the cache, you can always try to download them from YouTube; see the link to the instructions above.
Assuming that the downloaded videos are placed in `data/DALY/DALY_videos`, we extract the frames from each video using:

```
cd utils/
./extract_frames.sh ../data/DALY/DALY_videos ../data/DALY/DALY_frames
```

The above command uses `ffmpeg` to extract the frames and saves them under `data/DALY/DALY_frames`.
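For reference, frame extraction with `ffmpeg` for a single video boils down to a call like the one below. This is a sketch of the idea, not the exact contents of `extract_frames.sh`, and the `frame%06d.jpg` pattern is an assumption that matches the frame names shown in the directory layout further down.

```python
import os
import subprocess

def extract_frames(video_path, out_dir):
    """Extract all frames of one video as frame000001.jpg, frame000002.jpg, ..."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ['ffmpeg', '-i', video_path, os.path.join(out_dir, 'frame%06d.jpg')],
        check=True)

# e.g. extract_frames('data/DALY/DALY_videos/video_name_0.mp4',
#                     'data/DALY/DALY_frames/video_name_0.mp4')
```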
We resize all frames to 224x224 pixels:

```
python resize_frames.py ../data/DALY/DALY_frames ../data/DALY/frames 224 224
```
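A minimal sketch of what a resizing script such as `resize_frames.py` might look like (an illustration using Pillow, not the repository's implementation):

```python
import os
import sys
from PIL import Image

# Usage: python resize_frames.py <src_root> <dst_root> <width> <height>
src_root, dst_root = sys.argv[1], sys.argv[2]
width, height = int(sys.argv[3]), int(sys.argv[4])

for video_name in os.listdir(src_root):
    os.makedirs(os.path.join(dst_root, video_name), exist_ok=True)
    for frame_name in os.listdir(os.path.join(src_root, video_name)):
        frame = Image.open(os.path.join(src_root, video_name, frame_name))
        frame.resize((width, height), Image.BILINEAR).save(
            os.path.join(dst_root, video_name, frame_name))
```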
`data/DALY/frames` now contains the resized frames. We can remove the original frames and videos:

```
cd ..
rm -rf data/DALY/DALY_frames data/DALY/DALY_videos
```
Download all required human tubes and annotations as a .pkl file (`annotated_data.pkl`) using the following link. Place `annotated_data.pkl` in `data/DALY/annotations`.
The file structure under `data/DALY/` should be as follows:

```
DALY/
|_ annotations
| |_ all_training_videos.pkl
| |_ all_videos.pkl
| |_ annotated_data.pkl
| |_ daly1.1.0.pkl
| |_ frames_size.pkl
| |_ training_videos.pkl
| |_ validation_videos.pkl
|_ frames
| |_ video_name_0.mp4
| | |_ frame000001.jpg
| | |_ frame000002.jpg
| | |_ ...
| |_ video_name_1.mp4
| | |_ frame000001.jpg
| | |_ frame000002.jpg
| | |_ ...
```
In the following, we describe the process of training and testing the Baseline model and the GCN model. Both models use I3D (pre-trained on ImageNet and Kinetics-400) as a backbone, augmented with a RoI pooling layer for action detection. The GCN model additionally incorporates Graph Convolutional Networks to model actor-context relations in the form of actor-object and actor-actor interactions.
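To make the description above concrete, here is a conceptual sketch of one actor-context graph step: actor features are RoI-pooled from the backbone feature map, every spatial location of the map acts as a context node, and a graph convolution aggregates context into each actor. The layer sizes, the dot-product affinity, and the single-graph, single-layer setup are illustrative assumptions, not the repository's exact implementation (which supports multiple graph layers, multiple graphs, and different merge functions, as listed in the tables below).

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ActorContextGCN(nn.Module):
    """One (illustrative) actor-context graph reasoning step over a backbone feature map."""

    def __init__(self, feat_dim=832, hidden_dim=256):
        super().__init__()
        self.actor_proj = nn.Linear(feat_dim, hidden_dim)    # projects RoI-pooled actor features
        self.context_proj = nn.Linear(feat_dim, hidden_dim)  # projects per-location context features
        self.out_proj = nn.Linear(hidden_dim, feat_dim)      # maps aggregated context back to feat_dim

    def forward(self, feature_map, rois):
        # feature_map: (N, C, H, W) backbone features, e.g. temporally pooled I3D output
        # rois: (A, 5) actor boxes as (batch_index, x1, y1, x2, y2) in feature-map coordinates
        actor_feats = roi_align(feature_map, rois, output_size=(7, 7)).mean(dim=(2, 3))  # (A, C)

        N, C, H, W = feature_map.shape
        context_feats = feature_map.flatten(2).permute(0, 2, 1).reshape(-1, C)  # (N*H*W, C)

        a = self.actor_proj(actor_feats)        # (A, D) actor nodes
        c = self.context_proj(context_feats)    # (M, D) context nodes
        adj = torch.softmax(a @ c.t(), dim=-1)  # actor-to-context edge weights
        updated = self.out_proj(adj @ c)        # aggregate context into each actor node
        return actor_feats + updated            # residual update of the actor features
```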
Train a GCN model:

```
sh tools/train_gcn.sh
```

Train a Baseline model:

```
sh tools/train_baseline.sh
```

Both commands above run `train.py`. For help and a description of input arguments:

```
python tools/train.py -h
```
Every run is assigned a unique identifier/filename of the form `yyyy-mm-dd_hh-mm-ss`, e.g. `2020-03-25_23-56-26`. Results, such as the model's training history, are saved under `results/model_name/filename`, for example `results/gcn/2020-03-25_23-56-26`. This directory contains a pickle file with the content of `config.py` (the model's hyperparameters and other configuration settings) and a series of `.pth` files. A `.pth` file is of the form `epoch_x_loss.pth`, where `x` indicates the epoch number and `loss` corresponds to the epoch's validation loss, e.g. `epoch_300_1.108.pth`. For a given epoch, a `.pth` file contains the saved model weights, the epoch's training and validation loss, and evaluation metric results. These files are saved every few epochs; the interval is set by `num_epochs_to_val` in `config.py`.
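For example, a saved checkpoint can be inspected as follows. This is a minimal sketch; the exact dictionary keys depend on how `train.py` saves the checkpoint, so they are printed rather than assumed.

```python
import torch

# Path follows the naming scheme described above; substitute your own run id and epoch.
ckpt_path = 'results/gcn/2020-03-25_23-56-26/epoch_300_1.108.pth'
checkpoint = torch.load(ckpt_path, map_location='cpu')

if isinstance(checkpoint, dict):
    print(list(checkpoint.keys()))  # e.g. model weights, losses, evaluation metrics
```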
Make sure to replace the values inside the following scripts with the information of the model to be used for inference (model ID, path to model weights, path to config file, etc.). Use either a model you trained yourself (model weights and config files are located in `results/`) or one of the trained models provided below (see Results).
Inference on the test set using a GCN model:

```
sh tools/test_gcn.sh
```

Inference using a Baseline model:

```
sh tools/test_baseline.sh
```

Both commands above run `inference.py`. For help and a description of input arguments:

```
python tools/inference.py -h
```
We provide trained models (model weights and config files) for GCN and Baseline: 5 models each, trained with the same hyperparameters but different random seeds. The models and results correspond to those reported in the paper. The average Test Video-mAP (± std) across the five repetitions is 61.82 (± 0.51) for GCN and 59.58 (± 0.22) for Baseline.
GCN models:

| Model ID | Num Layers | Num Graphs | Merge Function | Test Video-mAP | Model and Config |
|---|---|---|---|---|---|
| 2020-03-22_12-50-27 | 2 | 2 | Concat | 62.73 | Weights, Config |
| 2020-03-24_15-15-49 | 2 | 2 | Concat | 61.14 | Weights, Config |
| 2020-03-24_15-13-25 | 2 | 2 | Concat | 61.82 | Weights, Config |
| 2020-03-25_23-56-26 | 2 | 2 | Concat | 61.80 | Weights, Config |
| 2020-03-26_16-04-26 | 2 | 2 | Concat | 61.64 | Weights, Config |
Baseline models:

| Model ID | Test Video-mAP | Model and Config |
|---|---|---|
| 2020-03-07_21-32-26 | 59.54 | Weights, Config |
| 2020-03-08_17-09-39 | 59.17 | Weights, Config |
| 2020-03-13_03-46-05 | 59.79 | Weights, Config |
| 2020-03-07_20-22-15 | 59.77 | Weights, Config |
| 2020-03-08_07-39-52 | 59.67 | Weights, Config |
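The averages quoted above can be reproduced (up to rounding of the table entries) from the per-run scores, here using the population standard deviation:

```python
import statistics

gcn = [62.73, 61.14, 61.82, 61.80, 61.64]
baseline = [59.54, 59.17, 59.79, 59.77, 59.67]

for name, scores in [('GCN', gcn), ('Baseline', baseline)]:
    print(name, round(statistics.mean(scores), 2), '+/-', round(statistics.pstdev(scores), 2))

# Prints roughly 61.83 +/- 0.51 for GCN and 59.59 +/- 0.23 for Baseline; the small
# differences from the reported averages come from the table entries being rounded
# to two decimals.
```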
We would like to thank Philippe Weinzaepfel for providing us with the predicted human tubes of their tracking-by-detection model.