Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
BMVC 2022
Requirements: Python 3.7 and CUDA 11.1. Install the Python dependencies with
pip install -r requirements.txt
Checkpoints of pre-training (trm), ActivityNet-QA (anetqa), and AGQA (agqa) can be downloaded here.
Image-language pre-training weights are from ALBEF.
Input data is organized as follows:
data/
├── vatex/
│ ├── train.json
│ ├── val.json
│ └── vatex.h5
├── tgif/
│ ├── train.json
│ ├── val.json
│ └── tgif.h5
├── anetqa/
│ ├── train.csv
│ ├── val.csv
│ ├── test.csv
│ ├── vocab.json
│ ├── frames/
│ └── anetqa.h5
└── agqa/
├── train.json
├── val.json
├── test.json
├── vocab.json
├── frames/
└── agqa.h5
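A quick way to sanity-check this layout before training is a short script like the following (a sketch; the file list simply mirrors the tree above, so trim it if you only use a subset of the datasets):

# Sketch: verify that the expected annotation/feature files from the tree above are in place.
from pathlib import Path

EXPECTED = {
    "vatex":  ["train.json", "val.json", "vatex.h5"],
    "tgif":   ["train.json", "val.json", "tgif.h5"],
    "anetqa": ["train.csv", "val.csv", "test.csv", "vocab.json", "frames", "anetqa.h5"],
    "agqa":   ["train.json", "val.json", "test.json", "vocab.json", "frames", "agqa.h5"],
}

root = Path("data")
for dataset, entries in EXPECTED.items():
    for name in entries:
        path = root / dataset / name
        print(f"{'ok' if path.exists() else 'MISSING':8s} {path}")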
For AGQA, train_balanced.txt and test_balanced.txt are renamed to train.json and test.json. We randomly sample 10% of the data in train.json for validation and store it as val.json.
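A minimal sketch of such a split (assuming train.json is a JSON list of QA records; the record schema and the random seed are assumptions, not the exact split used here):

# Sketch: sample 10% of the AGQA training annotations into val.json.
# Assumes train.json is a JSON list of QA records; the seed is arbitrary.
import json
import random

random.seed(0)
with open("data/agqa/train.json") as f:
    records = json.load(f)

val = random.sample(records, k=len(records) // 10)
with open("data/agqa/val.json", "w") as f:
    json.dump(val, f)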
Following Just-Ask, we remove rare answers from ActivityNet-QA and AGQA:
python preproc/preproc_anetqa.py -i INPUT_DIR [-o OUTPUT_DIR]
python preproc/preproc_agqa.py -i INPUT_DIR [-o OUTPUT_DIR]
INPUT_DIR contains the annotation files. OUTPUT_DIR defaults to INPUT_DIR if not specified.
For example,
python preproc/preproc_anetqa.py -i activitynet-qa/dataset -o data/anetqa
We extract frames at 3 FPS with FFmpeg:
ffmpeg -i VIDEO_PATH -vf fps=3 VIDEO_ID/%04d.png
The frames of each video are collected in a directory named after the video ID. For example, the content of data/anetqa/frames/ looks like:
data/anetqa/frames
├── v_PLqTX6ij52U/
│ ├── 0001.png
│ ├── 0002.png
│ ├── ...
├── v_d_A-ylxNbFU/
│ ├── 0001.png
│ ├── ...
├── ...
To process all videos in parallel with GNU parallel:
find VIDEO_DIR -type f | parallel -j8 "mkdir frames/{/.} && ffmpeg -i {} -vf fps=3 frames/{/.}/%04d.png"
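If GNU parallel is not available, the same loop can be driven from Python with a process pool (a sketch; VIDEO_DIR and frames/ are placeholders to adjust to your layout):

# Sketch: extract frames at 3 FPS for every video under VIDEO_DIR using a process
# pool, mirroring the parallel one-liner above.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

VIDEO_DIR = Path("VIDEO_DIR")   # placeholder: directory of raw videos
FRAME_DIR = Path("frames")      # placeholder: output root for extracted frames

def extract(video: Path) -> None:
    out_dir = FRAME_DIR / video.stem          # e.g. frames/v_PLqTX6ij52U/
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-vf", "fps=3", str(out_dir / "%04d.png")],
        check=True,
    )

if __name__ == "__main__":
    videos = [p for p in VIDEO_DIR.iterdir() if p.is_file()]
    with ProcessPoolExecutor(max_workers=8) as pool:
        list(pool.map(extract, videos))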
We extract video features with Video Swin Transformer (Swin-B pre-trained on Kinetics-600). Please see the detailed instructions here.
The features are gathered in an HDF5 file. Please rename it and put it under the corresponding dataset directory (e.g., data/anetqa/anetqa.h5).
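To sanity-check the feature file, something like the following can be used (a sketch; it assumes one HDF5 dataset per video ID, which may differ from the layout produced by your extraction step):

# Sketch: list a few entries of the extracted feature file with h5py,
# assuming one dataset per video ID of shape [num_frames, feat_dim].
import h5py

with h5py.File("data/anetqa/anetqa.h5", "r") as f:
    print(f"{len(f.keys())} videos")
    for vid in list(f.keys())[:3]:
        print(vid, f[vid].shape, f[vid].dtype)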
Pre-training (trm):

# Training
python run.py with \
data_root=data \
num_gpus=1 \
num_nodes=1 \
task_pretrain_trm \
per_gpu_batchsize=32 \
load_path=/path/to/vqa.pth
# Inference
python run.py with \
data_root=data \
num_gpus=1 \
num_nodes=1 \
task_pretrain_trm \
per_gpu_batchsize=32 \
load_path=/path/to/trm.ckpt \
test_only=True
Fine-tuning on ActivityNet-QA:

# Training
python run.py with \
data_root=data/anetqa \
num_gpus=1 \
num_nodes=1 \
load_path=/path/to/trm.ckpt \
task_finetune_anetqa \
per_gpu_batchsize=8
# Inference
python run.py with \
data_root=data/anetqa \
num_gpus=1 \
num_nodes=1 \
load_path=/path/to/anetqa.ckpt \
task_finetune_anetqa \
per_gpu_batchsize=16 \
test_only=True
Fine-tuning on AGQA:

# Training
python run.py with \
data_root=data/agqa \
num_gpus=1 \
num_nodes=1 \
load_path=/path/to/trm.ckpt \
task_finetune_agqa \
per_gpu_batchsize=16
# Inference
python run.py with \
data_root=data/agqa \
num_gpus=1 \
num_nodes=1 \
load_path=/path/to/agqa.ckpt \
task_finetune_agqa \
per_gpu_batchsize=32 \
test_only=True
Evaluation:

python eval.py {anet,agqa} PREDICTION GROUNDTRUTH
For example,
python eval.py anet result/anetqa_by_anetqa.json data/anetqa/test.csv
python eval.py agqa result/agqa_by_agqa.json data/agqa/test.json
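For reference, the core of such an evaluation is per-question accuracy. A minimal sketch for the AGQA case, assuming hypothetically that the prediction file maps question IDs to answer strings and that test.json maps question IDs to records with an "answer" field (the actual formats are defined by eval.py):

# Sketch: per-question accuracy, NOT eval.py itself. Both file schemas below are
# assumptions; adapt the two loaders to the actual prediction/ground-truth formats.
import json
import sys

def load_predictions(path):
    with open(path) as f:
        return json.load(f)  # assumed: {question_id: predicted_answer}

def load_groundtruth(path):
    with open(path) as f:
        data = json.load(f)
    return {qid: entry["answer"] for qid, entry in data.items()}  # assumed schema

if __name__ == "__main__":
    pred = load_predictions(sys.argv[1])
    gt = load_groundtruth(sys.argv[2])
    correct = sum(1 for qid, ans in gt.items() if pred.get(qid) == ans)
    print(f"accuracy: {correct / len(gt):.4f}")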
To train the mean-pooling variants:

bash scripts/train_anetqa_mean.sh
bash scripts/train_agqa_mean.sh
If you find this work useful, please cite:

@inproceedings{Lee_2022_BMVC,
author = {Hsin-Ying Lee and Hung-Ting Su and Bing-Chen Tsai and Tsung-Han Wu and Jia-Fong Yeh and Winston H. Hsu},
title = {Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year = {2022},
url = {https://bmvc2022.mpi-inf.mpg.de/0116.pdf}
}