
Learning Fine-Grained Visual Understanding
for Video Question Answering
via Decoupling Spatial-Temporal Modeling

BMVC 2022 Spotlight

Project Page | arXiv

Installation

The code is developed with Python 3.7 and CUDA 11.1.

pip install -r requirements.txt

Checkpoints

Checkpoints of pre-training (trm), ActivityNet-QA (anetqa), and AGQA (agqa) can be downloaded here.
Image-language pre-training weights are from ALBEF.

Preprocess

Input data is organized as follows:

data/
├── vatex/
│   ├── train.json
│   ├── val.json
│   └── vatex.h5
├── tgif/
│   ├── train.json
│   ├── val.json
│   └── tgif.h5
├── anetqa/
│   ├── train.csv
│   ├── val.csv
│   ├── test.csv
│   ├── vocab.json
│   ├── frames/
│   └── anetqa.h5
└── agqa/
    ├── train.json
    ├── val.json
    ├── test.json
    ├── vocab.json
    ├── frames/
    └── agqa.h5

For AGQA, train_balanced.txt and test_balanced.txt are renamed to train.json and test.json.
We randomly sample 10% of the data from train.json as the validation set, saved as val.json.
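
A minimal sketch of this 10% split is given below. It assumes train.json holds a flat JSON list of QA records and uses a fixed random seed; neither detail is specified by the repository.

# Hedged sketch of the 10% validation split for AGQA.
# Assumes data/agqa/train.json is a flat JSON list of QA records; whether the
# sampled examples are also removed from train.json is not specified here.
import json
import random

random.seed(0)  # fixed seed only for reproducibility of this sketch

with open("data/agqa/train.json") as f:
    records = json.load(f)

random.shuffle(records)
n_val = len(records) // 10  # hold out 10% for validation

with open("data/agqa/val.json", "w") as f:
    json.dump(records[:n_val], f)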

Annotations

Following Just-Ask, we remove rare answers from ActivityNet-QA and AGQA.

python preproc/preproc_anetqa.py -i INPUT_DIR [-o OUTPUT_DIR]
python preproc/preproc_agqa.py -i INPUT_DIR [-o OUTPUT_DIR]

INPUT_DIR contains the annotation files. OUTPUT_DIR defaults to INPUT_DIR if not specified.
For example,

python preproc/preproc_anetqa.py -i activitynet-qa/dataset -o data/anetqa
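
For intuition, the sketch below illustrates the kind of rare-answer filtering these scripts perform. The count threshold, input path, and field names are assumptions, not the values used by preproc_anetqa.py.

# Illustrative rare-answer filter (the real logic lives in preproc/).
# MIN_COUNT, the input path, and the "answer" field name are assumptions.
import json
from collections import Counter

MIN_COUNT = 2  # hypothetical cutoff: keep answers that appear at least twice

with open("activitynet-qa/dataset/train_a.json") as f:  # hypothetical path
    answers = [item["answer"] for item in json.load(f)]

counts = Counter(answers)
vocab = sorted(ans for ans, c in counts.items() if c >= MIN_COUNT)

with open("data/anetqa/vocab.json", "w") as f:
    json.dump(vocab, f)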

Frames

We extract frames at 3 FPS with FFmpeg,

ffmpeg -i VIDEO_PATH -vf fps=3 VIDEO_ID/%04d.png

The frames of each video are collected in a separate directory.
For example, the content of data/anetqa/frames/ is

data/anetqa/frames
├── v_PLqTX6ij52U/
│   ├── 0001.png
│   ├── 0002.png
│   ├── ...
├── v_d_A-ylxNbFU/
│   ├── 0001.png
│   ├── ...
├── ...

To process all videos in parallel,

find VIDEO_DIR -type f | parallel -j8 "mkdir frames/{/.} && ffmpeg -i {} -vf fps=3 frames/{/.}/%04d.png"
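
If GNU parallel is unavailable, a small Python driver achieves the same result. The directory names and worker count below are placeholders.

# Python alternative to the GNU parallel one-liner: extract frames at 3 FPS
# for every video under VIDEO_DIR. Paths and worker count are placeholders.
import subprocess
from multiprocessing import Pool
from pathlib import Path

VIDEO_DIR = Path("videos")   # hypothetical input directory of video files
FRAME_DIR = Path("frames")   # one subdirectory of frames per video

def extract(video: Path) -> None:
    out = FRAME_DIR / video.stem
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video), "-vf", "fps=3", str(out / "%04d.png")],
        check=True,
    )

if __name__ == "__main__":
    with Pool(8) as pool:
        pool.map(extract, sorted(p for p in VIDEO_DIR.iterdir() if p.is_file()))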

Video Features

We extract video features with Video Swin Transformer (Swin-B pre-trained on Kinetics-600).
Please see the detailed instructions here.
The features are gathered in an HDF5 file. Please rename it (e.g., anetqa.h5) and place it under the corresponding dataset directory.
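
A quick way to verify the feature file before training is to inspect it with h5py. The sketch below assumes one dataset per video ID, which is an assumption about the extractor's output layout.

# Sanity check of the feature file. The internal layout (one dataset keyed by
# video ID) is an assumption; adjust if features are stored differently.
import h5py

with h5py.File("data/anetqa/anetqa.h5", "r") as f:
    video_ids = list(f.keys())
    print(len(video_ids), "videos")
    print("example:", video_ids[0], f[video_ids[0]].shape)  # e.g. (num_clips, feat_dim)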

Pre-training with Temporal Referring Modeling

# Training
python run.py with \
    data_root=data \
    num_gpus=1 \
    num_nodes=1 \
    task_pretrain_trm \
    per_gpu_batchsize=32 \
    load_path=/path/to/vqa.pth

# Inference
python run.py with \
    data_root=data \
    num_gpus=1 \
    num_nodes=1 \
    task_pretrain_trm \
    per_gpu_batchsize=32 \
    load_path=/path/to/trm.ckpt \
    test_only=True

Downstream

ActivityNet-QA

# Training
python run.py with \
    data_root=data/anetqa \
    num_gpus=1 \
    num_nodes=1 \
    load_path=/path/to/trm.ckpt \
    task_finetune_anetqa \
    per_gpu_batchsize=8

# Inference
python run.py with \
    data_root=data/anetqa \
    num_gpus=1 \
    num_nodes=1 \
    load_path=/path/to/anetqa.ckpt \
    task_finetune_anetqa \
    per_gpu_batchsize=16 \
    test_only=True

AGQA

# Training
python run.py with \
    data_root=data/agqa \
    num_gpus=1 \
    num_nodes=1 \
    load_path=/path/to/trm.ckpt \
    task_finetune_agqa \
    per_gpu_batchsize=16

# Inference
python run.py with \
    data_root=data/agqa \
    num_gpus=1 \
    num_nodes=1 \
    load_path=/path/to/agqa.ckpt \
    task_finetune_agqa \
    per_gpu_batchsize=32 \
    test_only=True

Evaluation

python eval.py {anet, agqa} PREDICTION GROUNDTRUTH

For example,

python eval.py anet result/anetqa_by_anetqa.json data/anetqa/test.csv
python eval.py agqa result/agqa_by_agqa.json data/agqa/test.json
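
For reference, the overall accuracy computation roughly amounts to the sketch below. It assumes the prediction JSON maps question IDs to answer strings and that the ground-truth CSV has question_id and answer columns; the actual eval.py may also report per-question-type metrics.

# Rough sketch of overall accuracy for ActivityNet-QA. Field names and file
# formats are assumptions; eval.py is the authoritative implementation.
import csv
import json

with open("result/anetqa_by_anetqa.json") as f:
    preds = json.load(f)  # assumed: {question_id: predicted answer}

correct = total = 0
with open("data/anetqa/test.csv") as f:
    for row in csv.DictReader(f):
        total += 1
        correct += int(preds.get(row["question_id"]) == row["answer"])

print(f"accuracy: {correct / total:.4f}")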

Preliminary Analysis with ALBEF

bash scripts/train_anetqa_mean.sh
bash scripts/train_agqa_mean.sh

Citation

@inproceedings{Lee_2022_BMVC,
    author    = {Hsin-Ying Lee and Hung-Ting Su and Bing-Chen Tsai and Tsung-Han Wu and Jia-Fong Yeh and Winston H. Hsu},
    title     = {Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling},
    booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
    publisher = {{BMVA} Press},
    year      = {2022},
    url       = {https://bmvc2022.mpi-inf.mpg.de/0116.pdf}
}

Acknowledgements

The code is based on METER and ALBEF.
