
Video Question Answering with Prior Knowledge and Object-sensitive Learning

Paper | TIP 2022

Figure 1. Overview of the proposed PKOL architecture for video question answering.


Setups

  • Ubuntu 20.04
  • CUDA 11.5
  • Python 3.7
  • PyTorch 1.7.0 + cu110
  1. Clone this repository:
git clone https://github.com/zchoi/PKOL.git
  2. Install dependencies (a quick environment check sketch follows these steps):
conda create -n vqa python=3.7
conda activate vqa
pip install -r requirements.txt
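
If you want to confirm that the installed environment matches the versions listed above before moving on, a quick check like the following can help. This is a minimal convenience sketch, not part of the repository; the file name env_check.py is arbitrary.

```python
# env_check.py -- sanity-check the environment described above (convenience sketch).
import sys
import torch

print("Python :", sys.version.split()[0])    # expected 3.7.x
print("PyTorch:", torch.__version__)         # expected 1.7.0+cu110
print("CUDA   :", torch.version.cuda)        # expected 11.0 (the cu110 build)
print("GPU OK :", torch.cuda.is_available())
```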

Data Preparation

  • Text Features

    Download the pre-extracted text features from here (code: zca5) and place them into data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or data/tgif-qa/{question_type}/ for TGIF-QA.

  • Visual Features

    • For appearance and motion features, we use this repo [1].

    • For object features, we use a Faster R-CNN [2] pre-trained on Visual Genome [3].

    Download the pre-extracted visual features from here (code: zca5) and place them into data/{dataset}-qa/ for MSVD-QA and MSRVTT-QA, or data/tgif-qa/{question_type}/ for TGIF-QA. A sketch for checking the resulting directory layout follows the note below.

Important

The object features are very large (~700 GB for TGIF-QA alone); make sure you have enough free disk space before downloading.
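
Once everything is downloaded, a small script like the one below can confirm the expected layout and the remaining disk space. This is a minimal sketch based on the placement instructions above; the directory names (msvd-qa, msrvtt-qa, tgif-qa/{task}) are assumptions derived from the {dataset}-qa placeholders, and individual feature file names are not checked.

```python
# check_data_layout.py -- verify the data layout described above (convenience sketch;
# directory names are assumed from the placement instructions and may need adjusting).
import shutil
from pathlib import Path

DATA_ROOT = Path("data")

# Assumed feature directories for the three datasets / four TGIF-QA question types.
expected_dirs = [
    DATA_ROOT / "msvd-qa",
    DATA_ROOT / "msrvtt-qa",
    *(DATA_ROOT / "tgif-qa" / task for task in ("action", "transition", "count", "frameqa")),
]

for d in expected_dirs:
    print(f"{'ok' if d.is_dir() else 'MISSING':8s} {d}")

# Rough free-space check; the TGIF-QA object features alone take ~700 GB.
free_gb = shutil.disk_usage(DATA_ROOT if DATA_ROOT.exists() else ".").free / 1e9
print(f"free space on data volume: {free_gb:.0f} GB")
```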

Experiments

For MSVD-QA and MSRVTT-QA:

Training:

python train_iterative.py --cfg configs/msvd_qa.yml

Evaluation:

python validate_iterative.py --cfg configs/msvd_qa.yml

For TGIF-QA:

Choose the config file configs/tgif_qa_{task}.yml for one of the four tasks (action, transition, count, frameqa) to train or evaluate the model. For example, to train on the action task, run the following command. A sketch that loops over all four tasks is given after the evaluation command below.

Training:

python train_iterative.py --cfg configs/tgif_qa_action.yml

Evaluation:

python validate_iterative.py --cfg configs/tgif_qa_action.yml
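
To run training or evaluation over all four TGIF-QA tasks in one go, a small driver like the following can be used. It is a sketch built on the commands above; the tgif_qa_{task}.yml config naming is assumed from the action example.

```python
# run_tgif_tasks.py -- drive the command line shown above over all four TGIF-QA tasks.
import subprocess

TASKS = ["action", "transition", "count", "frameqa"]
SCRIPT = "train_iterative.py"  # switch to "validate_iterative.py" for evaluation

for task in TASKS:
    cfg = f"configs/tgif_qa_{task}.yml"  # assumed naming, matching the action example
    print(f"== running {SCRIPT} with {cfg} ==")
    subprocess.run(["python", SCRIPT, "--cfg", cfg], check=True)
```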

Results

Performance on MSVD-QA and MSRVTT-QA datasets:

| Model | MSVD-QA | MSRVTT-QA |
|-------|---------|-----------|
| PKOL  | 41.1    | 36.9      |

Performance on TGIF-QA dataset:

| Model | Count ↓ | FrameQA ↑ | Trans. ↑ | Action ↑ |
|-------|---------|-----------|----------|----------|
| PKOL  | 3.67    | 61.8      | 82.8     | 74.6     |

Reference

[1] Le, Thao Minh, et al. "Hierarchical conditional relation networks for video question answering." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.

[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems 28 (2015).

[3] Krishna, Ranjay, et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International journal of computer vision 123.1 (2017): 32-73.

Citation

@article{PKOL,
  title   = {Video Question Answering with Prior Knowledge and Object-sensitive Learning},
  author  = {Pengpeng Zeng and 
             Haonan Zhang and 
             Lianli Gao and 
             Jingkuan Song and 
             Heng Tao Shen
             },
  journal = {IEEE Transactions on Image Processing},
  doi     = {10.1109/TIP.2022.3205212},
  pages   = {5936--5948},
  year    = {2022}
}

Acknowledgements

Our code implementation is based on this repo.