Video Question Answering with Phrases via Semantic Roles
Arka Sadhu, Kan Chen, Ram Nevatia
NAACL 2021
Video Question Answering has typically been studied as N-way phrase classification. While this eases evaluation, it severely limits its application in the wild. Here, we instead require the model to generate the answer phrase, and we propose a novel evaluation metric that combines relative scoring and contrastive scoring. We further create two datasets, ActivityNet-SRL-QA and Charades-SRL-QA.
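The paragraph above names relative scoring and contrastive scoring without defining them. The Python sketch below illustrates one plausible reading of each; it is an assumption for illustration, not the paper's definition and not the code in vidqa_code/eval_fn_vidqap.py. Assumptions: a toy unigram F1 stands in for METEOR/ROUGE/BERTScore; answers are scored after being filled into the question at a hypothetical `<Q>` slot; relative scoring rescales against an empty answer; contrastive scoring keeps only the margin over an answer from a contrasting sample. The names `unigram_f1`, `fill`, `relative_score`, and `contrastive_score` are hypothetical.

```python
from collections import Counter
from typing import Callable

# A metric maps (candidate, reference) -> score, typically in [0, 1].
Metric = Callable[[str, str], float]


def unigram_f1(cand: str, ref: str) -> float:
    """Toy stand-in for METEOR/ROUGE/BERTScore: F1 of unigram overlap."""
    c, r = cand.lower().split(), ref.lower().split()
    overlap = sum((Counter(c) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)


def fill(question: str, phrase: str, slot: str = "<Q>") -> str:
    """Hypothetical helper: insert a phrase into the question's masked slot."""
    return " ".join(question.replace(slot, phrase).split())


def relative_score(metric: Metric, question: str, pred: str, gt: str) -> float:
    """Assumed form: rescale so an empty answer scores 0 and the ground truth 1."""
    ref = fill(question, gt)
    empty = metric(fill(question, ""), ref)
    best = metric(ref, ref)
    if best == empty:  # degenerate case: the question alone already matches
        return 0.0
    return (metric(fill(question, pred), ref) - empty) / (best - empty)


def contrastive_score(metric: Metric, question: str, pred: str,
                      gt: str, contrast_gt: str) -> float:
    """Assumed form: credit only the margin by which pred fits gt better
    than it fits the answer of a contrasting sample."""
    cand = fill(question, pred)
    margin = metric(cand, fill(question, gt)) - metric(cand, fill(question, contrast_gt))
    return max(0.0, margin)


if __name__ == "__main__":
    q = "A man <Q> a ball"
    # Raw score of a wrong answer, inflated by words shared with the question.
    print(unigram_f1(fill(q, "throws"), fill(q, "kicks")))
    # Relative score of the same wrong answer under the assumed form.
    print(relative_score(unigram_f1, q, "throws", "kicks"))
    # Margin of a correct answer over a contrastive answer.
    print(contrastive_score(unigram_f1, q, "kicks", "kicks", "throws"))
```

With the question "A man <Q> a ball", the wrong answer "throws" gets a raw unigram F1 of about 0.8 against "A man kicks a ball", almost entirely from words copied from the question; under this assumed relative form it drops to about -0.8, below the empty-answer baseline.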
- Clone repo:
git clone https://github.com/TheShadow29/Video-QAP
cd Video-QAP
export ROOT=$(pwd)
- Set up a new conda environment using the provided vidqap_env.yml file. Please refer to Miniconda for details on installing conda.
MINICONDA_ROOT=[path to your Miniconda/Anaconda root directory]
conda env create -f vidqap_env.yml --prefix $MINICONDA_ROOT/envs/vidqap_pyt
conda activate vidqap_pyt
- See INSTALL.md for instructions on installing fairseq.
- To download the datasets ActivityNet-SRL-QA and Charades-SRL-QA, see DATA.md.
- Configuration files are in the configs directory.
- To train, run the command below, setting --mdl.name to one of the models: lqa, mtx_qa, butd_qa, vog_qa.
cd $ROOT
python code/main_dist.py "vogqap_asrlqa" --ds_to_use='asrl_qa' --mdl.name='vog_qa' --train.bs=4 --train.epochs=10 --train.lr=1e-4
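To try each of the four models in turn, a small Python helper can build the training command for a given model name. This is a minimal sketch that reuses only the flags shown in the command above; the helper names `train_cmd` and `launch`, and the per-model experiment names such as "lqa_asrlqa", are hypothetical choices, not part of the repository.

```python
import shlex
import subprocess
from typing import List

# Model names listed in this README.
MODELS = ("lqa", "mtx_qa", "butd_qa", "vog_qa")


def train_cmd(exp_name: str, model: str, ds: str = "asrl_qa",
              bs: int = 4, epochs: int = 10, lr: str = "1e-4") -> List[str]:
    """Hypothetical helper: build argv for code/main_dist.py from the README's flags."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    return [
        "python", "code/main_dist.py", exp_name,
        f"--ds_to_use={ds}",
        f"--mdl.name={model}",
        f"--train.bs={bs}",
        f"--train.epochs={epochs}",
        f"--train.lr={lr}",
    ]


def launch(cmd: List[str], root: str) -> None:
    """Run the command from the repository root; raises if training fails."""
    subprocess.run(cmd, cwd=root, check=True)


if __name__ == "__main__":
    # Dry run: print one command per model instead of launching training.
    for m in MODELS:
        print(shlex.join(train_cmd(f"{m}_asrlqa", m)))
```

Passing the learning rate as a string keeps it as "1e-4" on the command line, exactly as in the README, instead of Python's "0.0001" rendering of a float.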
- The main evaluation file is vidqa_code/eval_fn_vidqap.py. It can also be used as a stand-alone script to evaluate a separate dataset.
cd $ROOT
python vidqa_code/eval_fn_vidqap.py --pred_file=... --ds_to_use='asrl_qa' --split_type='valid' --met_keys='meteor,rouge,bert_score'
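The --met_keys flag in the evaluation command takes a comma-separated list of metric names, so a subset of metrics can be requested. The sketch below builds that command in Python, assuming only the flags and metric keys shown above; the helper name `eval_cmd` is hypothetical, and the prediction file path is left to the caller because its format is not documented here.

```python
import shlex
from typing import List, Sequence


def eval_cmd(pred_file: str, ds: str = "asrl_qa", split: str = "valid",
             metrics: Sequence[str] = ("meteor", "rouge", "bert_score")) -> List[str]:
    """Hypothetical helper: build argv for vidqa_code/eval_fn_vidqap.py."""
    if not metrics:
        raise ValueError("need at least one metric key")
    return [
        "python", "vidqa_code/eval_fn_vidqap.py",
        f"--pred_file={pred_file}",
        f"--ds_to_use={ds}",
        f"--split_type={split}",
        # Metric names are joined into the comma-separated form the README uses.
        f"--met_keys={','.join(metrics)}",
    ]


if __name__ == "__main__":
    # shlex.join quotes arguments containing spaces, so the printed line is shell-safe.
    print(shlex.join(eval_cmd("my preds file", metrics=["meteor"])))
```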
ToDo:
- Add more documentation on how to run the models
- Add pre-trained model weights.
- Support dataset creation for new caption datasets.
We thank:
- @LuoweiZhou for their codebase on GVD (https://github.com/facebookresearch/grounded-video-description) and the extracted features for ActivityNet.
- @antoine77340 for their codebase on S3D pretrained on HowTo100M (https://github.com/antoine77340/S3D_HowTo100M), used for feature extraction on Charades.
- AllenNLP for providing a demo and a pre-trained model for SRL.
- fairseq for its sequence generation implementation and Transformer encoder-decoder models.
@inproceedings{Sadhu2021VideoQA,
title={Video Question Answering with Phrases via Semantic Roles},
author={Arka Sadhu and Kan Chen and R. Nevatia},
booktitle={NAACL},
year={2021}
}