VRU-NExT/VideoQA

Video Question Answering: Datasets, Algorithms and Challenges

This repository collects open-source code, leaderboards, datasets, and paper lists for Video Question Answering (VideoQA). If you find any errors, please don't hesitate to open an issue or pull request.

If you find this repository helpful for your work, please cite the following paper. The BibTeX entry is listed below:

@inproceedings{zhong2022Video,
      title={Video Question Answering: Datasets, Algorithms and Challenges}, 
      author={Yaoyao Zhong and Junbin Xiao and Wei Ji and Yicong Li and Weihong Deng and Tat-Seng Chua},
      booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing},
      year={2022},
}

Contributors

Contributed by Yaoyao Zhong, Junbin Xiao and Wei Ji.

Thanks to our adviser Tat-Seng Chua for the support!


Resources


Open-sourced code


Leaderboards

Inference QA

NExT-QA
| Rank | Name | Techniques and Insights | NExT-Val | NExT-Test |
|---|---|---|---|---|
| / | Human Performance | / | 88.4 | / |
| 1 | [VGT-ECCV2022] | Graph, Transformer, Hierarchical Learning, Multi-Granularity | 55.02 | 53.68 |
| 2 | [ATP-CVPR2022] | Transformer, Cross-modal Pre-training and Fine-tuning | 54.3 | / |
| 3 | [EIGV-ACMMM2022] | Causality, Graph | / | 53.7 |
| 4 | [(2.5+1)D-Transformer-AAAI2022] | Graph, Transformer, Multi-Granularity | 53.4 | / |
| 5 | [MMA-arXiv2022] | Graph, Hierarchical Learning | 53.3 | 52.4 |
| 6 | [HQGA-AAAI2022] | Modular Networks, Graph, Hierarchical Learning, Multi-Granularity | 51.42 | 51.75 |
| 7 | [IGV-CVPR2022] | Causality, Graph | / | 51.34 |
| 8 | [HGA-AAAI2020] | Graph | 49.74 | 50.01 |
| 9 | [HCRN-CVPR2020] | Modular Networks, Hierarchical Learning | 48.20 | 48.98 |

Factoid QA

Pre-Training
| Rank | Name | Cross-Modal Pre-Training | TGIF-Frame | MSVD-QA | MSRVTT-QA |
|---|---|---|---|---|---|
| 1 | [MERLOT-NIPS2021] | Youtube-Temporal-180M & Conceptual Captions-3M | 69.5 | / | 43.1 |
| 2 | [VIOLET] | WebVid2.5M & Youtube-Temporal-180M & Conceptual Captions-3M | 68.9 | 47.9 | 43.9 |
| 3 | [SSRea-NIPS2021] | Visual Genome & COCO | 60.2 | 45.5 | 41.6 |
| 4 | [VQA-T-ICCV2021] | HowToVQA69M | / | 46.3 | 41.5 |
| 5 | [ClipBERT-CVPR2021] | Visual Genome & COCO | 60.3 | / | 37.4 |
No Pre-Training
| Rank | Name | Techniques and Insights | Video Encoder | Text Encoder | TGIF-Frame | MSVD-QA | MSRVTT-QA |
|---|---|---|---|---|---|---|---|
| 1 | [DMRVSG-NAACL2022] | Memory, Graph | Image caption and scene parser, Swin Transformer | RoBERTa | 62.5 | / | 41.6 |
| 2 | [HQGA-AAAI2022] | Modular Networks, Graph, Hierarchical Learning, Multi-Granularity | RN, RX(3D), RoI | BERT | 61.3 | 41.2 | 38.6 |
| 3 | [PGAT-ACMMM2021] | Graph, Hierarchical Learning, Multi-Granularity | RN, RX(3D), RoI | GloVe | 61.1 | 39.0 | 38.1 |
| 4 | [HOSTR-IJCAI2021] | Modular Networks, Graph, Hierarchical Learning | RN, RX(3D), RoI | GloVe | 58.2 | 39.4 | 35.9 |
| 5 | [HAIR-ICCV2021] | Graph, Memory, Hierarchical Learning | RoI | GloVe | 60.2 | 37.5 | 36.9 |

Datasets

| Name | Links | Types | Source | #Video/#Question/Video Length (s) | Annotation |
|---|---|---|---|---|---|
| VideoQA(FIB) | [Paper], [Dataset] | VideoQA, Factoid | Multiple Sources | 109K/390K/- | Auto |
| VideoQA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 18K/174K/90 | Auto, Man |
| MovieQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Movie | 6.7K/6.4K/202 | Man |
| TGIF-QA | [Paper], [Dataset] | VideoQA, Inference | GIF | 71K/165K/3 | Auto, Man |
| MovieFIB | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Movie | 118K/348K/4.1 | Auto |
| MSVD-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 1.9K/50K/10 | Auto |
| MSRVTT-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 10K/243K/15 | Auto |
| YouTube2Text-QA | [Paper] | VideoQA, Factoid | MSVD | 1.9K/48K/- | Auto |
| MarioQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Gameplay | 92K/92K/3-6 | Auto |
| PororoQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Cartoon | 171/8.9K/8.3 | Man |
| TVQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | TV | 21K/152K/76 | Man |
| SVQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Synthetic Videos | 12K/118K/- | Auto |
| Social-IQ | [Paper], [Dataset] | MM VideoQA, Inference | Web Videos | 1.2K/7.5K/- | Man |
| ActivityNet-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 5.8K/58K/180 | Man |
| EgoVQA | [Paper], [Dataset] | VideoQA, Factoid | Egocentric Videos | 520/580/20-100 | Man |
| CLEVRER | [Paper], [Dataset], [Code] | VideoQA, Inference | Synthetic Videos | 10K/305K/5 | Auto |
| TVQA+ | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | TV | 4.1K/29K/61 | Man |
| KnowIT VQA | [Paper], [Dataset], [Code] | KB VideoQA, Inference | TV | 12K/24K/20 | Man |
| TutorialVQA | [Paper], [Dataset] | VideoQA, Inference | Tutorial Videos | 408/6.1K/- | Man |
| LifeQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Web Videos | 275/2.3K/74 | Man |
| PsTuts-VQA | [Paper], [Dataset] | KB VideoQA, Inference | Tutorial Videos | 76/17K/262 | Man |
| How2QA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Web Videos | 22K/44K/60 | Man |
| V2C-QA | [Paper], [Dataset], [Code] | CS VideoQA, Inference | Web Videos | 1.5K/37K/- | Auto |
| NExT-QA | [Paper], [Dataset], [Code] | VideoQA, Inference | Web Videos | 5.4K/52K/44 | Man |
| AGQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Homemade Videos | 9.6K/192M/30 | Auto |
| HowToVQA69M | [Paper], [Dataset], [Code] | VideoQA, Factoid | Web Videos | 69M/69M/12 | Auto |
| iVQA | [Paper], [Dataset], [Code] | VideoQA, Factoid | Web Videos | 10K/10K/18 | Man |
| SUTD-TrafficQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Traffic Scenes | 10K/62K/- | Man |
| Env-QA | [Paper], [Dataset] | MM VideoQA, Factoid | Egocentric Videos | 23K/85K/20 | Auto, Man |
| Pano-AVQA | [Paper], [Dataset] | MM VideoQA, Factoid | 360° Videos | 5.4K/51.7K/5 | Man |
| DramaQA | [Paper], [Dataset], [Code] | MM VideoQA, Inference | TV | 23K/17K/- | Man |
| STAR | [Paper], [Dataset], [Code] | VideoQA, Inference | Homemade Videos | 22K/60K/- | Auto |
| TGIF-QA-R | [Paper], [Dataset] | VideoQA, Inference | GIF | 71K/165K/3 | Auto |
| KnowIT-X VQA | [Paper], [Dataset] | KB VideoQA, Inference | TV | 12.1K/21.4K/20 | Man |
| ASRL-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 35K/162K/36.2 | Auto |
| Charades-SRL-QA | [Paper], [Dataset] | VideoQA, Factoid | Homemade Videos | 9.5K/71K/29 | Auto |
| MedVidQA | [Paper], [Dataset] | VideoQA, Factoid | Medical Instructional Videos | 899/3K/4 | Man |
| NEWSKVQA | [Paper], [Dataset] | KB VideoQA, Inference | News Videos | 12K/1M/30 | Auto |
| VQuAD | [Paper], [Dataset] | VideoQA, Inference | Synthetic Videos | 7K/1.3M/- | Auto |
| AGQA 2.0 | [Paper], [Dataset] | VideoQA, Inference | Homemade Videos | 9.6K/4.55M/30 | Auto |
| Causal-VidQA | [Paper], [Dataset] | VideoQA, Inference | Web Videos | 26K/107K/9 | Man |
| MUSIC-AVQA | [Paper], [Dataset] | MM VideoQA, Factoid | Music Videos | 9.3K/45K/60 | Man |
| WebVidQA3M | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 2M/3M/4 | Auto |
| FIBER | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 28K/28K/10 | Man |
| WildQA | [Paper], [Dataset] | VideoQA, Factoid | In-the-wild Videos | 369/916/71.22 | Man |

Paper Lists

Survey

  1. Video Question Answering: Datasets, Algorithms and Challenges arXiv 2022 [Paper].
  2. Video question answering: a survey of models and datasets Mobile Networks and Applications 2021 [Paper].
  3. Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey IEEE Access 2021 [Paper].
  4. Recent Advances in Video Question Answering: A Review of Datasets and Methods ICPR Workshops 2021 [Paper].

Early Works (Variants of RNNs)

  1. Uncovering the temporal context for video question answering IJCV 2017 [paper][Code].
  2. TGIF-QA: Toward spatio-temporal reasoning in visual question answering CVPR 2017 [paper][Code].
  3. End-to-end concept word detection for video captioning, retrieval, and question answering CVPR 2017 [paper][3rd Party Code].
  4. Video question answering via gradually refined attention over appearance and motion ACMMM 2017 [paper][Code].
  5. Video question answering via hierarchical dual-level attention network learning ACMMM 2017 [paper].
  6. Video question answering via attribute-augmented attention network learning SIGIR 2017 [paper].
  7. MarioQA: Answering Questions by Watching Gameplay Videos ICCV 2017 [paper][Code].
  8. Video Question Answering via Hierarchical Spatio-Temporal Attention Networks IJCAI 2017 [paper][Code].
  9. Unifying the video and question attentions for open-ended video question answering TIP 2017 [paper][Code].
  10. TVQA: Localized, compositional video question answering EMNLP 2018 [paper][Code].
  11. Explore multi-step reasoning in video question answering ACMMM 2018 [paper][Code].
  12. Hierarchical relational attention for video question answering ICIP 2018 [paper].
  13. Temporal Attention and Consistency Measuring for Video Question Answering ICMI 2020 [paper].
  14. Video Question Answering with Spatio-Temporal Reasoning IJCV 2019 [paper].

Memory Networks

  1. DeepStory: Video Story QA by Deep Embedded Memory Networks IJCAI 2017 [paper][Code].
  2. Motion-appearance co-memory networks for video question answering CVPR 2018 [paper].
  3. Multimodal dual attention memory for video story question answering ECCV 2018 [paper].
  4. Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents AAAI 2018 [paper].
  5. A better way to attend: Attention with trees for video question answering TIP 2018 [paper].
  6. The forgettable-watcher model for video question answering Neurocomputing 2018 [paper].
  7. Heterogeneous memory enhanced multimodal attention model for video question answering CVPR 2019 [paper][Code].
  8. Progressive attention memory network for movie story question answering CVPR 2019 [paper].
  9. Memory augmented deep recurrent neural network for video question answering TCSVT 2019 [paper].
  10. Long-term video question answering via multimodal hierarchical memory attentive networks TCSVT 2020 [paper].

Transformer

  1. Beyond RNNs: Positional self-attention with co-attention for video question answering AAAI 2019 [paper][Code].
  2. Multi-Question Learning for Visual Question Answering AAAI 2020 [paper].
  3. BERT Representations for Video Question Answering WACV 2020 [paper].
  4. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training EMNLP 2020 [paper][Code].
  5. MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering EMNLP 2020 [paper][Code].
  6. Action-Centric Relation Transformer Network for Video Question Answering TCSVT 2020 [paper][Code].
  7. Less is more: ClipBERT for video-and-language learning via sparse sampling CVPR 2021 [paper][Code].
  8. Just Ask: Learning to Answer Questions from Millions of Narrated Videos ICCV 2021 [paper][Code].
  9. On the hidden treasure of dialog in video question answering ICCV 2021 [paper][Code].
  10. MERLOT: Multimodal Neural Script Knowledge Models NIPS 2021 [paper][Code].
  11. Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering NIPS 2021 [paper].
  12. A comparative study of language transformers for video question answering Neurocomputing 2021 [paper].
  13. Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering arXiv 2021 [paper][Code].
  14. VIOLET: End-to-end video-language transformers with masked visual-token modeling arXiv 2021 [paper][Code].
  15. Revitalize Region Feature for Democratizing Video-Language Pre-training arXiv 2021 [paper][Code].
  16. Revisiting the Video in Video-Language Understanding CVPR 2022 [paper].
  17. MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound CVPR 2022 [paper][Code].
  18. Video Graph Transformer for Video Question Answering ECCV 2022 [paper][Code].

Graph Neural Networks

  1. Location-Aware Graph Convolutional Networks for Video Question Answering AAAI 2020 [paper].
  2. Reasoning with heterogeneous graph alignment for video question answering AAAI 2020 [paper][Code].
  3. Knowledge-based video question answering with unsupervised scene descriptions ECCV 2020 [paper].
  4. Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering CVPR 2021 [paper].
  5. HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering ICCV 2021 [paper].
  6. Progressive Graph Attention Network for Video Question Answering ACMMM 2021 [paper][Code].
  7. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering ACL 2021 [paper][Code].
  8. DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering TMM 2021 [paper][Code].
  9. Object-Centric Representation Learning for Video Question Answering IJCNN 2021 [paper].
  10. Video as Conditional Graph Hierarchy for Multi-Granular Question Answering AAAI 2022 [paper][Code].
  11. Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering TIP 2022 [paper].
  12. (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering AAAI 2022 [paper].
  13. Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives AAAI 2022 [paper].
  14. Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering NAACL 2022 [paper].

Modular Networks

  1. Question-aware tube-switch network for video question answering ACMMM 2019 [paper].
  2. Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks IJCAI 2019 [paper].
  3. Hierarchical conditional relation networks for video question answering CVPR 2020 [paper][Code].
  4. Neural Reasoning, Fast and Slow, for Video Question Answering IJCNN 2020 [paper].
  5. Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering IJCAI 2021 [paper].

Neural-Symbolic

  1. CLEVRER: Collision events for video representation and reasoning ICLR 2020 [paper][Code].
  2. Grounding physical concepts of objects and events through dynamic visual reasoning ICLR 2021 [paper][Code].
  3. Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language NIPS 2021 [paper][Code].
  4. STAR: A Benchmark for Situated Reasoning in Real-World Videos NIPS 2021 [paper][Code].

Flexibly Designed Networks

  1. A joint sequence fusion model for video question answering and retrieval ECCV 2018 [paper][Code].
  2. Structured Two-Stream Attention Network for Video Question Answering AAAI 2019 [paper].
  3. Multi-interaction Network with Object Relation for Video Question Answering ACMMM 2019 [paper].
  4. Learnable aggregating net with diversity learning for video question answering ACMMM 2019 [paper].
  5. Compositional attention networks with two-stream fusion for video question answering TIP 2019 [paper].
  6. Frame augmented alternating attention network for video question answering TMM 2019 [paper][Code].
  7. Spatiotemporal-Textual Co-Attention Network for Video Question Answering TOMM 2019 [paper].
  8. Modality shifting attention network for multi-modal video question answering CVPR 2020 [paper].
  9. Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering AAAI 2020 [paper].
  10. Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering ACMMM 2020 [paper].
  11. Long video question answering: A Matching-guided Attention Model PR 2020 [paper].
  12. SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events CVPR 2021 [paper].
  13. Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments ICCV 2021 [paper].
  14. Self-Supervised Pre-Training and Contrastive Representation Learning for Multiple-Choice Video QA AAAI 2021 [paper].
  15. Pairwise VLAD Interaction Network for Video Question Answering ACMMM 2021 [paper].
  16. Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering TNNLS 2021 [paper].

Others

Reinforced Decoder
  1. Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks IJCAI 2018 [paper].
  2. Long-form video question answering via dynamic hierarchical reinforced networks TIP 2019 [paper].
  3. Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks TIP 2020 [paper].
Knowledge Incorporation
  1. Video question answering via knowledge-based progressive spatial-temporal attention network TOMM 2019 [paper].
  2. KnowIT VQA: Answering Knowledge-Based Questions about Videos AAAI 2020 [paper][Code].
  3. Multichannel Attention Refinement for Video Question Answering TOMM 2020 [paper].
  4. Transferring Domain-Agnostic Knowledge in Video Question Answering BMVC 2021 [paper].
Commonsense Incorporation
  1. Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning EMNLP 2020 [paper].
  2. iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering BMVC 2021 [paper][Code].
Multi-task Learning
  1. Gaining extra supervision via multi-task learning for multi-modal video question answering IJCNN 2019 [paper].
  2. TVQA+: Spatio-Temporal Grounding for Video Question Answering ACL 2020 [paper][Code].
  3. Adversarial Multimodal Network for Movie Story Question Answering TMM 2020 [paper].
  4. Learning to Answer Questions in Dynamic Audio-Visual Scenarios CVPR 2022 [paper][Code].
Input Data
  1. Data augmentation techniques for the Video Question Answering task ECCV 2020 [paper].
  2. Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature ICCV 2021 [paper].
Data Bias
  1. On Modality Bias in the TVQA Dataset BMVC 2020 [paper][Code].
  2. What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets arXiv 2020 [paper].
Causality
  1. Invariant Grounding for Video Question Answering CVPR 2022 [paper][Code].
  2. Equivariant and Invariant Grounding for Video Question Answering ACMMM 2022 [paper][Code].
