This repository collects code, leaderboards, datasets, and papers for Video Question Answering (VideoQA). If you find any errors, please don't hesitate to open an issue or pull request.
If you find this repository helpful for your work, please cite the following paper. The BibTeX entry is listed below:
@inproceedings{zhong2022Video,
  title={Video Question Answering: Datasets, Algorithms and Challenges},
  author={Yaoyao Zhong and Junbin Xiao and Wei Ji and Yicong Li and Weihong Deng and Tat-Seng Chua},
  booktitle={The 2022 Conference on Empirical Methods in Natural Language Processing},
  year={2022},
}
Contributed by Yaoyao Zhong, Junbin Xiao and Wei Ji.
Thanks to our adviser Tat-Seng Chua for his support!
Rank | Name | Techniques and Insights | NExT-Val | NExT-Test |
---|---|---|---|---|
/ | Human Performance | / | 88.4 | / |
1 | [VGT-ECCV2022] | Graph, Transformer, Hierarchical Learning, Multi-Granularity | 55.02 | 53.68 |
2 | [ATP-CVPR2022] | Transformer, Cross-modal Pre-training and Fine-tuning | 54.3 | / |
3 | [EIGV-ACMMM2022] | Causality, Graph | / | 53.7 |
4 | [(2.5+1)D-Transformer-AAAI2022] | Graph, Transformer, Multi-Granularity | 53.4 | / |
5 | [MMA-arXiv2022] | Graph, Hierarchical Learning | 53.3 | 52.4 |
6 | [HQGA-AAAI2022] | Modular Networks, Graph, Hierarchical Learning, Multi-Granularity | 51.42 | 51.75 |
7 | [IGV-CVPR2022] | Causality, Graph | / | 51.34 |
8 | [HGA-AAAI2020] | Graph | 49.74 | 50.01 |
9 | [HCRN-CVPR2020] | Modular Networks, Hierarchical Learning | 48.20 | 48.98 |
Rank | Name | Cross-Modal Pre-Training | TGIF-Frame | MSVD-QA | MSRVTT-QA |
---|---|---|---|---|---|
1 | [MERLOT-NIPS2021] | YouTube-Temporal-180M & Conceptual Captions-3M | 69.5 | / | 43.1 |
2 | [VIOLET] | WebVid2.5M & YouTube-Temporal-180M & Conceptual Captions-3M | 68.9 | 47.9 | 43.9 |
3 | [SSRea-NIPS2021] | Visual Genome & COCO | 60.2 | 45.5 | 41.6 |
4 | [VQA-T-ICCV2021] | HowToVQA69M | / | 46.3 | 41.5 |
5 | [ClipBERT-CVPR2021] | Visual Genome & COCO | 60.3 | / | 37.4 |
Rank | Name | Techniques and Insights | Video Encoder | Text Encoder | TGIF-Frame | MSVD-QA | MSRVTT-QA |
---|---|---|---|---|---|---|---|
1 | [DMRVSG-NAACL2022] | Memory, Graph | Image caption and scene parser, Swin Transformer | RoBERTa | 62.5 | / | 41.6 |
2 | [HQGA-AAAI2022] | Modular Networks, Graph, Hierarchical Learning, Multi-Granularity | RN, RX(3D), RoI | BERT | 61.3 | 41.2 | 38.6 |
3 | [PGAT-ACMMM2021] | Graph, Hierarchical Learning, Multi-Granularity | RN, RX(3D), RoI | GloVe | 61.1 | 39.0 | 38.1 |
4 | [HOSTR-IJCAI2021] | Modular Networks, Graph, Hierarchical Learning | RN, RX(3D), RoI | GloVe | 58.2 | 39.4 | 35.9 |
5 | [HAIR-ICCV2021] | Graph, Memory, Hierarchical Learning | RoI | GloVe | 60.2 | 37.5 | 36.9 |
Name | Links | Types | Source | #Video/#Question/Video Length(s) | Annotation |
---|---|---|---|---|---|
VideoQA(FIB) | [Paper], [Dataset] | VideoQA, Factoid | Multiple Sources | 109K/390K/- | Auto |
VideoQA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 18K/174K/90 | Auto, Man |
MovieQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Movie | 6.7K/6.4K/202 | Man |
TGIF-QA | [Paper], [Dataset] | VideoQA, Inference | GIF | 71K/165K/3 | Auto, Man |
MovieFIB | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Movie | 118K/348K/4.1 | Auto |
MSVD-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 1.9K/50K/10 | Auto |
MSRVTT-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 10K/243K/15 | Auto |
YouTube2Text-QA | [Paper] | VideoQA, Factoid | MSVD | 1.9K/48K/- | Auto |
MarioQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Gameplay | 92K/92K/3-6 | Auto |
PororoQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Cartoon | 171/8.9K/8.3 | Man |
TVQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | TV | 21K/152K/76 | Man |
SVQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Synthetic Videos | 12K/118K/- | Auto |
Social-IQ | [Paper], [Dataset] | MM VideoQA, Inference | Web Videos | 1.2K/7.5K/- | Man |
ActivityNet-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 5.8K/58K/180 | Man |
EgoVQA | [Paper], [Dataset] | VideoQA, Factoid | Egocentric Videos | 520/580/20-100 | Man |
CLEVRER | [Paper], [Dataset], [Code] | VideoQA, Inference | Synthetic Videos | 10K/305K/5 | Auto |
TVQA+ | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | TV | 4.1K/29K/61 | Man |
KnowIT VQA | [Paper], [Dataset], [Code] | KB VideoQA, Inference | TV | 12K/24K/20 | Man |
TutorialVQA | [Paper], [Dataset] | VideoQA, Inference | Tutorial Videos | 408/6.1K/- | Man |
LifeQA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Web Videos | 275/2.3K/74 | Man |
PsTuts-VQA | [Paper], [Dataset] | KB VideoQA, Inference | Tutorial Videos | 76/17K/262 | Man |
How2QA | [Paper], [Dataset], [Code] | MM VideoQA, Factoid | Web Videos | 22K/44K/60 | Man |
V2C-QA | [Paper], [Dataset], [Code] | CS VideoQA, Inference | Web Videos | 1.5K/37K/- | Auto |
NExT-QA | [Paper], [Dataset], [Code] | VideoQA, Inference | Web Videos | 5.4K/52K/44 | Man |
AGQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Homemade Videos | 9.6K/192M/30 | Auto |
HowToVQA69M | [Paper], [Dataset], [Code] | VideoQA, Factoid | Web Videos | 69M/69M/12 | Auto |
iVQA | [Paper], [Dataset], [Code] | VideoQA, Factoid | Web Videos | 10K/10K/18 | Man |
SUTD-TrafficQA | [Paper], [Dataset], [Code] | VideoQA, Inference | Traffic Scenes | 10K/62K/- | Man |
Env-QA | [Paper], [Dataset] | MM VideoQA, Factoid | Egocentric Videos | 23K/85K/20 | Auto, Man |
Pano-AVQA | [Paper], [Dataset] | MM VideoQA, Factoid | 360° Videos | 5.4K/51.7K/5 | Man |
DramaQA | [Paper], [Dataset], [Code] | MM VideoQA, Inference | TV | 23K/17K/- | Man |
STAR | [Paper], [Dataset], [Code] | VideoQA, Inference | Homemade Videos | 22K/60K/- | Auto |
TGIF-QA-R | [Paper], [Dataset] | VideoQA, Inference | GIF | 71K/165K/3 | Auto |
KnowIT-X VQA | [Paper], [Dataset] | KB VideoQA, Inference | TV | 12.1K/21.4K/20 | Man |
ASRL-QA | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 35K/162K/36.2 | Auto |
Charades-SRL-QA | [Paper], [Dataset] | VideoQA, Factoid | Homemade Videos | 9.5K/71K/29 | Auto |
MedVidQA | [Paper], [Dataset] | VideoQA, Factoid | Medical Instructional Videos | 899/3K/4 | Man |
NEWSKVQA | [Paper], [Dataset] | KB VideoQA, Inference | News Videos | 12K/1M/30 | Auto |
VQuAD | [Paper], [Dataset] | VideoQA, Inference | Synthetic Videos | 7K/1.3M/- | Auto |
AGQA 2.0 | [Paper], [Dataset] | VideoQA, Inference | Homemade Videos | 9.6K/4.55M/30 | Auto |
Causal-VidQA | [Paper], [Dataset] | VideoQA, Inference | Web Videos | 26K/107K/9 | Man |
MUSIC-AVQA | [Paper], [Dataset] | MM VideoQA, Factoid | Music Videos | 9.3K/45K/60 | Man |
WebVidQA3M | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 2M/3M/4 | Auto |
FIBER | [Paper], [Dataset] | VideoQA, Factoid | Web Videos | 28K/28K/10 | Man |
WildQA | [Paper], [Dataset] | VideoQA, Factoid | In-the-wild Videos | 369/916/71.22 | Man |
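
The statistics column in the dataset table packs three values into one string, such as `92K/92K/3-6`. The following is a minimal Python sketch of how such a cell could be parsed for sorting or filtering. The helper names `parse_count`, `parse_length`, and `parse_stats` are hypothetical and not part of this repository. It assumes that `K` and `M` denote thousands and millions, that `-` means the value is not reported, and that `a-b` is a range of video lengths in seconds.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumption: "K" and "M" are thousand/million suffixes, as the table suggests.
_SCALE = {"": 1, "K": 1_000, "M": 1_000_000}


class DatasetStats(NamedTuple):
    videos: Optional[int]
    questions: Optional[int]
    length_s: Optional[Tuple[float, float]]  # (min, max) video length in seconds


def parse_count(text: str) -> Optional[int]:
    """Turn '5.4K' into 5400; return None for the '-' placeholder."""
    text = text.strip()
    if text == "-":
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KkMm]?)", text)
    if match is None:
        raise ValueError(f"unrecognised count: {text!r}")
    number, suffix = match.groups()
    return round(float(number) * _SCALE[suffix.upper()])


def parse_length(text: str) -> Optional[Tuple[float, float]]:
    """Turn '3-6' into (3.0, 6.0) and '44' into (44.0, 44.0)."""
    text = text.strip()
    if text == "-":
        return None
    low, _, high = text.partition("-")
    return float(low), float(high or low)


def parse_stats(cell: str) -> DatasetStats:
    """Split a '#Video/#Question/Video Length(s)' cell into its three parts."""
    videos, questions, length = cell.split("/")
    return DatasetStats(
        parse_count(videos), parse_count(questions), parse_length(length)
    )


if __name__ == "__main__":
    for cell in ("109K/390K/-", "92K/92K/3-6", "9.6K/4.55M/30"):
        print(cell, "->", parse_stats(cell))
```

Because the table rounds its counts (for example `5.4K`), the parsed numbers are approximations of the true dataset sizes and are best used for rough comparison only.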

- Video Question Answering: Datasets, Algorithms and Challenges. arXiv 2022. [Paper].
- Video question answering: a survey of models and datasets. Mobile Networks and Applications 2021. [Paper].
- Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey. IEEE Access 2021. [Paper].
- Recent Advances in Video Question Answering: A Review of Datasets and Methods. ICPR Workshops 2021. [Paper].

- Uncovering the temporal context for video question answering. IJCV 2017. [paper][Code].
- TGIF-QA: Toward spatio-temporal reasoning in visual question answering. CVPR 2017. [paper][Code].
- End-to-end concept word detection for video captioning, retrieval, and question answering. CVPR 2017. [paper][3rd Party Code].
- Video question answering via gradually refined attention over appearance and motion. ACMMM 2017. [paper][Code].
- Video question answering via hierarchical dual-level attention network learning. ACMMM 2017. [paper].
- Video question answering via attribute-augmented attention network learning. SIGIR 2017. [paper].
- MarioQA: Answering Questions by Watching Gameplay Videos. ICCV 2017. [paper][Code].
- Video Question Answering via Hierarchical Spatio-Temporal Attention Networks. IJCAI 2017. [paper][Code].
- Unifying the video and question attentions for open-ended video question answering. TIP 2017. [paper][Code].
- TVQA: Localized, compositional video question answering. EMNLP 2018. [paper][Code].
- Explore multi-step reasoning in video question answering. ACMMM 2018. [paper][Code].
- Hierarchical relational attention for video question answering. ICIP 2018. [paper].
- Temporal Attention and Consistency Measuring for Video Question Answering. ICMI 2020. [paper].
- Video Question Answering with Spatio-Temporal Reasoning. IJCV 2019. [paper].

- DeepStory: Video Story QA by Deep Embedded Memory Networks. IJCAI 2017. [paper][Code].
- Motion-appearance co-memory networks for video question answering. CVPR 2018. [paper].
- Multimodal dual attention memory for video story question answering. ECCV 2018. [paper].
- Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents. AAAI 2018. [paper].
- A better way to attend: Attention with trees for video question answering. TIP 2018. [paper].
- The forgettable-watcher model for video question answering. Neurocomputing 2018. [paper].
- Heterogeneous memory enhanced multimodal attention model for video question answering. CVPR 2019. [paper][Code].
- Progressive attention memory network for movie story question answering. CVPR 2019. [paper].
- Memory augmented deep recurrent neural network for video question answering. TCSVT 2019. [paper].
- Long-term video question answering via multimodal hierarchical memory attentive networks. TCSVT 2020. [paper].

- Beyond RNNs: Positional self-attention with co-attention for video question answering. AAAI 2019. [paper][Code].
- Multi-Question Learning for Visual Question Answering. AAAI 2020. [paper].
- BERT Representations for Video Question Answering. WACV 2020. [paper].
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training. EMNLP 2020. [paper][Code].
- MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. EMNLP 2020. [paper][Code].
- Action-Centric Relation Transformer Network for Video Question Answering. TCSVT 2020. [paper][Code].
- Less is more: ClipBERT for video-and-language learning via sparse sampling. CVPR 2021. [paper][Code].
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos. ICCV 2021. [paper][Code].
- On the hidden treasure of dialog in video question answering. ICCV 2021. [paper][Code].
- MERLOT: Multimodal Neural Script Knowledge Models. NIPS 2021. [paper][Code].
- Learning from Inside: Self-driven Siamese Sampling and Reasoning for Video Question Answering. NIPS 2021. [paper].
- A comparative study of language transformers for video question answering. Neurocomputing 2021. [paper].
- Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering. arXiv 2021. [paper][Code].
- VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv 2021. [paper][Code].
- Revitalize Region Feature for Democratizing Video-Language Pre-training. arXiv 2021. [paper][Code].
- Revisiting the Video in Video-Language Understanding. CVPR 2022. [paper].
- MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound. CVPR 2022. [paper][Code].
- Video Graph Transformer for Video Question Answering. ECCV 2022. [paper][Code].

- Location-Aware Graph Convolutional Networks for Video Question Answering. AAAI 2020. [paper].
- Reasoning with heterogeneous graph alignment for video question answering. AAAI 2020. [paper][Code].
- Knowledge-based video question answering with unsupervised scene descriptions. ECCV 2020. [paper].
- Bridge To Answer: Structure-Aware Graph Interaction Network for Video Question Answering. CVPR 2021. [paper].
- HAIR: Hierarchical Visual-Semantic Relational Reasoning for Video Question Answering. ICCV 2021. [paper].
- Progressive Graph Attention Network for Video Question Answering. ACMMM 2021. [paper][Code].
- Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. ACL 2021. [paper][Code].
- DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. TMM 2021. [paper][Code].
- Object-Centric Representation Learning for Video Question Answering. IJCNN 2021. [paper].
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering. AAAI 2022. [paper][Code].
- Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering. TIP 2022. [paper].
- (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering. AAAI 2022. [paper].
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives. AAAI 2022. [paper].
- Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering. NAACL 2022. [paper].

- Question-aware tube-switch network for video question answering. ACMMM 2019. [paper].
- Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks. IJCAI 2019. [paper].
- Hierarchical conditional relation networks for video question answering. CVPR 2020. [paper][Code].
- Neural Reasoning, Fast and Slow, for Video Question Answering. IJCNN 2020. [paper].
- Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering. IJCAI 2021. [paper].

- CLEVRER: Collision events for video representation and reasoning. ICLR 2020. [paper][Code].
- Grounding physical concepts of objects and events through dynamic visual reasoning. ICLR 2021. [paper][Code].
- Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language. NIPS 2021. [paper][Code].
- STAR: A Benchmark for Situated Reasoning in Real-World Videos. NIPS 2021. [paper][Code].

- A joint sequence fusion model for video question answering and retrieval. ECCV 2018. [paper][Code].
- Structured Two-Stream Attention Network for Video Question Answering. AAAI 2019. [paper].
- Multi-interaction Network with Object Relation for Video Question Answering. ACMMM 2019. [paper].
- Learnable aggregating net with diversity learning for video question answering. ACMMM 2019. [paper].
- Compositional attention networks with two-stream fusion for video question answering. TIP 2019. [paper].
- Frame augmented alternating attention network for video question answering. TMM 2019. [paper][Code].
- Spatiotemporal-Textual Co-Attention Network for Video Question Answering. TOMM 2019. [paper].
- Modality shifting attention network for multi-modal video question answering. CVPR 2020. [paper].
- Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering. AAAI 2020. [paper].
- Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering. ACMMM 2020. [paper].
- Long video question answering: A Matching-guided Attention Model. PR 2020. [paper].
- SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning Over Traffic Events. CVPR 2021. [paper].
- Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments. ICCV 2021. [paper].
- Self-Supervised Pre-Training and Contrastive Representation Learning for Multiple-Choice Video QA. AAAI 2021. [paper].
- Pairwise VLAD Interaction Network for Video Question Answering. ACMMM 2021. [paper].
- Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering. TNNLS 2021. [paper].

- Open-Ended Long-form Video Question Answering via Adaptive Hierarchical Reinforced Networks. IJCAI 2018. [paper].
- Long-form video question answering via dynamic hierarchical reinforced networks. TIP 2019. [paper].
- Open-Ended Video Question Answering via Multi-Modal Conditional Adversarial Networks. TIP 2020. [paper].

- Video question answering via knowledge-based progressive spatial-temporal attention network. TOMM 2019. [paper].
- KnowIT VQA: Answering Knowledge-Based Questions about Videos. AAAI 2020. [paper][Code].
- Multichannel Attention Refinement for Video Question Answering. TOMM 2020. [paper].
- Transferring Domain-Agnostic Knowledge in Video Question Answering. BMVC 2021. [paper].

- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. EMNLP 2020. [paper].
- iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering. BMVC 2021. [paper][Code].

- Gaining extra supervision via multi-task learning for multi-modal video question answering. IJCNN 2019. [paper].
- TVQA+: Spatio-Temporal Grounding for Video Question Answering. ACL 2020. [paper][Code].
- Adversarial Multimodal Network for Movie Story Question Answering. TMM 2020. [paper].
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios. CVPR 2022. [paper][Code].

- Data augmentation techniques for the Video Question Answering task. ECCV 2020. [paper].
- Video Question Answering Using Language-Guided Deep Compressed-Domain Video Feature. ICCV 2021. [paper].