Text-related VQA is a fine-grained branch of the VQA task that focuses only on questions which require reading the textual content shown in the input image.
- NewsVideoQA dataset (WACV2023) [Project][Paper]
- ViteVQA dataset (NeurIPS 2022) [Project][Paper]
- VisualMRC dataset (AAAI 2021) [Project][Paper]
- EST-VQA dataset (CVPR 2020) [Project][Paper]
- DOC-VQA dataset (CVPR Workshop 2020) [Project][Paper]
- Text-VQA dataset (CVPR 2019) [Project][Paper]
- ST-VQA dataset (ICCV 2019) [Project][Paper]
- OCR-VQA dataset (ICDAR 2019) [Project][Paper]
Dataset | #Train+Val Img | #Train+Val Que | #Test Img | #Test Que | Image Source | Language |
---|---|---|---|---|---|---|
Text-VQA | 25,119 | 39,602 | 3,353 | 5,734 | [1] | EN |
ST-VQA | 19,027 | 26,308 | 2,993 | 4,163 | [2, 3, 4, 5, 6, 7, 8] | EN |
OCR-VQA | 186,775 | 901,717 | 20,797 | 100,429 | [9] | EN |
EST-VQA | 17,047 | 19,362 | 4,000 | 4,525 | [4, 5, 8, 10, 11, 12, 13] | EN+CH |
DOC-VQA | 11,480 | 44,812 | 1,287 | 5,188 | [14] | EN |
VisualMRC | 7,960 | 23,854 | 2,237 | 6,708 | self-collected webpage screenshot | EN |
ViteVQA (Task1Split1) | 5,969 | 19,840 | 971 | 3,183 | YouTube | EN |
Image Source:
[1] OpenImages: A public dataset for large-scale multi-label and multi-class image classification (v3) [dataset]
[2] Imagenet: A large-scale hierarchical image database [dataset]
[3] Vizwiz grand challenge: Answering visual questions from blind people [dataset]
[4] ICDAR 2013 robust reading competition [dataset]
[5] ICDAR 2015 competition on robust reading [dataset]
[6] Visual Genome: Connecting language and vision using crowdsourced dense image annotations [dataset]
[7] Image retrieval using textual cues [dataset]
[8] Coco-text: Dataset and benchmark for text detection and recognition in natural images [dataset]
[9] Judging a book by its cover [dataset]
[10] Total Text [dataset]
[11] SCUT-CTW1500 [dataset]
[12] MLT [dataset]
[13] Chinese Street View Text [dataset]
[14] UCSF Industry Document Library [dataset]
ICDAR 2021 Competition on Document Visual Question Answering (DocVQA) Submission Deadline: 31st March 2021 [Challenge]
Document Visual Question Answering (CVPR 2020 Workshop on Text and Documents in the Deep Learning Era) Submission Deadline: 30 April 2020 [Challenge]
- [RUArt] RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering (T-MM) [Paper][Project]
- [BOV++] Beyond OCR + VQA: Towards end-to-end reading and reasoning for robust and accurate textvqa (PR) [Paper]
- [TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation (PR) [Paper][Project]
- [ViteVQA] Towards Video Text Visual Question Answering: Benchmark and Baseline (NeurIPS) [Paper][Project]
- [LaTr] LaTr: Layout-Aware Transformer for Scene-Text VQA (CVPR) [Paper][Unofficial Code]
- [TIG] Text-instance graph: Exploring the relational semantics for text-based visual question answering (PR) [Paper]
- [SMA] Structured Multimodal Attentions for TextVQA (T-PAMI) [Paper][Project]
- [DA-Net] Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering (arXiv) [Paper]
- [SceneGATE] SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering (arXiv) [Paper]
- [MLCI] Multi-level, multi-modal interactions for visual question answering over text in images (WWW) [Paper][Project]
- [TWA] From Token to Word: OCR Token Evolution via Contrastive Learning and Semantic Matching for Text-VQA (ACM MM) [Paper][Project]
- [TAG] TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation (BMVC) [Paper][Project]
- [MGEN] Modality-Specific Multimodal Global Enhanced Network for Text-Based Visual Question Answering (ICME) [Paper]
- [SC-Net] Towards Escaping from Language Bias and OCR Error: Semantics-Centered Text Visual Question Answering (ICME) [Paper]
- [EKTVQA] EKTVQA: Generalized Use of External Knowledge to Empower Scene Text in Text-VQA (Access) [Paper]
- [Two-stage fusion] Two-stage Multimodality Fusion for High-performance Text-based Visual Question Answering (ACCV) [Paper]
- [VisualMRC] VisualMRC: Machine Reading Comprehension on Document Images (AAAI) [Paper][Project]
- [SSBaseline] Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps (AAAI) [Paper][code]
- [SA-M4C] Spatially Aware Multimodal Transformers for TextVQA (ECCV) [Paper][Project][Code]
- [EST-VQA] On the General Value of Evidence, and Bilingual Scene-Text Visual Question Answering (CVPR) [Paper]
- [M4C] Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA (CVPR) [Paper][Project]
- [LaAP-Net] Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering (COLING) [Paper]
- [CRN] Cascade Reasoning Network for Text-based Visual Question Answering (ACM MM) [Paper][Project]
- [Text-VQA/LoRRA] Towards VQA Models That Can Read (CVPR) [Paper][Code]
- [ST-VQA] Scene Text Visual Question Answering (ICCV) [Paper]
- [Text-KVQA] From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason (ICCV) [Paper]
- [OCR-VQA] OCR-VQA: Visual Question Answering by Reading Text in Images (ICDAR) [Paper]
- [DiagNet] DiagNet: Bridging Text and Image [Report][Code]
- [DCD_ZJU] Winner of 2019 Text-VQA challenge [Slides]
- [Schwail] Runner-up of 2019 Text-VQA challenge [Slides]
Acc.: Accuracy | I. E.: Image Encoder | Q. E.: Question Encoder | O. E.: OCR Token Encoder | Ensem.: Ensemble
[official leaderboard(2019)] [official leaderboard(2020)]
Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2019--CVPR | LoRRA | 26.64 | Faster R-CNN | GloVe | Rosetta-ml | FastText | Classification | N |
2019--N/A | DCD_ZJU | 31.44 | Faster R-CNN | BERT | Rosetta-ml | FastText | Classification | Y |
2020--CVPR | M4C | 40.46 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
2020--Challenge | Xiangpeng | 40.77 | ||||||
2020--Challenge | colab_buaa | 44.73 | ||||||
2020--Challenge | CVMLP(SAM) | 44.80 | ||||||
2020--Challenge | NWPU_Adelaide_Team(SMA) | 45.51 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
2020--ECCV | SA-M4C | 44.6* | Faster R-CNN (ResNext-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
2020--arXiv | TAP | 53.97* | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |
2022--arXiv | TAG | 53.63 | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |
* Using external data for training.
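The Acc. column above is the standard VQA accuracy metric used by the Text-VQA challenge: each question comes with 10 human answers, and a prediction counts as fully correct once at least 3 annotators gave it. A minimal sketch of the commonly used simplified form (the official evaluator additionally averages over 9-annotator subsets and applies answer normalization):

```python
def vqa_accuracy(prediction: str, human_answers: list) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1).

    `human_answers` is the list of (typically 10) annotator answers
    for one question. Matching here is plain lowercase string equality;
    the official evaluator also strips punctuation and articles.
    """
    prediction = prediction.strip().lower()
    matches = sum(ans.strip().lower() == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)
```

A dataset-level score is then the mean of this value over all test questions, usually reported as a percentage.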
[official leaderboard]
T1 : Strongly Contextualised Task
T2 : Weakly Contextualised Task
T3 : Open Dictionary
Y-C./J. | Methods | Acc. (T1/T2/T3) | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2020--CVPR | M4C | na/na/0.4621 | Faster R-CNN (ResNet-101) | BERT | Rosetta-en | FastText | Decoder | N |
2020--Challenge | SMA | 0.5081/0.3104/0.4659 | Faster R-CNN | BERT | BDN | Graph Attention | Decoder | N |
2020--ECCV | SA-M4C | na/na/0.5042 | Faster R-CNN (ResNext-152) | BERT | Google-OCR | FastText+PHOC | Decoder | N |
2020--arXiv | TAP | na/na/0.5967 | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |
2022--arXiv | TAG | na/na/0.6019 | Faster R-CNN (ResNext-152) | BERT | Microsoft-OCR | FastText+PHOC | Decoder | N |
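The ST-VQA scores above are reported as ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss answers instead of exact-match accuracy. A minimal sketch of the metric, assuming the standard threshold τ = 0.5 from the official evaluation:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[len(b)]

def anls(predictions, ground_truths, tau=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings, one per question.
    ground_truths: list of lists of acceptable answers per question.
    Per answer pair, similarity is 1 - NL if the normalized distance NL
    is below tau, else 0; each question takes the best ground truth.
    """
    total = 0.0
    for pred, gts in zip(predictions, ground_truths):
        best = 0.0
        for gt in gts:
            nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The threshold keeps garbled OCR copies from earning credit: an answer more than half-wrong in edit distance scores zero rather than a small positive value.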
Y-C./J. | Methods | Acc. | I. E. | Q. E. | OCR | O. E. | Output | Ensem. |
---|---|---|---|---|---|---|---|---|
2020--CVPR | M4C | 63.9 | Faster R-CNN | BERT | Rosetta-en | FastText | Decoder | N |