MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks have focused on high-resource languages like English and Chinese. Despite pioneering works that expand multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial "visual-textual misalignment" problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Furthermore, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. A comprehensive evaluation of numerous state-of-the-art Multimodal Large Language Models (MLLMs), including GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA dataset shows that there is still substantial room for performance improvement, underscoring the value of the dataset. Additionally, we supply multilingual training data within the MTVQA dataset and demonstrate that straightforward fine-tuning with this data can substantially enhance multilingual TEC-VQA performance. We hope that MTVQA will offer the research community fresh insights and stimulate further exploration in multilingual visual text comprehension.
|🍎 Project Page | 📖 Paper |📊 Dataset | 🏆 Leaderboard
2024.12.12 🌟 InternVL2.5 reports its performance on MTVQA. The InternVL2.5 78B model outperforms Qwen2-VL 72B and achieves SOTA performance. Congratulations to the InternVL2.5 team!
2024.09.29 🌟 The Blue LM team from VIVO tests BlueLM-V-3B on MTVQA. BlueLM-V-3B achieves performance comparable to GPT-4o and ranked third among all the SOTA MLLMs!
2024.09.09 🌟 We test GPT-4o mini on MTVQA, and it performs exceptionally well among the leading lightweight MLLMs!
2024.09.04 🌟 InternVL2 reports its performance on MTVQA. The InternVL2 76B model outperforms GPT-4V. Thanks to the InternVL2 team.
2024.08.30 🌟 Qwen2-VL 72B is released, outperforming GPT-4o and achieving the best performance overall. Congratulations!
2024.07.23 🌟 MTVQA is now supported in VLMEvalKit.
2024.07.23 🌟 MTVQA is now supported in OpenCompass.
2024.06.04 🌟 We are excited to launch MTVQA, the first multilingual visual text comprehension evaluation benchmark for MLLMs! MTVQA includes 9 widely used but low-resource languages, i.e., AR, DE, FR, IT, JA, KO, RU, TH, and VI.
2024.06.04 🌟 GPT-4o achieves the best performance overall, and MiniCPM-V2.5 achieves the best performance among open-source models!
| RawData (Google Drive) | Huggingface Dataset
The test code for evaluating models in the paper can be found in scripts.
If you want to add your results to the MTVQA leaderboard, feel free to email us directly at tangjingqun@bytedance.com, haoliu.0128@bytedance.com, or can.huang@bytedance.com.
Models | Open-Source | AR | DE | FR | IT | JA | KO | RU | TH | VI | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
InternVL2.5 78B🥇 | ✅ | - | - | - | - | - | - | - | - | - | 31.9 |
Qwen2-VL 72B🥈 | ✅ | 20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 15.6 | 17.7 | 41.6 | 30.9 |
GPT-4o 🥉 | ✘ | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
BlueLM-V-3B | ✘ | 17.3 | 39.5 | 44.7 | 32.2 | 23.5 | 34.0 | 9.2 | 20.3 | 22.9 | 27.0 |
Claude3 Opus | ✘ | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
Qwen2-VL 7B | ✅ | 15.5 | 32.1 | 41.6 | 38.9 | 17.8 | 30.6 | 13.0 | 10.8 | 30.0 | 25.6 |
GPT-4o mini | ✘ | 16.9 | 33.0 | 41.2 | 32.1 | 18.5 | 27.4 | 11.5 | 19.9 | 29.1 | 25.5 |
Gemini Ultra | ✘ | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
InternVL2 76B | ✅ | 9.5 | 31.3 | 35.7 | 35.2 | 11.1 | 14.3 | 11.9 | 10.0 | 26.9 | 22.0 |
GPT-4V | ✘ | 11.5 | 31.5 | 40.4 | 32.3 | 11.5 | 16.7 | 10.3 | 15.0 | 28.9 | 22.0 |
QwenVL Max | ✘ | 7.7 | 31.4 | 37.6 | 30.2 | 18.6 | 25.4 | 10.4 | 4.8 | 23.5 | 21.1 |
Claude3 Sonnet | ✘ | 10.5 | 28.9 | 35.6 | 31.8 | 13.9 | 22.2 | 11.0 | 15.2 | 20.8 | 21.1 |
QwenVL Plus | ✘ | 4.8 | 28.8 | 33.7 | 27.1 | 12.8 | 19.9 | 9.4 | 5.6 | 18.1 | 17.8 |
MiniCPM-V2.5 | ✅ | 6.1 | 29.6 | 35.7 | 26.0 | 12.1 | 13.1 | 5.7 | 12.6 | 15.3 | 17.3 |
InternVL-V1.5 | ✅ | 3.4 | 27.1 | 31.4 | 27.1 | 9.9 | 9.0 | 4.9 | 8.7 | 12.4 | 14.9 |
GLM4V | ✘ | 0.3 | 30.0 | 34.1 | 30.1 | 3.4 | 5.7 | 3.0 | 3.5 | 12.3 | 13.6 |
TextSquare | ✅ | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 |
Mini-Gemini-HD-34B | ✅ | 2.2 | 25.0 | 29.2 | 25.5 | 6.1 | 8.6 | 4.1 | 4.3 | 11.8 | 13.0 |
InternLM-Xcomposer2-4KHD | ✅ | 2.0 | 20.6 | 23.2 | 21.6 | 5.6 | 7.7 | 4.1 | 6.1 | 10.1 | 11.2 |
Llava-Next-34B | ✅ | 3.3 | 24.0 | 28.0 | 22.3 | 3.6 | 6.1 | 2.6 | 0.4 | 9.8 | 11.1 |
TextMonkey | ✅ | 2.0 | 18.1 | 19.9 | 22.1 | 4.6 | 7.2 | 3.2 | 0.9 | 11.1 | 9.9 |
MiniCPM-V2.0 | ✅ | 1.3 | 12.7 | 14.9 | 17.0 | 3.7 | 5.6 | 2.2 | 2.2 | 6.8 | 7.4 |
mPLUG-DocOwl 1.5 | ✅ | 1.0 | 13.9 | 14.9 | 18.2 | 2.9 | 5.0 | 2.0 | 0.9 | 6.4 | 7.2 |
YI-VL-34B | ✅ | 1.7 | 13.5 | 15.7 | 12.1 | 4.8 | 5.2 | 0.8 | 3.5 | 4.1 | 6.8 |
DeepSeek-VL | ✅ | 0.6 | 14.2 | 15.3 | 15.2 | 2.9 | 3.8 | 1.6 | 0.9 | 5.2 | 6.6 |
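
This README does not define how the Average column is computed. As a rough sanity check, the sketch below assumes it is the unweighted mean of the nine per-language scores; that is an assumption, not the authors' stated method. The names `SCORES` and `macro_average` are hypothetical, and the numbers are copied from two rows of the table above.

```python
# Assumption: "Average" is the unweighted mean over the nine languages.
# This is NOT confirmed by the README; treat the result as an approximation.

LANGUAGES = ["AR", "DE", "FR", "IT", "JA", "KO", "RU", "TH", "VI"]

# Per-language scores copied from the leaderboard table (two example rows).
SCORES = {
    "Qwen2-VL 72B": [20.7, 36.5, 44.1, 42.8, 21.6, 37.4, 15.6, 17.7, 41.6],
    "GPT-4o": [20.2, 34.2, 41.2, 32.7, 20.0, 33.9, 11.5, 22.5, 34.2],
}


def macro_average(per_language_scores):
    """Return the unweighted mean of one model's per-language scores."""
    if len(per_language_scores) != len(LANGUAGES):
        raise ValueError("expected one score per language")
    return sum(per_language_scores) / len(per_language_scores)


for model, scores in SCORES.items():
    print(f"{model}: {macro_average(scores):.1f}")
```

For these two rows the unweighted mean rounds to the listed averages (30.9 and 27.8). Not every row matches to the last digit under this assumption, so the official average may be computed or rounded differently.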
If you wish to refer to the baseline results published here, please use the following BibTeX entry:
@misc{tang2024mtvqa,
  title={MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering},
  author={Jingqun Tang and Qi Liu and Yongjie Ye and Jinghui Lu and Shu Wei and Chunhui Lin and Wanqing Li and Mohamad Fitri Faiz Bin Mahmood and Hao Feng and Zhen Zhao and Yanjie Wang and Yuliang Liu and Hao Liu and Xiang Bai and Can Huang},
  year={2024},
  eprint={2405.11985},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
Your access to and use of this dataset are at your own risk. We do not guarantee the accuracy of this dataset. The dataset is provided “as is” and we make no warranty or representation to you with respect to it and we expressly disclaim, and hereby expressly waive, all warranties, express, implied, statutory or otherwise. This includes, without limitation, warranties of quality, performance, merchantability or fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. In no event will we be liable to you on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this public license or use of the licensed material. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.