Since the original host server of the EST-VQA dataset is no longer available, we provide download links for the dataset in this repository.
We also release the test annotations here, so you no longer need to use EvalAI for evaluation.
Google Drive: [Images Train] [Images Test] [Annotations Train] [Annotations Test]
Baidu Netdisk: [Images](code: dcmn) [Annotations](code: e4qe)
You can use eval.py
to evaluate your model on the EST-VQA dataset. Simply convert your prediction file to the same format as pred_sample.json
and run the following command:
python eval.py --pred_file PATH_TO_PRED --gt_file PATH_TO_GT
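Below is a minimal conversion sketch. It assumes the prediction file is a JSON list of objects with `question_id` and `answer` fields; check pred_sample.json for the exact field names and adjust accordingly.

```python
# Minimal conversion sketch (assumed format: a JSON list of
# {"question_id": ..., "answer": ...} entries; verify against pred_sample.json
# before running evaluation).
import json


def convert_predictions(raw_results, out_path="pred.json"):
    """Convert a {question_id: answer} dict of model outputs into the assumed list format."""
    records = [{"question_id": qid, "answer": ans} for qid, ans in raw_results.items()]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Hypothetical example output from your model.
    convert_predictions({"est_vqa_000001": "coffee shop"})
```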
Some of the results are borrowed from this paper.
Year | Venue | Model | LLM-based | EST-VQA (En) | EST-VQA (CN) | Overall |
---|---|---|---|---|---|---|
2023 | ICML | BLIP2-OPT-6.7B | Y | 40.7 | 0.0 | |
2023 | NeurIPS | InstructBLIP | Y | 48.6 | 0.1 | |
2023 | arXiv | mPLUG-Owl | Y | 52.7 | 0.0 | |
2023 | arXiv | LLaVAR | Y | 58.2 | 0.0 | |
2023 | NeurIPS | LLaVA-1.5-7B | Y | 52.3 | 0.0 | |
2024 | AAAI | BLIVA | Y | 51.2 | 0.2 | |
2024 | CVPR | mPLUG-Owl2 | Y | 68.6 | 4.9 | |
2024 | CVPR | Monkey | Y | 71.0 | 42.6 | |
If you find EST-VQA useful in your research, please cite it using the following BibTeX:
@inproceedings{wang2020general,
  title={On the general value of evidence, and bilingual scene-text visual question answering},
  author={Wang, Xinyu and Liu, Yuliang and Shen, Chunhua and Ng, Chun Chet and Luo, Canjie and Jin, Lianwen and Chan, Chee Seng and Hengel, Anton van den and Wang, Liangwei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10126--10135},
  year={2020}
}