We have created a Japanese visual question answering (VQA) dataset using Yahoo! Crowdsourcing, based on images from the Visual Genome dataset. Our dataset is designed to be comparable to the free-form QA portion of the Visual Genome dataset. It contains 793,664 Japanese QA pairs over 99,208 images, with exactly eight QA pairs per image.
The annotations are stored in a single JSON file. The data format is a subset of the Visual Genome v1.2 format.
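As a quick orientation to the format, here is a minimal Python sketch for iterating over the QA pairs. It assumes the annotation file is named `question_answers.json` and follows the Visual Genome v1.2 question-answers schema (a list of per-image records, each with an `id` and a `qas` list of objects carrying `qa_id`, `image_id`, `question`, and `answer`); both the filename and the field names are assumptions drawn from the Visual Genome format, not guarantees about this release.

```python
import json

# Minimal sketch: load the annotation file and flatten it into
# (image_id, question, answer) triples. The filename and field
# names are assumed from the Visual Genome v1.2 schema.
with open("question_answers.json", encoding="utf-8") as f:
    entries = json.load(f)  # one record per image

qa_pairs = [
    (qa["image_id"], qa["question"], qa["answer"])
    for entry in entries
    for qa in entry["qas"]
]

print(f"{len(entries)} images, {len(qa_pairs)} QA pairs")
# Expected per the counts above: 99,208 images, 793,664 QA pairs
```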
This dataset is released under the Creative Commons Attribution 4.0 License (CC BY 4.0).
If you use this dataset, please cite the following paper:

@InProceedings{C18-1163,
author = "Shimizu, Nobuyuki
and Rong, Na
and Miyazaki, Takashi",
title = "Visual Question Answering Dataset for Bilingual Image Understanding: A Study of Cross-Lingual Transfer Using Attention Maps",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1918--1928",
location = "Santa Fe, New Mexico, USA",
url = "http://aclweb.org/anthology/C18-1163"
}