A dataset of crowdsourced ratings for machine-generated image captions.
Image captioning models automatically generate natural language descriptions for input images. To assess the quality of these models, researchers conduct human evaluations in which raters judge the quality of model-generated captions for previously unseen images.
In this dataset, we provide a compilation of human ratings for thousands of images and their corresponding machine-generated captions, collected over the years from our evaluations. See Google Crowdsource for our evaluation setup.
Though this human evaluation setup works well during development, it cannot assess the quality of a caption in real time once a model is deployed to serve live production traffic. This dataset can be used to build an automatic Quality Estimation (QE) model for image captioning that estimates caption quality on the fly.
The dataset can be downloaded as a zip of tab-separated values (TSV) files for the train/dev/test splits, along with the corresponding image metadata. Images have been sampled from the Open Images Dataset. Please see the metadata files or the Open Images website to download the image files.
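Once unzipped, each split can be read with standard TSV tooling. The snippet below is a minimal sketch of loading a split with Python's `csv` module; the column names (`image_id`, `caption`, `rating`) are assumptions for illustration — consult the README.txt in the downloaded zip for the actual field names.

```python
import csv
import io

# Hypothetical sample row standing in for one line of a split TSV file;
# the real column names are documented in the README.txt inside the zip.
sample_tsv = "image_id\tcaption\trating\nabc123\ta dog on a beach\t1\n"

def load_ratings(fileobj):
    """Parse a tab-separated ratings file into a list of row dicts."""
    reader = csv.DictReader(fileobj, delimiter="\t")
    return list(reader)

# In practice, replace io.StringIO(...) with open("train.tsv") or similar.
rows = load_ratings(io.StringIO(sample_tsv))
print(rows[0]["caption"])  # -> a dog on a beach
```

`csv.DictReader` keys each row by the header line, so downstream code can refer to fields by name rather than by column position.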
Released: August 2019
Fold  | Samples | Unique Images
----- | ------- | -------------
Train | 58,354  | 11,027
Dev   | 2,392   | 654
Test  | 4,592   | 1,237
Released: September 2019
Fold  | Samples | Unique Images
----- | ------- | -------------
Train | 129,759 | 28,525
Dev   | 7,151   | 3,444
Test  | 7,135   | 3,442
Released: June 2019
During the Conceptual Captions Challenge Workshop at CVPR 2019, we released a human ratings dataset for image captions called the T2 Dataset. It contains ratings for the top 5 models in the challenge (see the Leaderboard). The images in this set are disjoint from the images in all other versions above, so we recommend using it as a test set for all versions of the Image Caption Quality Dataset.
See the README.txt file in the downloaded zip for details on each version.
If you use this dataset in your research, please cite our paper:
@article{icqd2019,
title={Quality Estimation for Image Captions Based on Large-scale Human Evaluations},
author={T. Levinboim and A. Thapliyal and P. Sharma and R. Soricut},
journal={arXiv preprint arXiv:1909.03396},
year={2019}
}
If you have a technical question regarding the dataset or publication, please create an issue in this repository. This is the fastest way to reach us.
If you would like to share feedback or report concerns regarding the data, please see the OWNERS file for our contact information.