We provide links to download our preprocessed datasets. If you would like to process the data on your own, we will soon provide scripts for doing so.
The pretraining datasets used in OFA are all publicly available. Here we provide the public links to these datasets. We recommend downloading the data from the links first and then processing the downloaded files into the same format as the examples we provide (see the conversion sketch after the list below).
- CC12M: https://github.com/google-research-datasets/conceptual-12m
- CC3M: https://github.com/google-research-datasets/conceptual-captions
- SBU: https://www.cs.virginia.edu/~vicente/sbucaptions
- COCO: https://cocodataset.org/#home
- VG: https://visualgenome.org/
- VQAv2: https://visualqa.org/
- GQA: https://cs.stanford.edu/people/dorarad/gqa/about.html
- RefCOCO/RefCOCO+/RefCOCOg: https://github.com/lichengunc/refer
- OpenImages: https://storage.googleapis.com/openimages/web/index.html
- Object365: https://www.objects365.org/overview.html
- YFCC100M (subset): https://github.com/openai/CLIP/blob/main/data/yfcc100m.md
- ImageNet-21K: https://image-net.org/index.php
- Pile: https://pile.eleuther.ai
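If you process the data yourself, the goal is to produce files that match the provided examples. As a rough illustration only: assuming the example files are tab-separated rows carrying an id, a caption, and a base64-encoded image (the column names and order here are assumptions; verify them against the actual example files before training), a minimal conversion sketch might look like this:

```python
import base64
import csv
from io import BytesIO

from PIL import Image


def encode_image(path):
    """Read an image from disk and return it as a base64 string."""
    img = Image.open(path).convert("RGB")
    buf = BytesIO()
    img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def write_tsv(records, out_path):
    """Write (uniq_id, image_id, caption, image_path) records as TSV rows.

    NOTE: the column order used here is an assumption for illustration;
    match it to the example files shipped with the preprocessed dataset.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for uniq_id, image_id, caption, image_path in records:
            writer.writerow([uniq_id, image_id, caption, encode_image(image_path)])


if __name__ == "__main__":
    # Hypothetical image path and caption, for illustration only.
    write_tsv(
        [(0, "COCO_val2014_000000391895", "a man riding a motorbike", "391895.jpg")],
        "caption_train.tsv",
    )
```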
- Dataset for Caption
- Dataset for RefCOCO
- Dataset for RefCOCO+
- Dataset for RefCOCOg
- Dataset for VQAv2 (we have also provided chunked parts of the dataset files for more convenient downloading; please refer to issue #68 and the merge sketch after this list)
- Dataset for SNLI-VE
- Dataset for Text-to-Image Generation
- Dataset for Text-to-Image Generation (with original id)
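If you download the chunked parts of the VQAv2 files, they need to be concatenated back into a single file before use. A minimal sketch, assuming the parts share a common naming pattern that sorts in the correct order (the `vqa_train.tsv.part*` pattern below is hypothetical; see issue #68 for the actual part names):

```python
from pathlib import Path


def merge_chunks(chunk_dir, pattern, out_path):
    """Concatenate chunked download parts back into one file.

    Parts are joined in lexicographic order, so this only works if the
    part names sort correctly (e.g. zero-padded numeric suffixes).
    """
    parts = sorted(Path(chunk_dir).glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())


# Hypothetical file names, for illustration only.
merge_chunks("downloads", "vqa_train.tsv.part*", "vqa_train.tsv")
```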