DataComp-1B is a dataset of 1.4 billion image-text pairs collected from Common Crawl and subsequently filtered. It is derived from CommonPool and was released as part of DataComp, a benchmark for designing multimodal datasets.
DataComp-1B comprises the best-performing subset of the xlarge version of CommonPool found by Gadre et al. (2023).
See http://datacomp.ai/ and https://arxiv.org/abs/2304.14108 for details.
DataComp-1B can be downloaded using img2dataset by following the instructions at https://github.com/mlfoundations/datacomp/tree/main#downloading-datacomp-1b
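
As a rough illustration of the img2dataset step, the sketch below assembles arguments for img2dataset's Python entry point. It is not the official DataComp download script, which remains the recommended route. It assumes the metadata has already been fetched as parquet files with `url` and `text` columns. The helper names, the directory names `metadata` and `shards`, and the worker counts are hypothetical placeholders.

```python
from pathlib import Path


def build_img2dataset_kwargs(metadata_dir: str, output_dir: str) -> dict:
    """Assemble keyword arguments for img2dataset.download.

    Assumption: metadata_dir holds parquet files with "url" and "text"
    columns. Worker counts are placeholders, not tuned values.
    """
    return {
        "url_list": str(Path(metadata_dir)),      # folder of parquet metadata
        "input_format": "parquet",
        "url_col": "url",                         # assumed column name
        "caption_col": "text",                    # assumed column name
        "output_format": "webdataset",            # write tar shards
        "output_folder": str(Path(output_dir)),
        "processes_count": 4,                     # placeholder
        "thread_count": 32,                       # placeholder
    }


def run_download(metadata_dir: str, output_dir: str) -> None:
    """Start the download; requires img2dataset and network access."""
    # Imported here so the helper above works without img2dataset installed.
    from img2dataset import download

    download(**build_img2dataset_kwargs(metadata_dir, output_dir))
```

Calling `run_download("metadata", "shards")` would start fetching images into `shards`. At the scale of DataComp-1B this needs substantial storage and bandwidth, so the repository instructions should be treated as authoritative for the actual settings.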