SBU Captions is a large-scale dataset of about 860K image-text pairs, together with additional meta-attributes that make it usable for training a variety of models. It is commonly used as a benchmark dataset.
wget https://www.cs.rice.edu/~vo9/sbucaptions/sbu-captions-all.tar.gz
tar -xvzf sbu-captions-all.tar.gz
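Before starting the download, it can help to confirm that the extracted metadata has the columns the img2dataset command expects. The sketch below assumes `sbu-captions-all.json` is a single JSON object mapping column names such as `image_urls` and `captions` to lists; that layout is an assumption, and `summarize_metadata` is a hypothetical helper, not part of img2dataset.

```python
import json
from pathlib import Path


def summarize_metadata(path, url_col="image_urls", caption_col="captions"):
    """Return (number of URLs, number of captions) found in the metadata file.

    Assumes the file is a JSON object of column name -> list of values.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of column name -> list")
    for col in (url_col, caption_col):
        if col not in data:
            raise KeyError(f"column {col!r} not found; available: {sorted(data)}")
    return len(data[url_col]), len(data[caption_col])


if __name__ == "__main__":
    metadata = Path("sbu-captions-all.json")
    if metadata.exists():
        n_urls, n_captions = summarize_metadata(metadata)
        print(f"{n_urls} URLs, {n_captions} captions")
```

If the two counts differ, or a column is missing, the `--url_col` and `--caption_col` values probably need adjusting before the download is launched.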
img2dataset --url_list sbu-captions-all.json --input_format "json" --url_col "image_urls" --caption_col "captions" --output_format webdataset --output_folder sbucaptions --processes_count 16 --thread_count 64 --image_size 256
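With `--output_format webdataset`, the output folder holds tar shards in which the files belonging to one sample share a key and differ by extension. A minimal sketch for spot-checking one shard with only the Python standard library follows. The shard name `00000.tar`, the `.txt` caption extension, and the `read_captions` helper are assumptions for illustration, so adjust them to what the output folder actually contains.

```python
import tarfile
from pathlib import Path


def read_captions(shard_path, limit=5):
    """Return up to `limit` (sample key, caption) pairs from one tar shard."""
    pairs = []
    with tarfile.open(shard_path, "r") as shard:
        for member in shard:
            if len(pairs) >= limit:
                break
            # Only caption files are read; images and per-sample JSON are skipped.
            if not (member.isfile() and member.name.endswith(".txt")):
                continue
            handle = shard.extractfile(member)
            caption = handle.read().decode("utf-8")
            pairs.append((Path(member.name).stem, caption))
    return pairs


if __name__ == "__main__":
    shard = Path("sbucaptions") / "00000.tar"  # assumed shard name
    if shard.exists():
        for key, caption in read_captions(shard):
            print(key, caption)
```

Reading only the caption entries keeps the check fast, since no image is decoded.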
https://wandb.ai/rom1504/img2dataset/runs/2nhepsmf
About 1000 samples/s using 16 cores.
Average bandwidth: 500Mb/s; CPU usage: 100% on all cores.
Disk write speed: about 20MB/s on average.
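As a rough sanity check, the throughput figures can be turned into an expected run time and output size. The sketch below assumes all 860K URLs are processed at the reported rate and that the disk write speed holds for the whole run; `estimate_run` is a hypothetical helper, and real runs will differ as URLs fail or bandwidth varies.

```python
def estimate_run(num_samples, samples_per_second, write_mb_per_second):
    """Return (minutes, gigabytes written) for a run at constant throughput."""
    seconds = num_samples / samples_per_second
    written_gb = seconds * write_mb_per_second / 1000  # decimal GB
    return seconds / 60, written_gb


minutes, gigabytes = estimate_run(860_000, 1000, 20)
print(f"about {minutes:.1f} minutes, about {gigabytes:.1f} GB written")
# prints: about 14.3 minutes, about 17.2 GB written
```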