This repository contains scripts to download the OBELICS dataset in WebDataset format. It supports asynchronous fetching of image URLs and multi-threaded saving of image files to object-based storage and in .tar format.
The scripts work at the shard level: the OBELICS dataset has 1440 shards, and the data of each shard is stored as a single .tar file.
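The combination of asynchronous fetching and multi-threaded saving described above can be sketched as follows. This is an illustration only, not the code of this repository: fetch_and_save, the fetch callback (a coroutine taking a URL and returning bytes), the concurrency limits, and the <index>.jpg output names are all assumptions made for the example.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


async def fetch_and_save(urls, fetch, out_dir, max_concurrency=32, max_workers=8):
    """Fetch all URLs concurrently and write the results from a thread pool.

    `fetch` is a hypothetical coroutine `fetch(url) -> bytes`; any async HTTP
    client can be plugged in. URLs that fail to download are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests
    loop = asyncio.get_running_loop()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        async def handle(index, url):
            async with semaphore:
                try:
                    data = await fetch(url)
                except Exception:
                    return None  # dead or unreachable URL: skip it
            path = out_dir / f"{index}.jpg"  # hypothetical naming scheme
            # Blocking file I/O runs in the pool so the event loop stays free.
            await loop.run_in_executor(pool, path.write_bytes, data)
            return path

        results = await asyncio.gather(
            *(handle(i, url) for i, url in enumerate(urls, start=1))
        )
    return [path for path in results if path is not None]
```

The semaphore bounds the number of simultaneous requests, while the thread pool keeps disk writes from blocking the event loop. Skipping failed URLs is a design choice for this sketch; web image links often go stale, so a real run may also want to log them.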
Follow these steps:
- Download the OBELICS dataset to a specific directory with download_obelics.py (update cache_dir).
- Download the shard files in one of two ways:
  - Pass a shard file as an argument to save_as_objectstorage.py.
  - Download multiple shards using run_multi_shard.sh (update the regular expression to match the shard filenames).

Each shard is saved as a .tar file with the following layout:
dataset-000000.tar
├── 000001.json # Metadata with text and image references for sample 1
├── 000001_1.jpg # First image in sample 1
├── 000001_2.jpg # Second image in sample 1
├── 000002.json # Metadata with text and image references for sample 2
├── 000002_1.jpg # First image in sample 2
├── 000002_2.jpg # Second image in sample 2
├── 000002_3.jpg # Third image in sample 2
└── ... # More samples
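
To check a downloaded shard against the layout above, the archive members can be grouped by sample. The following is a minimal sketch using only the Python standard library; group_shard_members is a hypothetical helper, and it assumes the naming shown above (<key>.json for metadata, <key>_<n>.jpg for the n-th image).

```python
import tarfile
from collections import defaultdict


def group_shard_members(tar_path):
    """Group the member names of one shard .tar by sample key."""
    samples = defaultdict(lambda: {"json": None, "images": []})
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            stem, _, ext = member.name.rpartition(".")
            if ext == "json":
                samples[stem]["json"] = member.name
            elif ext == "jpg":
                key, _, index = stem.rpartition("_")
                if not key or not index.isdigit():
                    continue  # does not follow the <key>_<n>.jpg pattern
                samples[key]["images"].append((int(index), member.name))
    # Sort images by their numeric suffix so that _10 comes after _9.
    return {
        key: {
            "json": entry["json"],
            "images": [name for _, name in sorted(entry["images"])],
        }
        for key, entry in samples.items()
    }
```

Calling group_shard_members("dataset-000000.tar") returns a dictionary keyed by sample number, which makes it easy to spot samples whose metadata file or images are missing.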
Each .json file lists the images and text of one sample in the same order as in the dataset:
[
  {
    "image": "00000_1.jpg"
  },
  {
    "text": "some text1 with same order as in dataset"
  },
  {
    "image": "00000_2.jpg"
  },
  {
    "image": "00000_3.jpg"
  },
  {
    "text": "some text2 with same order as in dataset"
  }
]
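
A metadata file in this shape can be turned back into an interleaved image/text sequence with a few lines of Python. This is a sketch under the assumption that every entry has exactly one of the keys "image" or "text", as in the example above; interleave is a hypothetical helper name.

```python
import json


def interleave(metadata_json):
    """Return the sample as an ordered list of (kind, value) pairs.

    `kind` is "image" (value is the image file name inside the shard)
    or "text" (value is the text segment). Order is preserved.
    """
    sequence = []
    for item in json.loads(metadata_json):
        if "image" in item:
            sequence.append(("image", item["image"]))
        elif "text" in item:
            sequence.append(("text", item["text"]))
    return sequence
```

The returned list keeps the original ordering, so a data loader can walk it once and load each image file from the shard when it reaches the corresponding entry.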
I was able to download a single shard file in less than 30 minutes. Depending on hardware and network bandwidth, shard files can be downloaded in parallel with run_multi_shard.sh.
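
For readers who prefer Python over the shell script, the same idea (select shard files with a regular expression, then process several at once) can be sketched as below. This is not a port of run_multi_shard.sh, whose contents are not shown here. It assumes that save_as_objectstorage.py takes the shard file as its only argument, as described in the steps above; select_shards, run_shard, run_shards, and max_parallel are hypothetical names.

```python
import re
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def select_shards(shard_dir, pattern):
    """Return the files in shard_dir whose names fully match the regex, sorted."""
    regex = re.compile(pattern)
    return sorted(
        path
        for path in Path(shard_dir).iterdir()
        if path.is_file() and regex.fullmatch(path.name)
    )


def run_shard(shard_path):
    """Process one shard and return the exit code of the child process."""
    # Assumption: the script takes the shard file as its only argument.
    cmd = [sys.executable, "save_as_objectstorage.py", str(shard_path)]
    return subprocess.run(cmd).returncode


def run_shards(shard_paths, max_parallel=4):
    """Run several shards at once; returns one exit code per shard, in order."""
    # Threads are enough here: each one only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_shard, shard_paths))
```

A typical call would pass the result of select_shards(cache_dir, your_pattern) to run_shards, with the pattern adjusted to the actual shard filenames. A sensible value for max_parallel depends on available bandwidth and CPU cores, since each shard already fetches its images concurrently.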