im-ajaymeena/obelics-dataset

Overview

This repository contains scripts to download the OBELICS dataset in WebDataset format. It supports asynchronous fetching of image URLs and multi-threaded saving of image files, both as object-based storage and in .tar format. The scripts work at the shard level: the OBELICS dataset has 1440 shards, and the data of each shard is stored as a single .tar file.
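To illustrate what asynchronous fetching of image URLs can look like, here is a minimal sketch using asyncio with a cap on concurrent requests. It is not the code from this repository: `fetch_all`, `fake_fetch`, and `max_concurrency` are hypothetical names, and the fetcher is passed in as a parameter so that any HTTP client could be plugged in. A failed download maps to None so that one unreachable URL does not abort the rest.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# Hypothetical type: any coroutine function that maps a URL to raw bytes.
Fetcher = Callable[[str], Awaitable[bytes]]


async def fetch_all(urls: Iterable[str], fetch: Fetcher,
                    max_concurrency: int = 64) -> Dict[str, Optional[bytes]]:
    """Fetch every URL concurrently; URLs that fail map to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str):
        async with semaphore:  # cap the number of in-flight requests
            try:
                return url, await fetch(url)
            except Exception:
                # Record the failure instead of aborting the whole batch.
                return url, None

    pairs = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP request, used only for this demo."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise IOError("unreachable")
    return url.encode()


def demo() -> Dict[str, Optional[bytes]]:
    urls = ["http://example.com/a.jpg", "http://example.com/broken.jpg"]
    return asyncio.run(fetch_all(urls, fake_fetch))
```

In a real run, `fake_fetch` would be replaced by a coroutine that performs the HTTP request with the client library of your choice.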

How to use

Follow these steps:

  • Download the OBELICS dataset to a directory of your choice with download_obelics.py (update cache_dir to set that directory).
  • Download the shard files in one of two ways:
    • Pass a shard file as an argument to save_as_objectstorage.py.
    • Download multiple shards with run_multi_shard.sh; update its regular expression to match the shard filenames.
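The multi-shard step can be pictured as "select shard files by regular expression, then call save_as_objectstorage.py once per file". The sketch below is a hypothetical Python equivalent, not the contents of run_multi_shard.sh. The function name `build_commands` is made up, the regular expression shown in the usage comment is a placeholder for whatever your shard filenames look like, and the only thing taken from the steps above is that the script receives the shard file as its argument.

```python
import re
import sys
from pathlib import Path
from typing import List


def build_commands(cache_dir: str, pattern: str) -> List[List[str]]:
    """Return one save_as_objectstorage.py command per matching shard file."""
    regex = re.compile(pattern)
    shards = sorted(
        p for p in Path(cache_dir).rglob("*")
        if p.is_file() and regex.search(p.name)
    )
    # Each command passes a single shard file as the only argument.
    return [[sys.executable, "save_as_objectstorage.py", str(p)] for p in shards]


# Usage (the pattern is a placeholder; adjust it to your shard filenames):
#   import subprocess
#   for cmd in build_commands("/path/to/cache_dir", r"\.parquet$"):
#       subprocess.run(cmd, check=True)
```

Building the command list separately from running it makes it easy to print and check which shards a pattern selects before starting a long download.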

.tar format

Contents of each .tar file

dataset-000000.tar
├── 000001.json           # Metadata with text and image references for sample 1
├── 000001_1.jpg          # First image in sample 1
├── 000001_2.jpg          # Second image in sample 1
├── 000002.json           # Metadata with text and image references for sample 2
├── 000002_1.jpg          # First image in sample 2
├── 000002_2.jpg          # Second image in sample 2
├── 000002_3.jpg          # Third image in sample 2
└── ...                   # More samples
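As a rough illustration of how one sample could be written in this layout, the sketch below uses Python's standard tarfile module. It is an assumption-laden example rather than the repository's implementation: `add_sample` and `demo_names` are hypothetical names, the six-digit zero-padded key is inferred from the listing above, and the sketch assumes the JSON entries reference images by their member names in the archive.

```python
import io
import json
import tarfile
from typing import List, Union

# A sample is an ordered list of text strings and raw image bytes.
Item = Union[str, bytes]


def add_sample(tar: tarfile.TarFile, index: int, items: List[Item]) -> None:
    """Write one sample as <index>.json followed by <index>_<n>.jpg members."""
    key = f"{index:06d}"
    metadata = []
    images = []

    for item in items:
        if isinstance(item, bytes):
            name = f"{key}_{len(images) + 1}.jpg"
            images.append((name, item))
            metadata.append({"image": name})
        else:
            metadata.append({"text": item})

    def add_member(name: str, payload: bytes) -> None:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    # The metadata file goes first, then the images in document order.
    add_member(f"{key}.json", json.dumps(metadata).encode("utf-8"))
    for name, payload in images:
        add_member(name, payload)


def demo_names() -> List[str]:
    """Write one sample to an in-memory tar and return its member names."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        add_sample(tar, 1, [b"img-a", "some text", b"img-b"])
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as tar:
        return tar.getnames()
```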

Content of each sample's JSON file

[
    {
        "image": "00000_1.jpg"
    },
    {
        "text": "some text1 with same order as in dataset"
    },
    {
        "image": "00000_2.jpg"
    },
    {
        "image": "00000_3.jpg"
    },
    {
        "text": "some text2 with same order as in dataset"
    }
]
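To read a shard back, one option is to walk the archive with Python's tarfile module and group the images with their JSON file. The sketch below is an illustration under assumptions, not code from this repository: `iter_samples` and `demo` are hypothetical names, and images are matched to a sample by the `<key>_` filename prefix seen in the tar listing rather than by the names stored inside the JSON.

```python
import io
import json
import tarfile
from typing import Dict, Iterator, List, Tuple

Sample = Tuple[str, List[dict], Dict[str, bytes]]


def iter_samples(tar: tarfile.TarFile) -> Iterator[Sample]:
    """Yield (key, metadata, images) for each .json member in the archive."""
    members = {m.name: m for m in tar.getmembers() if m.isfile()}
    for name in sorted(members):
        if not name.endswith(".json"):
            continue
        key = name[: -len(".json")]
        # The JSON holds the interleaved text and image entries, in order.
        metadata = json.load(tar.extractfile(members[name]))
        # Collect every image whose filename starts with this sample's key.
        images = {
            n: tar.extractfile(m).read()
            for n, m in members.items()
            if n.startswith(key + "_")
        }
        yield key, metadata, images


def demo() -> List[Sample]:
    """Build a tiny in-memory shard and read it back."""
    entries = [
        ("000001.json",
         json.dumps([{"image": "000001_1.jpg"}, {"text": "hello"}]).encode()),
        ("000001_1.jpg", b"fake-jpeg-bytes"),
    ]
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in entries:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    with tarfile.open(fileobj=buffer, mode="r") as tar:
        return list(iter_samples(tar))
```

For a shard on disk, the same generator can be used with `tarfile.open("dataset-000000.tar")`.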

Benchmarks

I was able to download a single shard file in less than 30 minutes; the exact time depends on hardware and network bandwidth. Shard files can be downloaded in parallel with run_multi_shard.sh.
