GitHub - im-ajaymeena/obelics-dataset

Overview

This repository contains scripts to download the OBELICS dataset in WebDataset format. It supports asynchronous fetching of image URLs and multi-threaded saving of image files, both to object-based storage and in .tar format. The scripts work at the shard level: the OBELICS dataset has 1440 shards, and each shard's data is stored as a single .tar file.
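
The combination of asynchronous fetching and threaded saving described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: `fetch_all`, `write_shard_tar`, `download_shard`, the `fetch` callable, and the `000001.jpg` member naming are all hypothetical, and a real `fetch` would wrap an HTTP client.

```python
import asyncio
import io
import tarfile


async def fetch_all(urls, fetch, max_concurrency=8):
    """Fetch every URL concurrently; a failed download becomes None."""
    semaphore = asyncio.Semaphore(max_concurrency)  # cap in-flight requests

    async def fetch_one(url):
        async with semaphore:
            try:
                return await fetch(url)
            except Exception:
                return None  # dead links are skipped rather than fatal

    return await asyncio.gather(*(fetch_one(url) for url in urls))


def write_shard_tar(fileobj, files):
    """Write {member_name: bytes} into a tar archive (blocking I/O)."""
    with tarfile.open(fileobj=fileobj, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


async def download_shard(urls, fetch, fileobj):
    """Fetch all images of one shard, then save them from a worker thread."""
    payloads = await fetch_all(urls, fetch)
    # Hypothetical naming: position in the URL list, zero-padded to 6 digits.
    files = {
        f"{index:06d}.jpg": data
        for index, data in enumerate(payloads, start=1)
        if data is not None
    }
    loop = asyncio.get_running_loop()
    # Run the blocking tar write in the default thread pool.
    await loop.run_in_executor(None, write_shard_tar, fileobj, files)
    return list(files)
```

Passing the `fetch` coroutine in as a parameter keeps the sketch independent of any particular HTTP library.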

How to use

Follow these steps:

Download the OBELICS dataset to a directory of your choice with download_obelics.py (update cache_dir first).
There are two ways to download shard files:
- Pass a shard file as an argument to save_as_objectstorage.py.
- Download multiple shards with run_multi_shard.sh; update the regular expressions to match the shard filenames.
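
As a rough Python analogue of the multi-shard option, the sketch below selects shard files with a regular expression and runs save_as_objectstorage.py once per shard in parallel. It is not a transcription of run_multi_shard.sh: the helper names, the example filename pattern, and the assumption that the script takes the shard path as its only positional argument are illustrative guesses.

```python
import re
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def select_shards(filenames, pattern):
    """Return the filenames matching the regular expression, sorted."""
    regex = re.compile(pattern)
    return sorted(name for name in filenames if regex.search(name))


def build_command(shard_path, script="save_as_objectstorage.py"):
    """Build the command line for one shard.

    Assumption: the script accepts the shard file as a single positional
    argument, as the usage notes suggest. Check the script before relying on it.
    """
    return [sys.executable, script, str(shard_path)]


def run_shards(shard_paths, max_workers=4):
    """Run one subprocess per shard, at most max_workers at a time."""

    def run(path):
        return subprocess.run(build_command(path), check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, shard_paths))
```

For example, `select_shards(names, r"train-0000[0-1]-")` would keep only the first two shards if the files follow a `train-NNNNN-...` naming scheme, which is itself an assumption about the downloaded filenames.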

`.tar` format

Content of each tar file

dataset-000000.tar
├── 000001.json           # Metadata with text and image references for sample 1
├── 000001_1.jpg          # First image in sample 1
├── 000001_2.jpg          # Second image in sample 1
├── 000002.json           # Metadata with text and image references for sample 2
├── 000002_1.jpg          # First image in sample 2
├── 000002_2.jpg          # Second image in sample 2
├── 000002_3.jpg          # Third image in sample 2
└── ...                   # More samples
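
To illustrate how this layout can be consumed, here is a minimal sketch that groups the members of one shard archive by sample ID using only the standard library. The function name `group_members_by_sample` is hypothetical, and the sketch assumes every image is named `<sample_id>_<n>.<ext>` as in the listing above.

```python
import tarfile
from collections import defaultdict


def group_members_by_sample(tar):
    """Map each sample ID to its JSON metadata file and its image files."""
    samples = defaultdict(lambda: {"json": None, "images": []})
    for member in tar.getmembers():
        if not member.isfile():
            continue
        stem, _, extension = member.name.rpartition(".")
        if extension == "json":
            # e.g. "000001.json" -> sample "000001"
            samples[stem]["json"] = member.name
        else:
            # e.g. "000001_2.jpg" -> sample "000001"
            sample_id = stem.split("_", 1)[0]
            samples[sample_id]["images"].append(member.name)
    return dict(samples)
```

A typical call would be `with tarfile.open("dataset-000000.tar") as tar: samples = group_members_by_sample(tar)`. Images are kept in archive order rather than sorted by name.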

Content of the JSON file above

[
    {
        "image": "00000_1.jpg"
    },
    {
        "text": "some text1 with same order as in dataset"
    },
    {
        "image": "00000_2.jpg"
    },
    {
        "image": "00000_3.jpg"
    },
    {
        "text": "some text2 with same order as in dataset"
    }
]
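
Because the JSON file is an ordered list of single-key objects, the interleaved sequence of text and images can be rebuilt with a short loop. The sketch below assumes each entry has exactly one of the keys `"image"` or `"text"`, as in the example above; the function name `parse_sample` is hypothetical.

```python
import json


def parse_sample(metadata_json):
    """Return the sample as an ordered list of (kind, value) pairs."""
    entries = json.loads(metadata_json)
    items = []
    for entry in entries:
        if "image" in entry:
            items.append(("image", entry["image"]))  # filename inside the tar
        elif "text" in entry:
            items.append(("text", entry["text"]))
        # Entries with neither key are ignored in this sketch.
    return items
```

The order of the returned pairs matches the order of the entries in the file, so the original interleaving of text and images is preserved.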

Benchmarks

I was able to download a single shard file in less than 30 minutes; the exact time depends on hardware and network bandwidth. Shard files can be downloaded in parallel with run_multi_shard.sh.
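
For a rough sense of scale, the figures above (1440 shards, under 30 minutes each) can be turned into an upper-bound estimate of total download time. The sketch assumes that parallel downloads scale linearly, which shared network bandwidth may well prevent; `estimate_hours` is a hypothetical helper.

```python
import math


def estimate_hours(total_shards=1440, minutes_per_shard=30.0, parallel_workers=1):
    """Upper-bound wall-clock hours, assuming perfect linear scaling."""
    # Shards are processed in rounds of `parallel_workers` at a time.
    rounds = math.ceil(total_shards / parallel_workers)
    return rounds * minutes_per_shard / 60


print(estimate_hours())                    # 720.0 hours, about 30 days, sequentially
print(estimate_hours(parallel_workers=8))  # 90.0 hours with 8 parallel downloads
```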

