
Using local dataset but blocked at load_dataset #1703

Closed
BrandonHanx opened this issue Dec 14, 2022 · 4 comments
BrandonHanx commented Dec 14, 2022

Hi, @pcuenca @patil-suraj @anton-l
I'm trying to fully fine-tune the SD U-Net on my own dataset, which contains about 1M image-text pairs.
I'm following the script in examples/text_to_image.

However, it gets stuck at load_dataset with very slow processing: after 12 hours, the preparation of these 1M pairs is still not finished.

data_files = {}
if args.train_data_dir is not None:
    data_files["train"] = os.path.join(args.train_data_dir, "**")
dataset = load_dataset(
    "imagefolder",
    data_files=data_files,
    cache_dir=args.cache_dir,
)

I tried a smaller dataset with only a few thousand image-text pairs and everything worked fine.
So I was just wondering if this slow-loading process is expected for large datasets.

OS: Ubuntu 20.04, GPU: 8 x 32GB V100, dependencies: installed according to the current installation instructions

@haofanwang
Contributor

For large-scale datasets, it is better to follow open_clip.
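For reference, open_clip's data pipeline is built on webdataset, which streams image-text pairs from pre-packed TAR shards instead of scanning millions of individual files. A minimal sketch of that idea, assuming the data has already been packed into shards containing {key}.jpg / {key}.txt files (the shard names and keys below are hypothetical, and this is not the exact open_clip pipeline):

# Hypothetical sketch of webdataset-style loading, assuming shards named
# pairs-000000.tar ... pairs-000099.tar with {key}.jpg / {key}.txt entries.
import webdataset as wds

urls = "pairs-{000000..000099}.tar"

dataset = (
    wds.WebDataset(urls)
    .decode("pil")            # decode .jpg entries into PIL images
    .to_tuple("jpg", "txt")   # yield (image, caption) tuples
)

# Quick sanity check: iterate a few samples
for i, (image, caption) in enumerate(dataset):
    print(image.size, caption[:50])
    if i == 3:
        break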

@BrandonHanx
Author

For large-scale datasets, it is better to follow open_clip.

Hi @haofanwang, thanks for your reply. Does huggingface datasets also suffer from this when loading officially supported datasets, like LAION? I suspect the slow processing I'm seeing is not expected.

@anton-l
Member

anton-l commented Dec 14, 2022

Hi @BrandonHanx! At the moment imagefolder becomes suboptimal when used with more than a couple of thousand images. Quoting @lhoestq's reply from huggingface/datasets#5317 here:

For large-scale image datasets you are better off grouping your images in TAR archives or Arrow/Parquet files. This is true not just for ImageFolder loading performance, but also because having millions of files is not ideal for your filesystem or when moving the data around.

Option 1. use TAR archives

I'd suggest you take a look at how we load ImageNet, for example. The dataset is sharded into multiple TAR archives and there is a script that iterates over the archives to load the images.
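As a rough illustration of that pattern (not the actual ImageNet loading script), assuming shards like shard-00000.tar that contain {key}.jpg / {key}.txt pairs, a generator over the archives could look like this:

# Hypothetical sketch of iterating TAR shards; not the actual ImageNet script.
import io
import tarfile
from pathlib import Path

from PIL import Image


def iterate_shards(shard_dir):
    """Yield (PIL image, caption) pairs from every *.tar shard in shard_dir."""
    for shard in sorted(Path(shard_dir).glob("*.tar")):
        with tarfile.open(shard) as tar:
            members = {m.name: m for m in tar.getmembers() if m.isfile()}
            for name, member in members.items():
                if not name.endswith(".jpg"):
                    continue
                caption_name = name[: -len(".jpg")] + ".txt"
                if caption_name not in members:
                    continue
                image = Image.open(io.BytesIO(tar.extractfile(member).read()))
                caption = tar.extractfile(members[caption_name]).read().decode("utf-8")
                yield image, caption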

Option 2. use Arrow/Parquet

You can load your images as an Arrow Dataset with

import glob

from datasets import Dataset, Image, load_from_disk, load_dataset

ds = Dataset.from_dict({"image": glob.glob("path/to/dir/**/*.jpg", recursive=True)})

def add_metadata(example):
    ...

ds = ds.map(add_metadata, num_proc=16)  # num_proc for multiprocessing
ds = ds.cast_column("image", Image())

# save as Arrow locally
ds.save_to_disk("output_dir")
reloaded = load_from_disk("output_dir")

# OR save as Parquet on the HF Hub
ds.push_to_hub("username/dataset_name")
reloaded = load_dataset("username/dataset_name")
# reloaded = load_dataset("username/dataset_name", num_proc=16)  # to use multiprocessing
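The add_metadata function above is deliberately left open. One possible (purely hypothetical) version, assuming each image has its caption stored next to it in a same-named .txt file, could be:

# Hypothetical add_metadata: read the caption from a sidecar .txt file
# (my_image.jpg -> my_image.txt). At this point example["image"] is still a
# file path, since cast_column(Image()) runs after map(). Adjust to however
# your captions are actually stored.
import os

def add_metadata(example):
    caption_path = os.path.splitext(example["image"])[0] + ".txt"
    with open(caption_path, encoding="utf-8") as f:
        example["text"] = f.read().strip()
    return example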

@BrandonHanx
Author

Hi @anton-l
Thank you very much for your reply.
Following the suggestion above, I saved my dataset as Arrow and managed to run the training.

Now I'm closing this issue.
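For anyone landing here later: a minimal sketch of the change this implies in the examples/text_to_image script, assuming the Arrow dataset was saved with save_to_disk as above (the exact script you use may differ):

# Hypothetical sketch: instead of load_dataset("imagefolder", ...), point the
# training script at the Arrow dataset saved with save_to_disk above.
from datasets import load_from_disk

dataset = {"train": load_from_disk("output_dir")}  # same "output_dir" as above

# The rest of the script can keep using dataset["train"] as before,
# e.g. dataset["train"].column_names to find the image/caption columns.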
