
ImageFolder performs poorly with large datasets #5317

Open
salieri opened this issue Dec 1, 2022 · 3 comments

Comments


salieri commented Dec 1, 2022

Describe the bug

While testing image dataset creation, I'm seeing significant performance bottlenecks with ImageFolder when scanning a directory structure containing a large number of images.

Setup

  • Nested directories (5 levels deep)
  • 3M+ images
  • 1 metadata.jsonl file

Performance Degradation Point 1

Degradation occurs because get_data_files_patterns runs the exact same directory scan once per candidate pattern, and there doesn't seem to be an easy way to limit this. The set of patterns tried is controlled by the definition of ALL_DEFAULT_PATTERNS.

One scan of 3M+ files takes about 10-15 minutes to complete on my setup, so the extra scans really slow things down: from roughly 10 minutes to 60+. Most of the scans return no matches, yet they still take a significant amount of time to complete, hence the poor performance.

As a side effect, when this scan is run on 3M+ image files, Python also consumes up to 12 GB of RAM, which is not ideal.
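As a workaround on my side, passing data_files explicitly seems (as far as I can tell) to skip the pattern guessing entirely and resolve only the patterns it's given, instead of every entry in ALL_DEFAULT_PATTERNS. A rough sketch, where the paths are placeholders for my real directory:

from datasets import load_dataset

# Explicit patterns so only these are resolved, instead of every
# entry in ALL_DEFAULT_PATTERNS; /some/path is a placeholder.
# The metadata file is listed too so it is still picked up.
dataset = load_dataset(
  'imagefolder',
  data_files={'train': ['/some/path/**/*.jpg', '/some/path/**/metadata.jsonl']},
  drop_labels=True,
)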

Performance Degradation Point 2

The second performance bottleneck is in PackagedDatasetModuleFactory.get_module, which calls DataFilesDict.from_local_or_remote.

It runs for a long time (60+ minutes) and consumes even more RAM than point 1 above. Based on iostat -d 2, it performs zero disk operations during this time, which suggests the bottleneck is in the code itself rather than in I/O, and could be sorted out.
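For reference, here's roughly how I've been confirming that the time is spent in Python code rather than in I/O: plain cProfile around the same call as in the repro below, with /some/path as a placeholder.

import cProfile
import pstats

from datasets import load_dataset

pr = cProfile.Profile()
pr.enable()
load_dataset('imagefolder', data_dir='/some/path', drop_labels=True)
pr.disable()

# show the 25 most expensive calls by cumulative time
pstats.Stats(pr).sort_stats('cumulative').print_stats(25)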

Steps to reproduce the bug

from datasets import load_dataset
import os
import huggingface_hub

dataset = load_dataset(
  'imagefolder',
  data_dir='/some/path',
  # just to spell it out:
  split=None,
  drop_labels=True,
  keep_in_memory=False
)

dataset.push_to_hub('account/dataset', private=True)

Expected behavior

While it's certainly possible to write a custom loader to replace ImageFolder, it'd be great if the off-the-shelf ImageFolder could, by default, scale to large datasets.

Or perhaps there could be a dedicated loader just for large datasets that trades flexibility for performance? As in, maybe you have to define explicitly how you want it to work, rather than having it guess your data structure the way _get_data_files_patterns() does?

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
  • Python version: 3.7.10
  • PyArrow version: 10.0.1
  • Pandas version: 1.3.5
salieri changed the title from "data_files._get_data_files_patterns() performs poorly with large imagefolders" to "ImageFolder performs poorly with large datasets" on Dec 1, 2022

lhoestq (Member) commented Dec 1, 2022

Hi! ImageFolder is indeed made for small-scale datasets. For large-scale image datasets, it's better to group your images in TAR archives or Arrow/Parquet files. This is true not just for ImageFolder loading performance, but also because having millions of individual files is not ideal for your filesystem or when moving the data around.

Option 1. use TAR archives

I'd suggest taking a look at how we load ImageNet, for example. The dataset is sharded into multiple TAR archives, and there is a loading script that iterates over the archives to load the images.
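Not the actual ImageNet script, just a rough sketch of the idea (the shard names and generate_examples are placeholders): a generator streams the images straight out of the TAR archives, so the filesystem never has to deal with millions of individual files.

import tarfile

from datasets import Dataset, Features, Image

shards = ["images-0000.tar", "images-0001.tar"]  # placeholder shard names

def generate_examples():
    # stream images directly out of the archives
    for shard in shards:
        with tarfile.open(shard) as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name.lower().endswith((".jpg", ".jpeg", ".png")):
                    yield {"image": {"path": member.name, "bytes": tar.extractfile(member).read()}}

ds = Dataset.from_generator(generate_examples, features=Features({"image": Image()}))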

Option 2. use Arrow/Parquet

You can load your images as an Arrow Dataset with

import glob

from datasets import Dataset, Image, load_from_disk, load_dataset

# recursive=True is needed for "**" to match nested directories
ds = Dataset.from_dict({"image": list(glob.glob("path/to/dir/**/*.jpg", recursive=True))})

def add_metadata(example):
    ...  # add your metadata fields to the example here
    return example

ds = ds.map(add_metadata, num_proc=16)  # num_proc for multiprocessing
ds = ds.cast_column("image", Image())

# save as Arrow locally
ds.save_to_disk("output_dir")
reloaded = load_from_disk("output_dir")

# OR save as Parquet on the HF Hub
ds.push_to_hub("username/dataset_name")
reloaded = load_dataset("username/dataset_name")
# reloaded = load_dataset("username/dataset_name", num_proc=16)  # to use multiprocessing

PS: maybe we can actually have something similar to ImageFolder but for image archives at some point?


salieri commented Dec 1, 2022

@lhoestq Thanks!

Perhaps it'd be worth adding a note in the documentation that ImageFolder is not intended for large datasets? This limitation is not obvious to someone who has not used it before, I think.


stevhliu (Member) commented Dec 1, 2022

Thanks for the feedback @salieri! I opened #5329 to make it clear ImageFolder is not intended for large datasets. Please feel free to comment if you have any other feedback! 🙂
