
ImageFolder performs poorly with large datasets #5317

Open
salieri opened this issue Dec 1, 2022 · 3 comments

Comments


salieri commented Dec 1, 2022

Describe the bug

While testing image dataset creation, I'm seeing significant performance bottlenecks with ImageFolder when scanning a directory structure containing a large number of images.

Setup

  • Nested directories (5 levels deep)
  • 3M+ images
  • 1 metadata.jsonl file

Performance Degradation Point 1

Degradation occurs because get_data_files_patterns runs the exact same directory scan once per candidate pattern, and there doesn't seem to be an easy way to limit this. The set of patterns tried is controlled by the definition of ALL_DEFAULT_PATTERNS.

One scan of 3M+ files takes about 10-15 minutes to complete on my setup, so the extra scans really slow things down: from roughly 10 minutes to 60+. Most of the scans return no matches, yet they still take a significant amount of time to complete, hence the poor performance.

As a side effect, when this scan is run on 3M+ image files, Python also consumes up to 12 GB of RAM, which is not ideal.
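As a workaround on my side, passing data_files explicitly seems (as far as I can tell) to skip the pattern guessing entirely and resolve only the patterns it's given, instead of every entry in ALL_DEFAULT_PATTERNS. A rough sketch, where the paths are placeholders for my real directory:

from datasets import load_dataset

# Explicit patterns so only these are resolved, instead of every
# entry in ALL_DEFAULT_PATTERNS; /some/path is a placeholder.
# The metadata file is listed too so it is still picked up.
dataset = load_dataset(
  'imagefolder',
  data_files={'train': ['/some/path/**/*.jpg', '/some/path/**/metadata.jsonl']},
  drop_labels=True,
)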

Performance Degradation Point 2

The second performance bottleneck is in PackagedDatasetModuleFactory.get_module, which calls DataFilesDict.from_local_or_remote.

It runs for a long time (60+ minutes) and consumes even more RAM than point 1 above. Based on iostat -d 2, it performs zero disk operations during this time, which suggests the bottleneck is in the code itself rather than in I/O, and could be sorted out.
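For reference, here's roughly how I've been confirming that the time is spent in Python code rather than in I/O: plain cProfile around the same call as in the repro below, with /some/path as a placeholder.

import cProfile
import pstats

from datasets import load_dataset

pr = cProfile.Profile()
pr.enable()
load_dataset('imagefolder', data_dir='/some/path', drop_labels=True)
pr.disable()

# show the 25 most expensive calls by cumulative time
pstats.Stats(pr).sort_stats('cumulative').print_stats(25)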

Steps to reproduce the bug

from datasets import load_dataset
import os
import huggingface_hub

dataset = load_dataset(
  'imagefolder',
  data_dir='/some/path',
  # just to spell it out:
  split=None,
  drop_labels=True,
  keep_in_memory=False
)

dataset.push_to_hub('account/dataset', private=True)

Expected behavior

While it's certainly possible to write a custom loader to replace ImageFolder, it'd be great if the off-the-shelf ImageFolder could, by default, scale to large datasets.

Or perhaps there could be a dedicated loader just for large datasets that trades flexibility for performance? As in, maybe you have to define explicitly how you want it to work, rather than having it guess your data structure the way _get_data_files_patterns() does?

Environment info

  • datasets version: 2.7.1
  • Platform: Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
  • Python version: 3.7.10
  • PyArrow version: 10.0.1
  • Pandas version: 1.3.5
salieri changed the title from "data_files._get_data_files_patterns() performs poorly with large imagefolders" to "ImageFolder performs poorly with large datasets" on Dec 1, 2022

lhoestq (Member) commented Dec 1, 2022

Hi! ImageFolder is indeed made for small-scale datasets. For large-scale image datasets, it's better to group your images in TAR archives or Arrow/Parquet files. This is true not just for ImageFolder loading performance, but also because having millions of individual files is not ideal for your filesystem or when moving the data around.

Option 1. use TAR archives

I'd suggest taking a look at how we load ImageNet, for example. The dataset is sharded into multiple TAR archives, and there is a loading script that iterates over the archives to load the images.
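Not the actual ImageNet script, just a rough sketch of the idea (the shard names and generate_examples are placeholders): a generator streams the images straight out of the TAR archives, so the filesystem never has to deal with millions of individual files.

import tarfile

from datasets import Dataset, Features, Image

shards = ["images-0000.tar", "images-0001.tar"]  # placeholder shard names

def generate_examples():
    # stream images directly out of the archives
    for shard in shards:
        with tarfile.open(shard) as tar:
            for member in tar.getmembers():
                if member.isfile() and member.name.lower().endswith((".jpg", ".jpeg", ".png")):
                    yield {"image": {"path": member.name, "bytes": tar.extractfile(member).read()}}

ds = Dataset.from_generator(generate_examples, features=Features({"image": Image()}))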

Option 2. use Arrow/Parquet

You can load your images as an Arrow Dataset with

import glob

from datasets import Dataset, Image, load_from_disk, load_dataset

# recursive=True is needed for "**" to match nested directories
ds = Dataset.from_dict({"image": list(glob.glob("path/to/dir/**/*.jpg", recursive=True))})

def add_metadata(example):
    ...  # add your metadata fields to the example here
    return example

ds = ds.map(add_metadata, num_proc=16)  # num_proc for multiprocessing
ds = ds.cast_column("image", Image())

# save as Arrow locally
ds.save_to_disk("output_dir")
reloaded = load_from_disk("output_dir")

# OR save as Parquet on the HF Hub
ds.push_to_hub("username/dataset_name")
reloaded = load_dataset("username/dataset_name")
# reloaded = load_dataset("username/dataset_name", num_proc=16)  # to use multiprocessing

PS: maybe we can actually have something similar to ImageFolder but for image archives at some point?


salieri commented Dec 1, 2022

@lhoestq Thanks!

Perhaps it'd be worth adding a note in the documentation that ImageFolder is not intended for large datasets? This limitation is not obvious to someone who has not used it before, I think.


stevhliu (Member) commented Dec 1, 2022

Thanks for the feedback @salieri! I opened #5329 to make it clear ImageFolder is not intended for large datasets. Please feel free to comment if you have any other feedback! 🙂
