ImageFolder performs poorly with large datasets #5317
Hi! ImageFolder is indeed made for small-scale datasets. For large-scale image datasets, you'd better group your images into TAR archives or Arrow/Parquet files. This is true not just for ImageFolder loading performance, but also because having millions of files is not ideal for your filesystem or when moving the data around.

Option 1. Use TAR archives

I'd suggest you take a look at how we load ImageNet, for example. The dataset is sharded into multiple TAR archives, and there is a script that iterates over the archives to load the images; a rough sketch of that pattern is shown below.
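(Not the actual ImageNet loading script – just a minimal sketch of the idea, assuming a datasets version that provides Dataset.from_generator; the shard names and the .jpg filter are placeholders.)

```python
import tarfile

from datasets import Dataset, Features, Image

def image_examples(tar_paths):
    # Yield one example per image file found in each TAR shard
    for tar_path in tar_paths:
        with tarfile.open(tar_path) as tar:
            for member in tar:
                if member.isfile() and member.name.endswith(".jpg"):
                    yield {"image": {"bytes": tar.extractfile(member).read(), "path": member.name}}

shards = ["images-000.tar", "images-001.tar"]  # placeholder shard names
ds = Dataset.from_generator(
    image_examples,
    gen_kwargs={"tar_paths": shards},
    features=Features({"image": Image()}),
)
```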
Option 2. Use Arrow/Parquet

You can load your images as an Arrow Dataset:

```python
import glob

from datasets import Dataset, Image, load_from_disk, load_dataset

ds = Dataset.from_dict({"image": list(glob.glob("path/to/dir/**/*.jpg", recursive=True))})

def add_metadata(example):
    ...  # add your metadata columns here

ds = ds.map(add_metadata, num_proc=16)  # num_proc for multiprocessing
ds = ds.cast_column("image", Image())

# save as Arrow locally
ds.save_to_disk("output_dir")
reloaded = load_from_disk("output_dir")

# OR save as Parquet on the HF Hub
ds.push_to_hub("username/dataset_name")
reloaded = load_dataset("username/dataset_name")
# reloaded = load_dataset("username/dataset_name", num_proc=16)  # to use multiprocessing
```

PS: maybe we could actually have something similar to ImageFolder but for image archives at some point?
@lhoestq Thanks! Perhaps it'd be worth adding a note in the documentation that ImageFolder is intended for small-scale datasets.
Describe the bug
While testing image dataset creation, I'm seeing significant performance bottlenecks with ImageFolder when scanning a directory structure with a large number of images.
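For reference, a minimal call of the kind that triggers the scan (the path is a placeholder, not the exact invocation used):

```python
from datasets import load_dataset

# Point ImageFolder at a large directory tree of images plus a metadata.jsonl
ds = load_dataset("imagefolder", data_dir="path/to/images")
```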
Setup

A directory tree of 3M+ images with a metadata.jsonl file.

Performance Degradation Point 1
Degradation occurs because get_data_files_patterns runs the exact same scan for many different types of patterns, and there doesn't seem to be a way to easily limit this. It's controlled by the definition of ALL_DEFAULT_PATTERNS.

One scan with 3M+ files takes about 10-15 minutes to complete on my setup, so having those extra scans really slows things down – from 10 minutes to 60+. Most of the scans return no matches, but they still take a significant amount of time to complete – hence the poor performance.

As a side effect, when this scan is run on 3M+ image files, Python also consumes up to 12 GB of RAM, which is not ideal.
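To make the multiplying effect concrete, here is a rough, hypothetical timing sketch (the directory path and the pattern count are placeholders, not values taken from the library):

```python
import glob
import time

DATA_DIR = "path/to/imagefolder"  # placeholder

start = time.perf_counter()
n_files = len(glob.glob(f"{DATA_DIR}/**/*", recursive=True))  # one full directory scan
one_scan = time.perf_counter() - start

# If a comparable scan is repeated for each default pattern that is tried,
# the total cost grows roughly linearly with the number of patterns.
n_patterns_tried = 6  # illustrative only; the real count depends on ALL_DEFAULT_PATTERNS
print(f"{n_files} files, one scan: {one_scan:.1f}s, "
      f"estimated {n_patterns_tried} scans: {one_scan * n_patterns_tried:.1f}s")
```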
Performance Degradation Point 2
The second performance bottleneck is in PackagedDatasetModuleFactory.get_module, which calls DataFilesDict.from_local_or_remote. It runs for a long time (60+ minutes), consuming significant amounts of RAM – even more than point 1 above. Based on iostat -d 2, it performs zero disk operations, which suggests to me that there is a code-based bottleneck there that could be sorted out.

Steps to reproduce the bug
Expected behavior
While it's certainly possible to write a custom loader to replace ImageFolder with, it'd be great if the off-the-shelf ImageFolder had a default setup that can scale to large datasets.

Or perhaps there could be a dedicated loader just for large datasets that trades off flexibility for performance? As in, maybe you have to define explicitly how you want it to work rather than it trying to guess your data structure the way _get_data_files_patterns() does? A rough sketch of what that could look like with the existing data_files argument is below.
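(A sketch only – paths are placeholders, and I haven't measured how much of the scanning cost an explicit pattern actually avoids.)

```python
from datasets import load_dataset

# One explicit glob per split, so the loader doesn't have to guess the layout
ds = load_dataset(
    "imagefolder",
    data_files={"train": "path/to/images/train/**/*.jpg"},
)
```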
Environment info

datasets version: 2.7.1