Improve error handling for empty directories in make_dataset #3495

Closed
@pmeier

🚀 Feature

Improve error handling for empty directories in make_dataset().

Motivation

datasets.folder.make_dataset() requires a class_to_idx argument, which it then uses to collect the instances:

for target_class in sorted(class_to_idx.keys()):
    class_index = class_to_idx[target_class]
    target_dir = os.path.join(directory, target_class)
    if not os.path.isdir(target_dir):
        continue
    for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):

Currently, make_dataset() is used in four places, and in all of them class_to_idx is generated in the same way:

  1. classes, class_to_idx = self._find_classes(self.root)

    where _find_classes() is defined as

    def _find_classes(dir):
        classes = [d.name for d in os.scandir(dir) if d.is_dir()]
        classes.sort()
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
        return classes, class_to_idx

  2. classes = sorted(list_dir(root))
    class_to_idx = {class_: i for (i, class_) in enumerate(classes)}
    self.samples = make_dataset(
        self.root,
        class_to_idx,
        extensions,
    )

  3. classes = list(sorted(list_dir(root)))
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    self.samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file=None)

  4. classes = list(sorted(list_dir(root)))
    class_to_idx = {classes[i]: i for i in range(len(classes))}
    self.samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file=None)

Furthermore, only DatasetFolder has a built-in check for whether make_dataset found any samples:

samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)
if len(samples) == 0:
    msg = "Found 0 files in subfolders of: {}\n".format(self.root)
    if extensions is not None:
        msg += "Supported extensions are: {}".format(",".join(extensions))
    raise RuntimeError(msg)

While this is better than passing silently and failing somewhere else (#2903), it still misses the underlying issue in the case of a directory without subfolders: no classes are found, so class_to_idx is empty and the loop over the classes never runs.
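To make the failure mode concrete, here is a minimal stdlib-only sketch. The helper names find_classes_current and collect_samples are hypothetical stand-ins that mirror the snippets quoted above (without extension filtering); they are not the torchvision functions themselves.

```python
import os
import tempfile


def find_classes_current(directory):
    # Mirrors the quoted _find_classes logic: one class per subfolder.
    classes = sorted(d.name for d in os.scandir(directory) if d.is_dir())
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx


def collect_samples(directory, class_to_idx):
    # Simplified stand-in for make_dataset(): accepts every file it finds.
    samples = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                samples.append((os.path.join(root, fname), class_index))
    return samples


with tempfile.TemporaryDirectory() as root:
    # A file placed directly in the root, with no class subfolders at all.
    open(os.path.join(root, "image.png"), "w").close()
    classes, class_to_idx = find_classes_current(root)
    samples = collect_samples(root, class_to_idx)

# No error is raised anywhere; all three results are simply empty.
print(classes, class_to_idx, samples)
```

Nothing in this path signals that the directory layout is wrong. The only symptom is an empty sample list, which is what the DatasetFolder check reports as "Found 0 files".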

Pitch

I propose three things:

  1. Factor out the implementation of the DatasetFolder._find_classes() method into a find_classes() function, similar to what we did with make_dataset in #3215 ('make_dataset' as staticmethod of 'DatasetFolder').
  2. Raise an expressive error in find_classes() if no classes were found.
  3. Make the class_to_idx parameter optional in make_dataset and call find_classes if it is omitted.

With this we stay as flexible as before while removing the duplicated code:

  • If one does not want the default behavior, class_to_idx can still be passed explicitly.
  • If one needs the returned classes, as the video datasets do, a call could look like this:
    self.classes, class_to_idx = find_classes(root)
    self.samples = make_dataset(root, class_to_idx, ...)
  • If one only needs the samples, calling self.samples = make_dataset(root, ...) is enough.
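The three proposed changes could be sketched roughly as follows. This is an illustration under assumptions, not the final implementation: the exception type (FileNotFoundError), the error message, and the simplified extension filtering are choices made for this example only, and is_valid_file is left out for brevity.

```python
import os
from typing import Dict, List, Optional, Tuple


def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:
    """Find class folders in `directory`; fail loudly if there are none."""
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        # Assumption: exception type and wording are placeholders.
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx


def make_dataset(
    directory: str,
    class_to_idx: Optional[Dict[str, int]] = None,
    extensions: Optional[Tuple[str, ...]] = None,
) -> List[Tuple[str, int]]:
    """Collect (path, class_index) pairs; class_to_idx is now optional."""
    if class_to_idx is None:
        # Default behavior: derive the mapping from the directory layout.
        _, class_to_idx = find_classes(directory)

    samples = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                # Simplified filter: accept everything if no extensions are given.
                if extensions is None or fname.lower().endswith(tuple(extensions)):
                    samples.append((os.path.join(root, fname), class_index))
    return samples
```

With this shape, a root directory without subfolders fails inside find_classes() with a message that names the actual problem, instead of surfacing later as "Found 0 files". Callers that pass class_to_idx explicitly skip find_classes() entirely, so their behavior is unchanged.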

cc @pmeier
