Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detection for nested folders #623

Merged
merged 11 commits into from
Jan 18, 2022
Merged

Conversation

sizov-kirill
Copy link

Summary

Datasets are often contains in archives and users usually extract them in a separate directory. This leads to the fact that the dataset can be stored in subfolders, so this PR resolved problem when Datumaro cannot detect dataset format with such folder structure.

How to test

Checklist

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.
  • I have updated the license header for each file (see an example below)
# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT

@sizov-kirill sizov-kirill changed the title Detection for nested folders [WIP] Detection for nested folders Jan 12, 2022
@sizov-kirill sizov-kirill changed the title [WIP] Detection for nested folders Detection for nested folders Jan 12, 2022
Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add a test and an error, if the directory does not exist

datumaro/components/dataset.py Show resolved Hide resolved
@sizov-kirill sizov-kirill changed the title Detection for nested folders [WIP] Detection for nested folders Jan 14, 2022
@sizov-kirill sizov-kirill changed the title [WIP] Detection for nested folders Detection for nested folders Jan 17, 2022
def test_can_detect_with_nested_folder(self):
env = Environment()
env.importers.items = {DEFAULT_FORMAT: env.importers[DEFAULT_FORMAT]}
env.extractors.items = {DEFAULT_FORMAT: env.extractors[DEFAULT_FORMAT]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is generally fine, but consider extending it with a generic "fits everything" format (like image dir) to test for collisions and that we continue search till we find one match. Maybe, it should be in another test.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's look that such situation when we have more than one match and continue search till we find one match, will occurs never. Because we continue search if our path has only one file system object and this object it's a directory, but it for this path we have more than one match it's mean that this path most likely contains more than one file system object, so that these two situations are contradictory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, we could detect "image_dir" and "imagenet" in any root image directory, but only if we go deeper, we we'll detect a COCO dataset, for example. I know it is implemented already, just asking for a test that ensures this logic works.

Example:

a/annotations/x_instances.json
a/images/y.jpg
^
here we detect few general matching formats (currently, imagedir, imagenet, celeba, align_celeba, cifar etc.)

And inside a we detect COCO.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this test.

@@ -911,6 +911,9 @@ def detect(path: str, env: Optional[Environment] = None,
if env is None:
env = Environment()

if not osp.exists(path):
raise FileNotFoundError(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider moving it to Environment.detect_dataset()

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it, but it looks like it will affect the #576 a bit.

@zhiltsov-max zhiltsov-max merged commit 3617dee into develop Jan 18, 2022
@zhiltsov-max zhiltsov-max deleted the sk/detection-for-nested-folders branch January 18, 2022 10:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants