Detection for nested folders #623

sizov-kirill · 2022-01-12T10:30:33Z

Summary

Datasets are often distributed in archives, and users usually extract them into a separate directory. As a result, the dataset can end up in a subfolder. This PR resolves the problem where Datumaro cannot detect the dataset format with such a folder structure.
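The nested layout described above can be sketched roughly as follows. This is only an illustration and not Datumaro's implementation: the helper name `descend_single_subdir`, the `max_depth` parameter, and the list of ignored OS-specific names are all assumptions made for the example.

```python
import os
import os.path as osp
import tempfile

# Hypothetical set of OS-specific entries to skip; the real list may differ.
IGNORED_NAMES = {".DS_Store", "Thumbs.db", "__MACOSX"}

def descend_single_subdir(path: str, max_depth: int = 2) -> str:
    """Follow a chain of lone subdirectories, at most max_depth levels down."""
    for _ in range(max_depth):
        entries = [e for e in os.listdir(path) if e not in IGNORED_NAMES]
        if len(entries) != 1:
            break  # empty directory, or several entries: stop here
        child = osp.join(path, entries[0])
        if not osp.isdir(child):
            break  # the only entry is a file, nothing to descend into
        path = child
    return path

# An extracted archive: <root>/archive/dataset/{images,annotations}
with tempfile.TemporaryDirectory() as root:
    os.makedirs(osp.join(root, "archive", "dataset", "images"))
    os.makedirs(osp.join(root, "archive", "dataset", "annotations"))
    # An OS-specific file next to the dataset folder should not stop the search.
    open(osp.join(root, "archive", ".DS_Store"), "w").close()

    found = descend_single_subdir(root)
    rel = osp.relpath(found, root)

print(rel)
```

In this sketch the search stops at the first directory that holds more than one meaningful entry, which is where the dataset files would be expected to live.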

How to test

Checklist

I submit my changes into the develop branch
I have added description of my changes into CHANGELOG
I have updated the documentation accordingly
I have added tests to cover my changes
I have linked related issues

License

I submit my code changes under the same MIT License that covers the project.
Feel free to contact the maintainers if that's a concern.
I have updated the license header for each file (see an example below)

# Copyright (C) 2021 Intel Corporation
#
# SPDX-License-Identifier: MIT


datumaro/components/dataset.py

zhiltsov-max

Please also add a test, and raise an error if the directory does not exist.

datumaro/components/dataset.py

zhiltsov-max · 2022-01-17T10:44:31Z

tests/test_dataset.py

+    def test_can_detect_with_nested_folder(self):
+        env = Environment()
+        env.importers.items = {DEFAULT_FORMAT: env.importers[DEFAULT_FORMAT]}
+        env.extractors.items = {DEFAULT_FORMAT: env.extractors[DEFAULT_FORMAT]}


The test is generally fine, but consider extending it with a generic "fits everything" format (like image dir) to test for collisions and to check that we continue searching until we find a single match. Maybe it should be a separate test.

It looks like the situation where we have more than one match and continue searching until we find a single match will never occur. We continue searching only if the path contains a single file system object and that object is a directory. But if we have more than one match for a path, the path most likely contains more than one file system object, so these two situations contradict each other.

For example, we could detect "image_dir" and "imagenet" in any root image directory, but only if we go deeper will we detect a COCO dataset. I know it is implemented already, I am just asking for a test that ensures this logic works.

Example:

a/annotations/x_instances.json
a/images/y.jpg
^ here we detect a few general matching formats (currently imagedir, imagenet, celeba, align_celeba, cifar, etc.)

And inside a/ we detect COCO.

Added this test.
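The case discussed in this thread (several generic formats match at the top level, while a single specific format matches one level down) could be exercised roughly as below. This is a sketch under assumptions, not the Datumaro code: `detect_nested`, the `match_formats` callback, and the toy matcher are hypothetical stand-ins for the real detection machinery.

```python
import os
import os.path as osp
import tempfile
from typing import Callable, List

def detect_nested(path: str, match_formats: Callable[[str], List[str]],
        depth: int = 2) -> List[str]:
    """Return format matches for path, looking into a lone subdirectory
    while the result is not a single match (hypothetical helper)."""
    if depth < 0:
        raise ValueError("Depth cannot be less than zero")
    if not osp.exists(path):
        raise FileNotFoundError(path)

    matches = match_formats(path)
    for _ in range(depth):
        if len(matches) == 1 or not osp.isdir(path):
            break
        entries = os.listdir(path)
        if len(entries) != 1 or not osp.isdir(osp.join(path, entries[0])):
            break  # descend only through a single nested directory
        path = osp.join(path, entries[0])
        deeper = match_formats(path)
        if deeper:
            # Assumption: matches found deeper replace the ambiguous outer ones.
            matches = deeper
    return matches

def toy_matcher(path: str) -> List[str]:
    # Toy rules invented for the example; real format detection is richer.
    if osp.isdir(osp.join(path, "annotations")):
        return ["coco"]
    # Generic formats that accept almost any directory.
    return ["image_dir", "imagenet"]

with tempfile.TemporaryDirectory() as root:
    os.makedirs(osp.join(root, "a", "annotations"))
    os.makedirs(osp.join(root, "a", "images"))
    open(osp.join(root, "a", "annotations", "x_instances.json"), "w").close()
    open(osp.join(root, "a", "images", "y.jpg"), "w").close()

    nested = detect_nested(root, toy_matcher)            # looks inside a/
    shallow = detect_nested(root, toy_matcher, depth=0)  # top level only

print(nested, shallow)
```

With `depth=0` the ambiguous top-level matches are returned unchanged, which makes the difference between the two calls easy to assert in a test.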

zhiltsov-max · 2022-01-17T10:45:12Z

datumaro/components/dataset.py

@@ -911,6 +911,9 @@ def detect(path: str, env: Optional[Environment] = None,
        if env is None:
            env = Environment()

+        if not osp.exists(path):
+            raise FileNotFoundError(path)


Consider moving it to Environment.detect_dataset()

I moved it, but it looks like this will affect #576 a bit.

kirill.sizov added 3 commits January 12, 2022 13:23

Add enhancement: try to detect dataset format in a subdirectory if only one exists

7c77733

Sort imports

c7bc96d

Merge branch 'develop' into sk/detection-for-nested-folders

dae2643

sizov-kirill changed the title ~~Detection for nested folders~~ [WIP] Detection for nested folders Jan 12, 2022

Parse revision path first

ddddaed

sizov-kirill changed the title ~~[WIP] Detection for nested folders~~ Detection for nested folders Jan 12, 2022

sizov-kirill requested a review from zhiltsov-max January 13, 2022 06:00

zhiltsov-max reviewed Jan 13, 2022

datumaro/components/dataset.py (outdated, resolved)

Ignore OS specific files

210dddd

zhiltsov-max reviewed Jan 14, 2022

datumaro/components/dataset.py (resolved)

sizov-kirill changed the title ~~Detection for nested folders~~ [WIP] Detection for nested folders Jan 14, 2022

zhiltsov-max mentioned this pull request Jan 14, 2022

Check for an extra leading directory in the uploaded dataset archives cvat-ai/cvat#3849

Open

2 tasks

kirill.sizov added 2 commits January 17, 2022 12:00

Add tests

d329807

Raise exception for non-existent path

c394103

sizov-kirill changed the title ~~[WIP] Detection for nested folders~~ Detection for nested folders Jan 17, 2022

zhiltsov-max reviewed Jan 17, 2022

kirill.sizov added 4 commits January 18, 2022 12:17

Raise exception when depth has negative value

3b4ebd5

Add test

10b12aa

Move exception

11125d9

Fix test

a50104e

zhiltsov-max approved these changes Jan 18, 2022

zhiltsov-max merged commit 3617dee into develop Jan 18, 2022

zhiltsov-max deleted the sk/detection-for-nested-folders branch January 18, 2022 10:43

sizov-kirill mentioned this pull request Mar 9, 2022

Detection for nested folders (CLI) #680

Merged

7 tasks
