
Commit

Fix validator and add notebooks and document for level-up validator (#933)

<!-- Contributing guide:
https://github.com/openvinotoolkit/datumaro/blob/develop/CONTRIBUTING.md
-->

### Summary

<!--
Resolves #111 and #222.
Depends on #1000 (for series of dependent commits).

This PR introduces this capability to make the project better in this
and that.

- Added this feature
- Removed that feature
- Fixed the problem #1234
-->

### How to test
<!-- Describe the testing procedure for reviewers, if changes are
not fully covered by unit tests or manual testing can be complicated.
-->

### Checklist
<!-- Put an 'x' in all the boxes that apply -->
- [ ] I have added unit tests to cover my changes.
- [ ] I have added integration tests to cover my changes.
- [ ] I have added the description of my changes into
[CHANGELOG](https://github.com/openvinotoolkit/datumaro/blob/develop/CHANGELOG.md).
- [ ] I have updated the
[documentation](https://github.com/openvinotoolkit/datumaro/tree/develop/docs)
accordingly.

### License

- [ ] I submit _my code changes_ under the same [MIT
License](https://github.com/openvinotoolkit/datumaro/blob/develop/LICENSE)
that covers the project.
  Feel free to contact the maintainers if that's a concern.
- [ ] I have updated the license header for each file (see an example
below).

```python
# Copyright (C) 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
```

---------

Signed-off-by: Kim, Vinnam <vinnam.kim@intel.com>
Co-authored-by: Vinnam Kim <vinnam.kim@intel.com>
wonjuleee and vinnamkim authored Apr 17, 2023
1 parent 57ccba7 commit 26fba49
Showing 9 changed files with 952 additions and 457 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/publish_sdist_to_pypi.yml
@@ -43,7 +43,7 @@ jobs:
      uses: actions-ecosystem/action-regex-match@v2
      with:
        text: ${{ github.ref }}
-       regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+$'
+       regex: '^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$'
    - name: Publish package distributions to PyPI
      if: ${{ steps.check-tag.outputs.match != '' }}
      uses: pypa/gh-action-pypi-publish@v1.7.1
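
The only change here widens the tag filter so that release-candidate tags also trigger publishing. A quick sanity check of the new pattern (a local sketch, not part of the workflow):

```python
import re

# Updated pattern from the workflow; the optional '(rc[0-9]+)?' suffix admits
# release-candidate tags such as 'v1.2.0rc1'.
TAG_RE = re.compile(r"^refs/tags/v[0-9]+\.[0-9]+\.[0-9]+(rc[0-9]+)?$")

for ref in ("refs/tags/v1.2.0", "refs/tags/v1.2.0rc1", "refs/tags/v1.2.0-rc1"):
    print(ref, "->", bool(TAG_RE.match(ref)))
# refs/tags/v1.2.0     -> True
# refs/tags/v1.2.0rc1  -> True
# refs/tags/v1.2.0-rc1 -> False (hyphenated rc tags do not match)
```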
749 changes: 349 additions & 400 deletions datumaro/plugins/validators.py

Large diffs are not rendered by default.

@@ -1,22 +1,22 @@
-=============
+===============================
Level 3: Data Import and Export
-=============
+===============================

Datumaro is a tool that supports public data formats across a wide range of tasks such as
classification, detection, segmentation, pose estimation, and visual tracking.
To facilitate this, Datumaro supports data import and export via both the Python API and the CLI,
making it easier for users to work with various data formats.

Prepare dataset
-============
+===============

For the segmentation task, we introduce here the Cityscapes dataset, which collects road scenes from 50
different cities and contains 5K fine-grained pixel-level annotations and 20K coarse annotations.
A more detailed description is given :ref:`here <Cityscapes>`.
The Cityscapes dataset is available for free `download <https://www.cityscapes-dataset.com/downloads/>`_.

Convert data format
-============
+===================

Users sometimes need to compare, merge, or manage various kinds of public datasets in a unified
system. To achieve this, Datumaro not only has `import` and `export` functionalities, but also
@@ -59,32 +59,32 @@ We now convert the Cityscapes data into the MS-COCO format, which is described i

.. code-block:: bash

-   datum create -o <path/to/project>
+   datum project create -o <path/to/project>
We now import Cityscapes data into the project through

.. code-block:: bash

-   datum import --format cityscapes -p <path/to/project> <path/to/cityscapes>
+   datum project import --format cityscapes -p <path/to/project> <path/to/cityscapes>
(Optional) When we import data, the change is automatically committed in the project.
This can be shown through `log` as

.. code-block:: bash

-   datum log -p <path/to/project>
+   datum project log -p <path/to/project>
(Optional) We can check the imported dataset information, such as subsets, the number of
data items, or categories, through `info`.

.. code-block:: bash

-   datum info -p <path/to/project>
+   datum project info -p <path/to/project>
Finally, we export the data within the project in MS-COCO format as

.. code-block:: bash

-   datum export --format coco -p <path/to/project> -o <path/to/save> -- --save-media
+   datum project export --format coco -p <path/to/project> -o <path/to/save> -- --save-media
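
The same conversion can be scripted with Datumaro's Python API; a minimal sketch of the workflow above (paths are placeholders):

```python
from datumaro.components.dataset import Dataset

# Import the Cityscapes dataset (placeholder path).
dataset = Dataset.import_from("<path/to/cityscapes>", "cityscapes")

# Export in MS-COCO format, saving media files alongside the annotations.
dataset.export("<path/to/save>", "coco", save_media=True)
```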
For data with an unknown format, we can detect the format in the :ref:`next level <Level 4: Detect Data Format from an Unknown Dataset>`!
@@ -1,14 +1,14 @@
-=============
+===================================================
Level 4: Detect Data Format from an Unknown Dataset
-=============
+===================================================

Datumaro provides a function to detect the format of a dataset before importing data. This can be
useful in cases where information about the original format of the data has been lost or is unclear.
With this function, users can easily identify the format and proceed with appropriate data
handling processes.

Detect data format
-============
+==================

.. tabbed:: CLI

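
The collapsed tab body shows the CLI usage; for reference, the equivalent Python call — the same API the Level 8 example below uses (a sketch; the path is a placeholder):

```python
from datumaro.components.environment import Environment

# Detect the most likely format(s) of a dataset on disk.
env = Environment()
detected_formats = env.detect_dataset("/path/to/data")
print(detected_formats)  # e.g. ['coco_instances'] for an MS-COCO layout
```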

This file was deleted.

73 changes: 73 additions & 0 deletions docs/source/docs/level-up/intermediate_skills/08_data_validate.rst
@@ -0,0 +1,73 @@
===========================
Level 8: Dataset Validation
===========================


When creating a dataset, it is natural for imbalances to occur between categories, and sometimes
there may be very few data points for a minority class. In addition, inconsistent annotations may
be produced by annotators or over time. When training a model with such data, extra care is
needed, and it may be necessary to filter or correct the data in advance. Datumaro provides data
validation functionality for this purpose.

More detailed descriptions of validation errors and warnings are given :ref:`here <Validate>`.
A Python example of validator usage is provided `here <https://github.com/openvinotoolkit/datumaro/blob/develop/notebooks/11_validate.ipynb>`_.


.. tab-set::

    .. tab-item:: Python

        .. code-block:: python

            from datumaro.components.dataset import Dataset
            from datumaro.components.environment import Environment
            from datumaro.plugins.validators import DetectionValidator

            data_path = '/path/to/data'

            env = Environment()
            detected_formats = env.detect_dataset(data_path)
            dataset = Dataset.import_from(data_path, detected_formats[0])

            # or ClassificationValidator or SegmentationValidator
            validator = DetectionValidator()
            reports = validator.validate(dataset)

    .. tab-item:: ProjectCLI

        With the project-based CLI, we first need to create a project by

        .. code-block:: bash

            datum project create -o <path/to/project>

        We now import MS-COCO validation data into the project through

        .. code-block:: bash

            datum project import --format coco_instances -p <path/to/project> <path/to/coco>

        (Optional) When we import data, the change is automatically committed in the project.
        This can be shown through `log` as

        .. code-block:: bash

            datum project log -p <path/to/project>

        (Optional) We can check the imported dataset information, such as subsets, the number of
        data items, or categories, through `dinfo`.

        .. code-block:: bash

            datum project dinfo -p <path/to/project>

        Finally, we validate the data within the project as

        .. code-block:: bash

            datum validate --task-type <classification/detection/segmentation> --subset <subset_name> -p <path/to/project>

        We now have the validation report named `validation-report-<subset_name>.json`.
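
To consume the report programmatically, one can load the JSON file and tally the anomalies; a minimal sketch, assuming the report carries a top-level `validation_reports` list whose entries have an `anomaly_type` field (check the actual schema of your report):

```python
import json
from collections import Counter

# File name follows the convention above; 'default' is a placeholder subset name.
with open("validation-report-default.json") as f:
    report = json.load(f)

# Tally anomalies by type (the 'validation_reports'/'anomaly_type' keys are assumptions).
counts = Counter(entry.get("anomaly_type", "unknown")
                 for entry in report.get("validation_reports", []))
for anomaly, n in counts.most_common():
    print(f"{anomaly}: {n}")
```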
6 changes: 3 additions & 3 deletions docs/source/docs/level-up/intermediate_skills/index.rst
@@ -31,12 +31,12 @@ Intermediate Skills

---

-.. link-button:: 08_data_refinement
+.. link-button:: 08_data_validate
    :type: ref
-   :text: Level 08: Dataset Refinement
+   :text: Level 08: Dataset Validate
    :classes: btn-outline-primary btn-block stretched-link

-:badge:`CLI,badge-info`
+:badge:`ProjectCLI,badge-primary`
 :badge:`Python,badge-warning`

---
500 changes: 500 additions & 0 deletions notebooks/11_validate.ipynb

Large diffs are not rendered by default.

27 changes: 14 additions & 13 deletions tests/unit/test_validator.py
Expand Up @@ -405,10 +405,13 @@ def test_check_missing_attribute(self):

    @mark_requirement(Requirements.DATUM_GENERAL_REQ)
    def test_check_undefined_label(self):
-       label_name = "unittest"
-       label_stats = {"items_with_undefined_label": [(1, "unittest")]}
+       label_name = "cat0"
+       item_id = 1
+       item_subset = "unittest"
+       label_stats = {label_name: {"items_with_undefined_label": [(item_id, item_subset)]}}
+       stats = {"label_distribution": {"undefined_labels": label_stats}}

-       actual_reports = self.validator._check_undefined_label(label_name, label_stats)
+       actual_reports = self.validator._check_undefined_label(stats)

        self.assertTrue(len(actual_reports) == 1)
        self.assertIsInstance(actual_reports[0], UndefinedLabel)
@@ -455,14 +458,12 @@ def test_check_only_one_label(self):
        self.assertIsInstance(actual_reports[0], OnlyOneLabel)

    @mark_requirement(Requirements.DATUM_GENERAL_REQ)
-   def test_check_only_one_attribute_value(self):
+   def test_check_only_one_attribute(self):
        label_name = "unit"
        attr_name = "test"
        attr_dets = {"distribution": {"mock": 1}}

-       actual_reports = self.validator._check_only_one_attribute_value(
-           label_name, attr_name, attr_dets
-       )
+       actual_reports = self.validator._check_only_one_attribute(label_name, attr_name, attr_dets)

        self.assertTrue(len(actual_reports) == 1)
        self.assertIsInstance(actual_reports[0], OnlyOneAttributeValue)
@@ -897,7 +898,7 @@ def test_validate_annotations_detection(self):
        self.assertEqual(actual_stats["items_with_negative_length"], {})
        self.assertEqual(actual_stats["items_with_invalid_value"], {})

-       bbox_dist_by_label = actual_stats["bbox_distribution_in_label"]
+       bbox_dist_by_label = actual_stats["point_distribution_in_label"]
        label_prop_stats = bbox_dist_by_label["label_1"]["width"]
        self.assertEqual(label_prop_stats["items_far_from_mean"], {})
        self.assertEqual(label_prop_stats["mean"], 3.5)
@@ -906,7 +907,7 @@
        self.assertEqual(label_prop_stats["max"], 4.0)
        self.assertEqual(label_prop_stats["median"], 3.5)

-       bbox_dist_by_attr = actual_stats["bbox_distribution_in_attribute"]
+       bbox_dist_by_attr = actual_stats["point_distribution_in_attribute"]
        attr_prop_stats = bbox_dist_by_attr["label_0"]["a"]["1"]["width"]
        self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
        self.assertEqual(attr_prop_stats["mean"], 2.0)
@@ -915,7 +916,7 @@
        self.assertEqual(attr_prop_stats["max"], 3.0)
        self.assertEqual(attr_prop_stats["median"], 2.0)

-       bbox_dist_item = actual_stats["bbox_distribution_in_dataset_item"]
+       bbox_dist_item = actual_stats["point_distribution_in_dataset_item"]
        self.assertEqual(sum(bbox_dist_item.values()), 8)

        with self.subTest("Test of validation reports", i=1):
@@ -948,7 +949,7 @@ def test_validate_annotations_segmentation(self):
        self.assertEqual(len(actual_stats["items_missing_annotation"]), 1)
        self.assertEqual(actual_stats["items_with_invalid_value"], {})

-       mask_dist_by_label = actual_stats["mask_distribution_in_label"]
+       mask_dist_by_label = actual_stats["point_distribution_in_label"]
        label_prop_stats = mask_dist_by_label["label_1"]["area"]
        self.assertEqual(label_prop_stats["items_far_from_mean"], {})
        areas = [12, 4, 8]
@@ -958,7 +959,7 @@
        self.assertEqual(label_prop_stats["max"], np.max(areas))
        self.assertEqual(label_prop_stats["median"], np.median(areas))

-       mask_dist_by_attr = actual_stats["mask_distribution_in_attribute"]
+       mask_dist_by_attr = actual_stats["point_distribution_in_attribute"]
        attr_prop_stats = mask_dist_by_attr["label_0"]["a"]["1"]["area"]
        areas = [12, 4]
        self.assertEqual(attr_prop_stats["items_far_from_mean"], {})
@@ -968,7 +969,7 @@
        self.assertEqual(attr_prop_stats["max"], np.max(areas))
        self.assertEqual(attr_prop_stats["median"], np.median(areas))

-       mask_dist_item = actual_stats["mask_distribution_in_dataset_item"]
+       mask_dist_item = actual_stats["point_distribution_in_dataset_item"]
        self.assertEqual(sum(mask_dist_item.values()), 9)

        with self.subTest("Test of validation reports", i=1):
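
A pattern visible in these test updates: the former task-specific statistics keys (`bbox_distribution_in_*` for detection, `mask_distribution_in_*` for segmentation) are unified under `point_distribution_in_*`. A hedged sketch of reading them from a validation result — the `statistics` entry of the result dict is an assumption based on the validator's report layout, so adjust to the actual schema:

```python
from datumaro.components.dataset import Dataset
from datumaro.plugins.validators import DetectionValidator

dataset = Dataset.import_from("/path/to/data", "coco_instances")  # placeholder path
result = DetectionValidator().validate(dataset)

# Assumed result layout: a dict with a 'statistics' entry (see the validate notebook).
stats = result["statistics"]
by_label = stats["point_distribution_in_label"]         # was bbox_/mask_distribution_in_label
per_item = stats["point_distribution_in_dataset_item"]  # was bbox_/mask_distribution_in_dataset_item

print("total annotations:", sum(per_item.values()))
print("labels with property stats:", list(by_label))
```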
