Merge branch 'develop' into mergeback/1.9.1

openvinotoolkit · Sep 27, 2024 · ae2eda5
2 parents ec9f3ba + c4d7bb4
commit ae2eda5

Showing 36 changed files with 1,042 additions and 39 deletions.
diff --git a/.github/workflows/publish_to_pypi.yml b/.github/workflows/publish_to_pypi.yml
@@ -80,12 +80,12 @@ jobs:
         file_glob: true
     - name: Publish package distributions to PyPI
       if: ${{ steps.check-tag.outputs.match != '' }}
-      uses: pypa/gh-action-pypi-publish@v1.10.1
+      uses: pypa/gh-action-pypi-publish@v1.10.2
       with:
         password: ${{ secrets.PYPI_API_TOKEN }}
     - name: Publish package distributions to TestPyPI
       if: ${{ steps.check-tag.outputs.match == '' }}
-      uses: pypa/gh-action-pypi-publish@v1.10.1
+      uses: pypa/gh-action-pypi-publish@v1.10.2
       with:
         password: ${{ secrets.TESTPYPI_API_TOKEN }}
         repository-url: https://test.pypi.org/legacy/

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,9 +5,23 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## \[Q4 2024 Release 1.9.1\]
+## \[Unreleased\]
+
 ### New features
+- Support KITTI 3D format
+  (<https://github.com/openvinotoolkit/datumaro/pull/1619>)
+- Add PseudoLabeling transform for unlabeled dataset
+  (<https://github.com/openvinotoolkit/datumaro/pull/1594>)
+
+### Enhancements
+- Raise an appropriate error when exporting a datumaro dataset if its subset name contains path separators.
+  (<https://github.com/openvinotoolkit/datumaro/pull/1615>)
+- Update docs for transform plugins
+  (<https://github.com/openvinotoolkit/datumaro/pull/1599>)
+
+### Bug fixes
 
+## Q4 2024 Release 1.9.1
 ### Enhancements
 - Support multiple labels for kaggle format
   (<https://github.com/openvinotoolkit/datumaro/pull/1607>)
@@ -22,6 +36,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### New features
 - Add a new CLI command: datum format
   (<https://github.com/openvinotoolkit/datumaro/pull/1570>)
+- Add a new Cuboid2D annotation type
+  (<https://github.com/openvinotoolkit/datumaro/pull/1601>)
 - Support language dataset for DmTorchDataset
   (<https://github.com/openvinotoolkit/datumaro/pull/1592>)
 

diff --git a/docs/source/docs/command-reference/context_free/transform.md b/docs/source/docs/command-reference/context_free/transform.md
@@ -101,7 +101,10 @@ Basic dataset item manipulations:
 - [`remove_images`](#remove_images) - Removes specific images
 - [`remove_annotations`](#remove_annotations) - Removes annotations
 - [`remove_attributes`](#remove_attributes) - Removes attributes
-- [`astype_annotations`](#astype_annotations) - Convert annotation type
+- [`astype_annotations`](#astype_annotations) - Transforms annotation types
+- [`pseudo_labeling`](#pseudo_labeling) - Generates pseudo labels for unlabeled data
+- [`correct`](#correct) - Corrects annotation types
+- [`clean`](#clean) - Removes noisy data for tabular dataset
 
 Subset manipulations:
 - [`random_split`](#random_split) - Splits dataset into subsets
@@ -826,6 +829,35 @@ bbox_values_decrement [-h]
 Optional arguments:
 - `-h`, `--help` (flag) - Show this help message and exit
 
+#### `pseudo_labeling`
+
+Assigns pseudo-labels to items in a dataset based on their similarity to predefined labels. This transform is useful for semi-supervised learning when labels are missing or uncertain.
+
+The process includes:
+
+- Similarity Computation: Uses hashing techniques to compute the similarity between items and predefined labels.
+- Pseudo-Label Assignment: Assigns the most similar label as a pseudo-label to each item.
+
+Attributes:
+
+- `extractor` (IDataset) - Provides access to dataset items and their annotations.
+- `labels` (Optional[List[str]]) - List of predefined labels for pseudo-labeling. Defaults to all available labels if not provided.
+- `explorer` (Optional[Explorer]) - Computes hash keys for items and labels. If not provided, a new Explorer is created.
+
+Usage:
+```console
+pseudo_labeling [-h] [--labels LABELS]
+```
+
+Optional arguments:
+- `-h`, `--help` (flag) - Show this help message and exit
+- `--labels` (str) - Comma-separated list of label names for pseudo-labeling
+
+Examples:
+- Assign pseudo-labels based on predefined labels
+  ```console
+  datum transform -t pseudo_labeling -- --labels 'label1,label2'
+  ```
+
 #### `correct`
 
 Correct the dataset from a validation report
@@ -838,3 +870,27 @@ correct [-h] [-r REPORT_PATH]
 Optional arguments:
 - `-h`, `--help` (flag) - Show this help message and exit
 - `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)
+
+#### `clean`
+
+Refines and preprocesses media items in a dataset, focusing on string, numeric, and categorical data. This transform is designed to clean and improve the quality of the data, making it more suitable for analysis and modeling.
+
+The cleaning process includes:
+
+- String Data: Removes unnecessary characters using NLP techniques.
+- Numeric Data: Identifies and handles outliers and missing values.
+- Categorical Data: Cleans and refines categorical information.
+
+Usage:
+```console
+clean [-h]
+```
+
+Optional arguments:
+- `-h`, `--help` (flag) - Show this help message and exit
+
+Examples:
+- Clean and preprocess dataset items
+  ```console
+  datum transform -t clean
+  ```
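
The `pseudo_labeling` documentation above describes two steps: compute a hash-based similarity between items and predefined labels, then assign the most similar label. The following is a minimal sketch of that idea, not Datumaro's implementation. It assumes the hash keys are plain integers compared by Hamming distance; the names `hamming_distance` and `assign_pseudo_labels` and all hash values are hypothetical.

```python
from typing import Dict


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hash keys."""
    return bin(a ^ b).count("1")


def assign_pseudo_labels(
    item_hashes: Dict[str, int], label_hashes: Dict[str, int]
) -> Dict[str, str]:
    """Map each item id to the label whose hash key is nearest in Hamming distance."""
    return {
        item_id: min(
            label_hashes,
            key=lambda name: hamming_distance(item_hash, label_hashes[name]),
        )
        for item_id, item_hash in item_hashes.items()
    }


# Hypothetical 8-bit hash keys, made up for illustration only.
label_hashes = {"label1": 0b11110000, "label2": 0b00001111}
item_hashes = {"item_a": 0b11100000, "item_b": 0b00011111}

pseudo_labels = assign_pseudo_labels(item_hashes, label_hashes)
print(pseudo_labels)
```

With these made-up keys, each item differs from its nearest label by a single bit, so the assignment is unambiguous. Real hash keys would be much longer, and ties would need an explicit policy.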
diff --git a/docs/source/docs/data-formats/formats/datumaro.md b/docs/source/docs/data-formats/formats/datumaro.md
@@ -73,6 +73,8 @@ A Datumaro dataset directory should have the following structure:
         └── ...
 ```
 
+Note that the subset name shouldn't contain path separators.
+
 If your dataset is not following the above directory structure,
 it cannot detect and import your dataset as the Datumaro format properly.
 

diff --git a/docs/source/docs/data-formats/formats/datumaro_binary.md b/docs/source/docs/data-formats/formats/datumaro_binary.md
@@ -113,6 +113,8 @@ A DatumaroBinary dataset directory should have the following structure:
         └── ...
 ```
 
+Note that the subset name shouldn't contain path separators.
+
 If your dataset is not following the above directory structure,
 it cannot detect and import your dataset as the DatumaroBinary format properly.
 

diff --git a/docs/source/docs/release_notes.rst b/docs/source/docs/release_notes.rst
@@ -4,6 +4,18 @@ Release Notes
 .. toctree::
    :maxdepth: 1
 
+v1.9.1 (2024 Q3)
+----------------
+
+Enhancements
+^^^^^^^^^^^^
+- Support multiple labels for kaggle format
+- Use DataFrame.map instead of DataFrame.applymap
+
+Bug fixes
+^^^^^^^^^
+- Fix StreamDataset merging when importing in eager mode
+
 v1.9.0 (2024 Q3)
 ----------------
 

diff --git a/src/datumaro/components/annotation.py b/src/datumaro/components/annotation.py
@@ -50,6 +50,7 @@ class AnnotationType(IntEnum):
     feature_vector = 13
     tabular = 14
     rotated_bbox = 15
+    cuboid_2d = 16
 
 
 COORDINATE_ROUNDING_DIGITS = 2
@@ -1363,6 +1364,41 @@ def wrap(item, **kwargs):
         return attr.evolve(item, **d)
 
 
+@attrs(slots=True, init=False, order=False)
+class Cuboid2D(Annotation):
+    """
+    Cuboid2D annotation class. This class represents a 3D bounding box defined by its point coordinates
+    in the following way:
+    [(x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5), (x6, y6), (x7, y7), (x8, y8)].
+
+
+      6---7
+     /|  /|
+    5-+-8 |
+    | 2 + 3
+    |/  |/
+    1---4
+
+    Attributes:
+        _type (AnnotationType): The type of annotation, set to `AnnotationType.cuboid_2d`.
+
+    Methods:
+        __init__: Initializes the Cuboid2D with its coordinates.
+        wrap: Creates a new Cuboid2D instance with updated attributes.
+    """
+
+    _type = AnnotationType.cuboid_2d
+    points = field(default=None)
+    label: Optional[int] = field(
+        converter=attr.converters.optional(int), default=None, kw_only=True
+    )
+    z_order: int = field(default=0, validator=default_if_none(int), kw_only=True)
+
+    def __init__(self, _points: Iterable[Tuple[float, float]], *args, **kwargs):
+        kwargs.pop("points", None)  # comes from wrap()
+        self.__attrs_init__(points=_points, *args, **kwargs)
+
+
 @attrs(slots=True, order=False)
 class PointsCategories(Categories):
     """

diff --git a/src/datumaro/components/annotations/matcher.py b/src/datumaro/components/annotations/matcher.py
@@ -35,6 +35,7 @@
     "ImageAnnotationMatcher",
     "HashKeyMatcher",
     "FeatureVectorMatcher",
+    "Cuboid2DMatcher",
 ]
 
 
@@ -378,3 +379,8 @@ def distance(self, a, b):
         b = Points([p for pt in b.as_polygon() for p in pt])
 
         return OKS(a, b, sigma=self.sigma)
+
+
+@attrs
+class Cuboid2DMatcher(ShapeMatcher):
+    pass
diff --git a/src/datumaro/components/annotations/merger.py b/src/datumaro/components/annotations/merger.py
@@ -12,6 +12,7 @@
     AnnotationMatcher,
     BboxMatcher,
     CaptionsMatcher,
+    Cuboid2DMatcher,
     Cuboid3dMatcher,
     FeatureVectorMatcher,
     HashKeyMatcher,
@@ -210,3 +211,8 @@ class TabularMerger(AnnotationMerger, TabularMatcher):
 @attrs
 class RotatedBboxMerger(_ShapeMerger, RotatedBboxMatcher):
     pass
+
+
+@attrs
+class Cuboid2DMerger(_ShapeMerger, Cuboid2DMatcher):
+    pass
diff --git a/src/datumaro/components/errors.py b/src/datumaro/components/errors.py
@@ -342,6 +342,16 @@ def __str__(self):
         return f"Item {self.item_id} is repeated in the source sequence."
 
 
+@define(auto_exc=False)
+class PathSeparatorInSubsetNameError(DatasetError):
+    subset: str = field()
+
+    def __str__(self):
+        return (
+            f"Failed to export the subset '{self.subset}': subset name contains path separator(s)."
+        )
+
+
 class DatasetQualityError(DatasetError):
     pass
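
The diff above adds `PathSeparatorInSubsetNameError` but does not show the check that raises it. The sketch below shows one way such a check might look. It assumes both `/` and `\` count as separators regardless of platform; the helper names are hypothetical, and a plain `ValueError` stands in for the Datumaro error class so the snippet runs on its own.

```python
import os


def has_path_separator(subset: str) -> bool:
    """Return True if the subset name contains any path separator."""
    # Assumption: treat both POSIX and Windows separators as invalid everywhere,
    # so a dataset exported on one platform stays importable on another.
    separators = {"/", "\\", os.sep}
    if os.altsep:
        separators.add(os.altsep)
    return any(sep in subset for sep in separators)


def validate_subset_name(subset: str) -> str:
    """Raise if the subset name cannot be used as a single path component."""
    if has_path_separator(subset):
        # Stand-in for PathSeparatorInSubsetNameError(subset).
        raise ValueError(
            f"Failed to export the subset '{subset}': "
            "subset name contains path separator(s)."
        )
    return subset


print(validate_subset_name("train"))
```

Calling `validate_subset_name("train/extra")` raises the error instead of silently creating a nested directory on export.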
 

diff --git a/src/datumaro/components/merge/intersect_merge.py b/src/datumaro/components/merge/intersect_merge.py
@@ -19,6 +19,7 @@
     AnnotationMerger,
     BboxMerger,
     CaptionsMerger,
+    Cuboid2DMerger,
     Cuboid3dMerger,
     EllipseMerger,
     FeatureVectorMerger,
@@ -455,6 +456,8 @@ def _for_type(t, **kwargs):
                 return _make(TabularMerger, **kwargs)
             elif t is AnnotationType.rotated_bbox:
                 return _make(RotatedBboxMerger, **kwargs)
+            elif t is AnnotationType.cuboid_2d:
+                return _make(Cuboid2DMerger, **kwargs)
             else:
                 raise NotImplementedError("Type %s is not supported" % t)
 

diff --git a/src/datumaro/components/visualizer.py b/src/datumaro/components/visualizer.py
@@ -19,6 +19,7 @@
     AnnotationType,
     Bbox,
     Caption,
+    Cuboid2D,
     Cuboid3d,
     DepthAnnotation,
     Ellipse,
@@ -661,6 +662,39 @@ def _draw_cuboid_3d(
     ) -> None:
         raise NotImplementedError(f"{ann.type} is not implemented yet.")
 
+    def _draw_cuboid_2d(
+        self,
+        ann: Cuboid2D,
+        label_categories: Optional[LabelCategories],
+        fig: Figure,
+        ax: Axes,
+        context: List,
+    ) -> None:
+        import matplotlib.patches as patches
+
+        points = ann.points
+        color = self._get_color(ann)
+        label_text = label_categories[ann.label].name if label_categories is not None else ann.label
+
+        # Define the faces based on vertex indices
+
+        faces = [
+            [points[i] for i in [0, 1, 2, 3]],  # Bottom face
+            [points[i] for i in [4, 5, 6, 7]],  # Top face
+            [points[i] for i in [0, 1, 5, 4]],  # Left face
+            [points[i] for i in [1, 2, 6, 5]],  # Back face
+            [points[i] for i in [2, 3, 7, 6]],  # Right face
+            [points[i] for i in [3, 0, 4, 7]],  # Front face
+        ]
+        ax.text(points[0][0], points[0][1] - self.text_y_offset, label_text, color=color)
+
+        # Draw each face
+        for face in faces:
+            polygon = patches.Polygon(
+                face, fill=False, linewidth=self.bbox_linewidth, edgecolor=color
+            )
+            ax.add_patch(polygon)
+
     def _draw_super_resolution_annotation(
         self,
         ann: SuperResolutionAnnotation,

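
As a sanity check on the face table in `_draw_cuboid_2d`, the sketch below copies its six index lists and counts the distinct edges they cover. It is a standalone illustration with no Datumaro or matplotlib dependency; `FACE_INDICES`, `face_polygons`, `unique_edges`, and the sample points are hypothetical names and values.

```python
from typing import List, Sequence, Set, Tuple

Point = Tuple[float, float]

# 0-based vertex indices of the six faces, copied from _draw_cuboid_2d.
FACE_INDICES = [
    (0, 1, 2, 3),
    (4, 5, 6, 7),
    (0, 1, 5, 4),
    (1, 2, 6, 5),
    (2, 3, 7, 6),
    (3, 0, 4, 7),
]


def face_polygons(points: Sequence[Point]) -> List[List[Point]]:
    """Return the six quadrilaterals of a cuboid given its eight projected points."""
    if len(points) != 8:
        raise ValueError("A cuboid needs exactly 8 points")
    return [[points[i] for i in face] for face in FACE_INDICES]


def unique_edges() -> Set[Tuple[int, int]]:
    """Collect each edge once as a sorted pair of vertex indices."""
    edges = set()
    for face in FACE_INDICES:
        for k in range(4):
            a, b = face[k], face[(k + 1) % 4]
            edges.add((min(a, b), max(a, b)))
    return edges


# Made-up image-plane coordinates for the eight vertices.
sample_points = [(0, 10), (3, 8), (13, 8), (10, 10), (0, 0), (3, -2), (13, -2), (10, 0)]

faces = face_polygons(sample_points)
edges = unique_edges()
print(len(faces), len(edges))
```

The six faces cover exactly 12 distinct edges, which is what a cuboid should have. Since 6 faces times 4 sides gives 24, each edge is drawn twice when every face is added as a separate polygon patch; that is harmless for unfilled outlines.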
diff --git a/src/datumaro/plugins/data_formats/datumaro/base.py b/src/datumaro/plugins/data_formats/datumaro/base.py
@@ -11,6 +11,7 @@
     AnnotationType,
     Bbox,
     Caption,
+    Cuboid2D,
     Cuboid3d,
     Ellipse,
     GroupType,
@@ -378,6 +379,18 @@ def _load_annotations(self, item: Dict):
 
                 elif ann_type == AnnotationType.hash_key:
                     continue
+                elif ann_type == AnnotationType.cuboid_2d:
+                    loaded.append(
+                        Cuboid2D(
+                            list(map(tuple, points)),
+                            label=label_id,
+                            id=ann_id,
+                            attributes=attributes,
+                            group=group,
+                            object_id=object_id,
+                            z_order=z_order,
+                        )
+                    )
                 else:
                     raise NotImplementedError()
             except Exception as e: