Datumaro

Concept

Datumaro is:

a tool to build composite datasets and iterate over them
a tool to create and maintain datasets
- Version control of annotations and images
- Publication (with removal of sensitive information)
- Editing
- Joining and splitting
- Exporting, format changing
- Image preprocessing
a dataset storage
a tool to debug datasets
- A network can be used to generate informative data subsets (e.g. items with false positives) for further analysis
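
To make "build composite datasets and iterate over them" more concrete, here is a minimal sketch of one possible shape for such an iterator. The `CompositeDataset` class and the `(id, labels)` item layout are hypothetical names chosen for illustration; they are not taken from Datumaro.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical item layout: (item id, list of labels). Not Datumaro's data model.
Item = Tuple[str, List[str]]


class CompositeDataset:
    """Iterates over several named sources as if they were one dataset."""

    def __init__(self, sources: Dict[str, Iterable[Item]]):
        self._sources = sources

    def __iter__(self) -> Iterator[Tuple[str, Item]]:
        # Yield (source name, item) pairs so the origin of each item is kept.
        for name, source in self._sources.items():
            for item in source:
                yield name, item


dataset = CompositeDataset({
    "source_a": [("1", ["cat"])],
    "source_b": [("2", ["dog"]), ("3", [])],
})
for source_name, (item_id, labels) in dataset:
    print(source_name, item_id, labels)
```

Keeping the source name next to each item is one way to support later joining, splitting and per-source debugging without copying any data.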

Requirements

User interfaces
- a library
- a console tool with visualization capabilities
Targets: single datasets, composite datasets, single images / videos
Built-in support for well-known annotation formats and datasets: CVAT, COCO, PASCAL VOC, Cityscapes, ImageNet
Extensibility with user-provided components
Lightweightness - it should be easy to start working with Datumaro
- Minimal dependency on environment and configuration
- Using Datumaro should be easier than writing one's own code to compute statistics or manipulate datasets

Functionality and ideas

Blur sensitive areas on dataset images
Dataset annotation filters, relabelling etc.
Dataset augmentation
Calculation of statistics:
- Mean & std, custom stats
"Edit" command to modify annotations
Versioning (for images, annotations, subsets, sources etc., comparison)
Documentation generation
Provision of iterators for user code
Dataset building (export in a specific format, indexing, statistics, documentation)
Dataset exporting to other formats
Dataset debugging (run inference, generate dataset slices, compute statistics)
"Explainable AI" - highlight network attention areas (paper)
- Black-box approach
  - Classification, Detection, Segmentation, Captioning
- White-box approach
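
As an illustration of the "Mean & std" statistic listed above, the following sketch computes both values in a single pass using Welford's online algorithm. The function name `pixel_mean_std` and the flat-sequence image representation are assumptions made for brevity; this is not how Datumaro itself is known to compute them.

```python
import math
from typing import Iterable, Sequence, Tuple


def pixel_mean_std(images: Iterable[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean, std) over all pixel values in a stream of images.

    Each image is a flat sequence of pixel intensities (an assumption made
    for brevity; real images would usually be per-channel arrays).
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for image in images:
        for value in image:
            # Welford's update: one pass, the dataset never has to fit in memory.
            count += 1
            delta = value - mean
            mean += delta / count
            m2 += delta * (value - mean)
    if count == 0:
        raise ValueError("no pixels to compute statistics from")
    return mean, math.sqrt(m2 / count)  # population standard deviation


mean, std = pixel_mean_std([[0, 2], [4, 6]])
print(mean, std)
```

A streaming formulation fits the lightweightness requirement: images can be read one at a time from any source without loading the whole dataset.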

Research topics

Exploration of network prediction uncertainty (aka Bayesian approach)
- Use case: explanation of network "quality", "stability", "certainty"
Adversarial attacks on networks
Dataset minification / reduction
- Use case: removal of redundant information to reach the same network quality with less training time
Dataset expansion and filtration of additions
- Use case: add only important data
Guidance for key frame selection for tracking (paper)
- Use case: more effective annotation, better predictions

RC 1 vision

In the first version, Datumaro should be a project manager for CVAT and should consume data only from CVAT. The user can download the collected dataset and work on it with the Datumaro CLI.

        User
          |
          v
 +------------------+
 |       CVAT       |
 +--------v---------+       +------------------+       +--------------+
 | Datumaro module  | ----> | Datumaro project | <---> | Datumaro CLI | <--- User
 +------------------+       +------------------+       +--------------+

Interfaces

Python API for user code
- Installation as a package
A command-line tool for dataset manipulations

Features

Dataset format support (reading, writing)
- Own format
- CVAT
- COCO
- PASCAL VOC
- YOLO
- TF Detection API
- Cityscapes
- ImageNet
Dataset visualization (show)
- Ability to visualize a dataset
  - with TensorBoard
Calculation of statistics for datasets
- Pixel mean, std
- Object counts (detection scenario)
- Image-Class distribution (classification scenario)
- Pixel-Class distribution (segmentation scenario)
- Image clusters
- Custom statistics
Dataset building
- Composite dataset building
- Annotation remapping
- Subset splitting
- Dataset filtering (extract)
- Dataset merging (merge)
- Dataset item editing (edit)
Dataset comparison (diff)
- Annotation-annotation comparison
- Annotation-inference comparison
- Annotation quality estimation (for CVAT)
  - Provide a simple method to check annotation quality with a model and generate a summary
Dataset and model debugging
- Inference explanation (explain)
  - Black-box approach (RISE paper)
- Ability to run a model on a dataset and read the results
CVAT-integration features
- Task export
  - Datumaro project export
  - Dataset export
  - Original raw data (images, a video file) can be downloaded (exported) together with annotations, or the export can contain only links to the data on the CVAT server (S3 and other storage may be supported in the future)
    - Be able to use local files instead of remote links
      - Specify cache directory
- Use case "annotate for model training"
  - create a task
  - annotate
  - export the task
  - convert to a training format
  - train a DL model
- Use case "annotate - reannotate problematic images - merge"
- Use case "annotate and estimate quality"
  - create a task
  - annotate
  - estimate quality of annotations
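
The "Dataset comparison (diff)" feature above can be illustrated for the detection case. The sketch below greedily matches bounding boxes from two annotation sets by intersection over union (IoU); the same routine would apply to annotation-inference comparison. The names `iou` and `diff_boxes`, the box layout, and the 0.5 threshold are all assumptions for illustration, not Datumaro's implementation.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diff_boxes(first: List[Box], second: List[Box], threshold: float = 0.5):
    """Greedily match boxes from two annotation sets.

    Returns (matched index pairs, unmatched indices in `first`,
    unmatched indices in `second`).
    """
    matches = []
    unmatched_first = []
    unmatched_second = list(range(len(second)))
    for i, box in enumerate(first):
        # Pick the still-unmatched box in `second` with the largest overlap.
        best = max(unmatched_second, key=lambda j: iou(box, second[j]), default=None)
        if best is not None and iou(box, second[best]) >= threshold:
            matches.append((i, best))
            unmatched_second.remove(best)
        else:
            unmatched_first.append(i)
    return matches, unmatched_first, unmatched_second


annotated = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(diff_boxes(annotated, predicted))
```

Unmatched boxes on either side are the natural input for a quality summary: they correspond to missed objects and spurious annotations respectively. Greedy matching is the simplest choice; an optimal assignment would be a possible refinement.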

Optional features

Dataset publishing
- Versioning (for annotations, subsets, sources, etc.)
- Blur sensitive areas on images
- Tracking of legal information
- Documentation generation
Dataset building
- Dataset minification / Extraction of the most representative subset
  - Use case: generate low-precision calibration dataset
Dataset and model debugging
- Training visualization
- Inference explanation (explain)
  - White-box approach

Properties

Lightweightness
Modularity
Extensibility
