Moving Dataset Exctractors outside of main README #177

Merged
merged 4 commits into from
Aug 31, 2023
161 changes: 48 additions & 113 deletions README.md
Expand Up @@ -40,7 +40,7 @@ Non-exhaustive list of supported features.

> 📘 **Deep Dive into Data Profiling**
> Puzzled by some dataset challenges while using DataGradients? We've got you covered.
> Enrich your understanding with our **[🎓free online course](https://deci.ai/course/profiling-computer-vision-datasets-overview/?utm_campaign[…]=DG-PDF-report&utm_medium=DG-repo&utm_content=DG-Report-to-course)**. Dive into dataset profiling, confront its complexities, and harness the full potential of DataGradients.
> Enrich your understanding with this **[🎓free online course](https://deci.ai/course/profiling-computer-vision-datasets-overview/?utm_campaign[…]=DG-PDF-report&utm_medium=DG-repo&utm_content=DG-Report-to-course)**. Dive into dataset profiling, confront its complexities, and harness the full potential of DataGradients.


<div align="center">
Expand Down Expand Up @@ -69,10 +69,7 @@ Non-exhaustive list of supported features.
- [Dataset Analysis](#dataset-analysis)
- [Report](#report)
- [Feature Configuration](#feature-configuration)
- [Dataset Adapters](#dataset-adapters)
- [Image Adapter](#image-adapter)
- [Label Adapter](#label-adapter)
- [Example](#example)
- [Dataset Extractors](#dataset-extractors)
- [Pre-computed Dataset Analysis](#pre-computed-dataset-analysis)
- [License](#license)

Expand All @@ -91,23 +88,17 @@ pip install data-gradients
### Prerequisites

- **Dataset**: Includes a **Train** set and a **Validation** or a **Test** set.
- **Class Names**: A list of the unique categories present in your dataset.
- **Iterable**: A method to iterate over your Dataset providing images and labels. Can be any of the following:
  - PyTorch Dataloader
  - PyTorch Dataset
- **Dataset Iterable**: A method to iterate over your Dataset providing images and labels. Can be any of the following:
  - PyTorch **Dataloader**
  - PyTorch **Dataset**
  - Generator that yields image/label pairs
  - Any other iterable you use for model training/validation
- One of:
  - **Class Names**: A list of the unique categories present in your dataset.
  - **Number of classes**: Indicate how many unique classes are in your dataset. Ensure this number is greater than the highest class index (e.g., if your highest class index is 9, the number of classes should be at least 10).

Please ensure all the points above are checked before you proceed with **DataGradients**.
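
The rule for **Number of classes** can be checked with a few lines of plain Python. This is only a minimal sketch: `min_number_of_classes` is a hypothetical helper (it is not part of DataGradients), and it assumes class indices are non-negative integers starting at 0.

```python
from typing import Iterable


def min_number_of_classes(class_ids: Iterable[int]) -> int:
    """Return the smallest valid number of classes for the given class indices.

    Hypothetical helper, not part of DataGradients. Assumes indices start at 0.
    """
    ids = list(class_ids)
    if not ids:
        raise ValueError("class_ids is empty")
    if min(ids) < 0:
        raise ValueError("class indices must be non-negative")
    # The count must be strictly greater than the highest index.
    return max(ids) + 1


# Highest index is 9, so at least 10 classes are needed.
print(min_number_of_classes([0, 3, 9, 9, 1]))  # 10
```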

**Good to Know**: DataGradients will try to find out how the dataset returns images and labels.
- If something cannot be automatically determined, you will be asked to provide some extra information through a text input.
- In some extreme cases, the process will crash and invite you to implement a custom dataset adapter (see relevant section)

**Heads up**: We currently don't provide out-of-the-box dataset/dataloader implementation.
You can find multiple dataset implementations in [PyTorch](https://pytorch.org/vision/stable/datasets.html)
or [SuperGradients](https://docs.deci.ai/super-gradients/src/super_gradients/training/datasets/Dataset_Setup_Instructions.html).

**Example**
``` python
from torchvision.datasets import CocoDetection
Expand All @@ -117,17 +108,43 @@ val_data = CocoDetection(...)
class_names = ["person", "bicycle", "car", "motorcycle", ...]
```

> **Good to Know** - DataGradients will try to find out how the dataset returns images and labels.
> - If something cannot be automatically determined, you will be asked to provide some extra information through a text input.
> - In some extreme cases, the process will crash and invite you to implement a custom [dataset extractor](#dataset-extractors)

> **Heads up** - DataGradients provides a few out-of-the-box [dataset/dataloader](./documentation/datasets.md) implementations.
> You can find more dataset implementations in [PyTorch](https://pytorch.org/vision/stable/datasets.html)
> or [SuperGradients](https://docs.deci.ai/super-gradients/src/super_gradients/training/datasets/Dataset_Setup_Instructions.html).

### Dataset Analysis

## Dataset Analysis
You are now ready to go: choose the relevant analyzer for your task and run it over your datasets!

**Image Classification**
```python
from data_gradients.managers.classification_manager import ClassificationAnalysisManager

train_data = ... # Your dataset iterable (torch dataset/dataloader/...)
val_data = ... # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = ClassificationAnalysisManager(
report_title="Testing Data-Gradients Classification",
train_data=train_data,
val_data=val_data,
class_names=class_names,
)

analyzer.run()
```

**Object Detection**
```python
from data_gradients.managers.detection_manager import DetectionAnalysisManager

train_data = ...
val_data = ...
class_names = ...
train_data = ... # Your dataset iterable (torch dataset/dataloader/...)
val_data = ... # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = DetectionAnalysisManager(
report_title="Testing Data-Gradients Object Detection",
Expand All @@ -144,9 +161,9 @@ analyzer.run()
```python
from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager

train_data = ...
val_data = ...
class_names = ...
train_data = ... # Your dataset iterable (torch dataset/dataloader/...)
val_data = ... # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = SegmentationAnalysisManager(
report_title="Testing Data-Gradients Segmentation",
Expand All @@ -164,8 +181,8 @@ You can test the segmentation analysis tool in the following [example](https://g
which does not require you to download any additional data.


### Report
Once the analysis is done, the path to your pdf report will be printed.
## Report
Once the analysis is done, the path to your PDF report will be printed. Examples of such reports are listed under [pre-computed dataset analysis reports](#pre-computed-dataset-analysis).


## Feature Configuration
Expand All @@ -174,97 +191,15 @@ The feature configuration allows you to run the analysis on a subset of features
If you are interested in customizing this configuration, you can check out the [documentation](documentation/feature_configuration.md) on that topic.


## Dataset Adapters
Before implementing a Dataset Adapter, try running without it; in many cases DataGradients will support your dataset without any code.

Two types of Dataset Adapters are available: `images_extractor` and `labels_extractor`. These functions should be passed when instantiating the Analysis Manager.
## Dataset Extractors
**Ensuring Comprehensive Dataset Compatibility**

```python
from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager

train_data = ...
val_data = ...
DataGradients is adept at automatic dataset inference; however, certain specificities, such as nested annotation structures or unique annotation formats, may necessitate a tailored approach.

# Let's assume that in this case, train_data and val_data return data in this format:
# (image, {"masks", "bboxes"})
images_extractor = lambda data: data[0] # Extract the image
labels_extractor = lambda data: data[1]['masks'] # Extract the masks
To address this, DataGradients offers `extractors` tailored for enhancing compatibility with diverse dataset formats.

# In case of segmentation.
SegmentationAnalysisManager(
report_title="Test with Adapters",
train_data=train_data,
val_data=val_data,
images_extractor=images_extractor,
labels_extractor=labels_extractor,
)
For an in-depth understanding and implementation details, we encourage a thorough review of the [Dataset Extractors Documentation](./documentation/dataset_extractors.md).

# For Detection, just change the Manager and the label_extractor definition.
```

### Image Adapter
Image Adapter functions should respect the following:

`images_extractor(data: Any) -> torch.Tensor`

- `data` being the output of the dataset/dataloader that you provided.
- The function should return a Tensor representing your image(s). One of:
  - `(BS, C, H, W)`, `(BS, H, W, C)`, `(BS, H, W)` for batch
  - `(C, H, W)`, `(H, W, C)`, `(H, W)` for single image
  - With `C`: number of channels (3 for RGB)


### Label Adapter
Label Adapter functions should respect the following:

`labels_extractor(data: Any) -> torch.Tensor`

- `data` being the output of the dataset/dataloader that you provided.
- The function should return a Tensor representing your label(s):
  - For **Segmentation**, one of:
    - `(BS, C, H, W)`, `(BS, H, W, C)`, `(BS, H, W)` for batch
    - `(C, H, W)`, `(H, W, C)`, `(H, W)` for single image
    - `BS`: Batch Size
    - `C`: number of channels - 3 for RGB
    - `H`, `W`: Height and Width
  - For **Detection**, one of:
    - `(BS, N, 5)`, `(N, 6)` for batch
    - `(N, 5)` for single image
    - `BS`: Batch Size
    - `N`: Padding size
    - The last dimension should include your `class_id` and `bbox` - `class_id, x1, y1, x2, y2` for instance


### Example

Let's imagine that your dataset returns a couple of `(image, annotation)` with `annotation` as below:
``` python
annotation = [
    {"bbox_coordinates": [1.08, 187.69, 611.59, 285.84], "class_id": 51},
    {"bbox_coordinates": [5.02, 321.39, 234.33, 365.42], "class_id": 52},
    ...
]
```

Because this dataset includes a very custom type of `annotation`, you will need to implement your own custom `labels_extractor` as below:
``` python
from typing import Dict, List, Tuple

import PIL.Image
import torch

from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager


def labels_extractor(data: Tuple[PIL.Image.Image, List[Dict]]) -> torch.Tensor:
    # Each dataset item starts with (image, annotations); the image is not needed here.
    _image, annotations = data[:2]
    labels = []
    for annotation in annotations:
        class_id = annotation["class_id"]
        bbox = annotation["bbox_coordinates"]
        # One row per object: class_id followed by the four box coordinates.
        labels.append((class_id, *bbox))
    return torch.Tensor(labels)


SegmentationAnalysisManager(
    ...,
    labels_extractor=labels_extractor
)
```


## Pre-computed Dataset Analysis
114 changes: 114 additions & 0 deletions documentation/dataset_extractors.md
@@ -0,0 +1,114 @@
# Dataset Extractors in DataGradients

**If your dataset isn't plug-and-play with DataGradients, Dataset Extractors are here to help!**

## Table of Contents
1. [Introduction](#1-introduction)
2. [What are Dataset Extractors?](#2-what-are-dataset-extractors)
3. [When Do You Need Dataset Extractors?](#3-when-do-you-need-dataset-extractors)
4. [Implementing Dataset Extractors](#4-implementing-dataset-extractors)
5. [Extractor Structures](#5-extractor-structures)
    - [Image Extractor](#image-extractor)
    - [Label Extractor](#label-extractor)
6. [Practical Example](#6-practical-example)


## 1. Introduction
DataGradients aims to automatically recognize your dataset's structure and output format.
This includes variations in image channel order, bounding box format, and segmentation mask type.

However, unique datasets, especially with a nested data structure, may require Dataset Extractors for customized handling.


## 2. What are Dataset Extractors?
Dataset Extractors are user-defined functions that guide DataGradients in interpreting non-standard datasets.
The two primary extractors are:
- **`images_extractor`**: Responsible for extracting image data in a format that DataGradients can interpret.
- **`labels_extractor`**: Responsible for extracting label data in a format that DataGradients can interpret.


## 3. When Do You Need Dataset Extractors?
DataGradients is designed to automatically recognize standard dataset structures.
Yet, intricate or nested formats might be challenging for auto-inference.

For these unique datasets, Dataset Extractors ensure seamless interfacing with DataGradients.


## 4. Implementing Dataset Extractors
After determining the need for extractors, integrate them during the instantiation of the Analysis Manager.
For illustration:

```python
from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager

train_data = ...  # Your dataset iterable (torch dataset/dataloader/...)
val_data = ...    # Your dataset iterable (torch dataset/dataloader/...)

# Sample dataset returns: (image, {"masks", "bboxes"})
images_extractor = lambda data: data[0]  # Extract the image
labels_extractor = lambda data: data[1]['masks']  # Extract the masks

SegmentationAnalysisManager(
    report_title="Test with Extractors",
    train_data=train_data,
    val_data=val_data,
    images_extractor=images_extractor,
    labels_extractor=labels_extractor
)
```
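
Before passing extractors to a manager, it can help to call them on a single dataset item and inspect what they return. The sketch below rewrites the two lambdas as named functions and applies them to a stand-in item. The `sample` value is hypothetical: plain strings take the place of real image and mask tensors so the snippet has no dependencies.

```python
from typing import Any, Dict, Tuple

# Hypothetical item mimicking the (image, {"masks", "bboxes"}) layout.
# Strings stand in for real tensors to keep the sketch dependency-free.
sample: Tuple[Any, Dict[str, Any]] = ("<image>", {"masks": "<masks>", "bboxes": "<bboxes>"})


def images_extractor(data: Tuple[Any, Dict[str, Any]]) -> Any:
    """Return the image part of one dataset item."""
    return data[0]


def labels_extractor(data: Tuple[Any, Dict[str, Any]]) -> Any:
    """Return the segmentation masks, ignoring the bounding boxes."""
    return data[1]["masks"]


print(images_extractor(sample))  # <image>
print(labels_extractor(sample))  # <masks>
```

Named functions behave the same as the lambdas here; they are simply easier to document and to test in isolation.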

## 5. Extractor Structures

### Image Extractor
Function signature:
```python
images_extractor(data: Any) -> torch.Tensor
```
Output must be a tensor representing your image(s):
- Batched: `(BS, C, H, W)`, `(BS, H, W, C)`, `(BS, H, W)`
- Single Image: `(C, H, W)`, `(H, W, C)`, `(H, W)`
- Where:
  - `C`: Number of channels (e.g., 3 for RGB)
  - `BS`: Batch Size
  - `H`, `W`: Height and Width, respectively
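
All the accepted image layouts have between two and four dimensions, which gives a cheap sanity check for a custom `images_extractor`. The helper below is a hypothetical sketch (it is not part of DataGradients) and only tests a necessary condition: it cannot tell `(C, H, W)` apart from `(BS, H, W)`, for example.

```python
from typing import Sequence


def looks_like_image_shape(shape: Sequence[int]) -> bool:
    """Check that a shape has 2, 3 or 4 dimensions, all strictly positive.

    Hypothetical helper: a necessary condition only, not a full validation.
    """
    return len(shape) in (2, 3, 4) and all(dim > 0 for dim in shape)


print(looks_like_image_shape((3, 224, 224)))      # True  (C, H, W)
print(looks_like_image_shape((8, 224, 224, 3)))   # True  (BS, H, W, C)
print(looks_like_image_shape((224,)))             # False (too few dimensions)
```

With a real extractor, the shape could be obtained with something like `tuple(images_extractor(item).shape)`, assuming the extractor returns a `torch.Tensor`.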

### Label Extractor
Function signature:
```python
labels_extractor(data: Any) -> torch.Tensor
```
Depending on the task, the tensor format will differ:

- **Segmentation**:
  - Batched: `(BS, C, H, W)`, `(BS, H, W, C)`, `(BS, H, W)`
  - Single Image: `(C, H, W)`, `(H, W, C)`, `(H, W)`
- **Detection**:
  - Batched: `(BS, N, 5)`, `(N, 6)`
  - Single Image: `(N, 5)`
  - Last dimension details: `class_id, x1, y1, x2, y2`
- Where:
  - `C`: Number of channels (e.g., 3 for RGB)
  - `BS`: Batch Size
  - `N`: Padding size
  - `H`, `W`: Height and Width, respectively
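
For detection, some datasets keep class ids and boxes in separate fields. The sketch below merges them into `class_id, x1, y1, x2, y2` rows, which is the single-image `(N, 5)` layout. The `{"labels": ..., "boxes": ...}` target layout and the `to_detection_rows` helper are assumptions made for illustration; they are not part of DataGradients.

```python
from typing import Dict, List, Sequence


def to_detection_rows(target: Dict[str, Sequence]) -> List[List[float]]:
    """Merge class ids and boxes into rows of: class_id, x1, y1, x2, y2.

    Assumes a hypothetical layout {"labels": [...], "boxes": [[x1, y1, x2, y2], ...]}.
    """
    labels, boxes = target["labels"], target["boxes"]
    if len(labels) != len(boxes):
        raise ValueError("labels and boxes must have the same length")
    return [[float(class_id), *map(float, box)] for class_id, box in zip(labels, boxes)]


target = {
    "labels": [51, 52],
    "boxes": [[1.08, 187.69, 611.59, 285.84], [5.02, 321.39, 234.33, 365.42]],
}
rows = to_detection_rows(target)
print(rows[0])  # [51.0, 1.08, 187.69, 611.59, 285.84]
```

Inside a `labels_extractor`, wrapping the result as `torch.tensor(rows)` would give a tensor of shape `(N, 5)` for a non-empty list of objects.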

## 6. Practical Example
For a dataset returning a tuple `(image, annotation)` where `annotation` is structured as follows:

```python
annotation = [
    {"bbox_coordinates": [1.08, 187.69, 611.59, 285.84], "class_id": 51},
    ...
]
```

A suitable `labels_extractor` would be:

```python
import torch

def labels_extractor(data) -> torch.Tensor:
    _, annotations = data  # annotations = [{"bbox_coordinates": [1.08, 187.69, 611.59, 285.84], "class_id": 51}, ...]
    labels = []
    for annotation in annotations:
        class_id = annotation["class_id"]
        bbox = annotation["bbox_coordinates"]
        # One row per object: class_id followed by the four box coordinates.
        labels.append((class_id, *bbox))
    return torch.Tensor(labels)  # Tensor of shape (N, 5): [[51, 1.08, 187.69, 611.59, 285.84], ...]
```