This file presents two examples of how to add new datasets into armory-library.

## Torchvision

The [SAMPLE (Synthetic and Measured Paired Labeled Experiment)](https://github.com/benjaminlewis-afrl/SAMPLE_dataset_public) dataset consists of measured SAR imagery from the MSTAR collection (Moving and Stationary Target Acquisition and Recognition) paired with synthetic SAR imagery.

The MSTAR dataset contains SAR imagery of 10 types of military vehicles illustrated in the figure below.

![MSTAR classes](./assets/MSTAR-classes.png)

[Anas, H., Majdoulayne, H., Chaimae, A., & Nabil, S. M. (2020). Deep learning for sar image classification. In Intelligent Systems and Applications: Proceedings of the 2019 Intelligent Systems Conference (IntelliSys) Volume 1 (pp. 890-898). Springer International Publishing.](https://link.springer.com/chapter/10.1007/978-3-030-29516-5_67)

The SAMPLE dataset is organized according to the `ImageFolder` pattern. The imagery is provided in two normalizations, decibel and quarter power magnitude (QPM).
For each normalization, the real and synthetic grayscale SAR imagery is partitioned into folders by vehicle type.
```
|-SAMPLE_dataset_public
| |-png_images
| | |-qpm
| | | |-real
| | | | |-m1
| | | | |-t72
| | | | |-btr70
| | | | |-m548
| | | | |-zsu23
| | | | |-bmp2
| | | | |-m35
| | | | |-m2
| | | | |-m60
| | | | |-2s1
```
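
Before building the dataset, the expected layout can be confirmed with a quick directory listing. This is a minimal check, assuming the SAMPLE repository has been cloned (or its PNG images copied) under `/tmp`:

```python
from pathlib import Path

data_dir = Path('/tmp') / 'SAMPLE_dataset_public' / 'png_images' / 'qpm' / 'real'

# each subdirectory holds one vehicle class; ImageFolder will use these names as labels
print(sorted(p.name for p in data_dir.iterdir() if p.is_dir()))
```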

For a Torchvision dataset, we load the dataset using the `ImageFolder` dataset class, which automatically infers
the class labels from the directory names. The `transform` parameter applies a chain of transformations
that resize, normalize, and output the images as numpy arrays.

```python
from pathlib import Path

import numpy as np
import torchvision as tv
from torchvision import transforms as T

tmp_dir = Path('/tmp')
sample_dir = tmp_dir / Path('SAMPLE_dataset_public')
data_dir = sample_dir / Path("png_images", "qpm", "real")

tv_dataset = tv.datasets.ImageFolder(
    root=data_dir,
    transform=T.Compose(
        [
            T.Resize(size=(224, 224)),
            T.ToTensor(),  # HWC -> CHW and scales pixel values to [0, 1]
            T.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
            T.Lambda(np.asarray),
        ]
    ),
)
```
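
As a quick sanity check (not part of the original walkthrough), an individual sample can be inspected directly; with the transform chain above, each image comes back as a float numpy array of shape `(3, 224, 224)` scaled to roughly [-1, 1]:

```python
image, label = tv_dataset[0]

print(tv_dataset.classes)               # class names inferred from the folder names
print(image.shape, image.dtype, label)  # e.g. (3, 224, 224), float32, and an integer label
```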

Next, we use scikit-learn's [`train_test_split`](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.train_test_split.html)
function to generate stratified train and test splits based on the dataset target classes.
```python
from sklearn.model_selection import train_test_split
from torch.utils.data import Subset

# generate stratified indices: we split integer indices rather than the data itself
train_indices, test_indices, _, _ = train_test_split(
    range(len(tv_dataset)),
    tv_dataset.targets,
    stratify=tv_dataset.targets,
    test_size=0.25,
)

# build the train and test subsets from the indices
train_split = Subset(tv_dataset, train_indices)
test_split = Subset(tv_dataset, test_indices)
```
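
To confirm that the split is stratified, the class frequencies of the two subsets can be compared; this quick check is not part of the original example:

```python
from collections import Counter

# class proportions should be approximately equal across the two splits
train_counts = Counter(tv_dataset.targets[i] for i in train_indices)
test_counts = Counter(tv_dataset.targets[i] for i in test_indices)
print(train_counts)
print(test_counts)
```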

Next, we wrap the training split into an armory-library dataset with the `TupleDataset` class.
```python
import armory.dataset

armory_dataset = armory.dataset.TupleDataset(train_split, ("image", "label"))
```
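
Assuming `TupleDataset` simply maps the tuple entries onto the given keys (an assumption here; consult the armory-library API for the exact return type), a single sample can be spot-checked:

```python
# hypothetical spot check; assumes each sample is returned as a key/value mapping
sample = armory_dataset[0]
print(sample["image"].shape, sample["label"])
```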

Finally, we use the tuple dataset above to define an `ImageClassificationDataLoader` and an evaluation dataset.
Note that the armory-library `normalized_scale` must match the normalization transform applied in the
Torchvision dataset's `transform` chain.
```python
import armory.data
import armory.evaluation

# must match the T.Normalize(mean, std) applied by the Torchvision dataset above
normalized_scale = armory.data.Scale(
    dtype=armory.data.DataType.FLOAT,
    max=1.0,
    mean=(0.5, 0.5, 0.5),
    std=(0.5, 0.5, 0.5),
)

batch_size = 16
shuffle = False

dataloader = armory.dataset.ImageClassificationDataLoader(
    armory_dataset,
    dim=armory.data.ImageDimensions.CHW,
    scale=normalized_scale,
    image_key="image",
    label_key="label",
    batch_size=batch_size,
    shuffle=shuffle,
)

evaluation_dataset = armory.evaluation.Dataset(
    name="MSTAR-qpm-real",
    dataloader=dataloader,
)
```
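
As a final sanity check, assuming `ImageClassificationDataLoader` follows the usual PyTorch `DataLoader` protocol (an assumption, not something stated above), the number of batches can be inspected:

```python
# with shuffle disabled, expect roughly len(train_split) / batch_size batches
print(f"{len(train_split)} samples in {len(dataloader)} batches of size {batch_size}")
```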

## Hugging Face

To demonstrate a new Hugging Face dataset, we load the [VisDrone2019 dataset](https://github.com/VisDrone/VisDrone-Dataset) object detection dataset.
The VisDrone2019 dataset, created by the AISKYEYE team at Tianjin University, China, includes 288 video clips and 10,209 images from various drones,
providing a comprehensive benchmark with over 2.6 million manually annotated bounding boxes for objects like pedestrians and vehicles across diverse
conditions and locations.

As a first step, we download the [validation split](https://drive.google.com/file/d/1bxK5zgLn0_L8x276eKkuYA_FzwCIjb59/view?usp=sharing) to a temporary directory.
Note that we do not need to unzip the archive for processing as a Hugging Face dataset.
```python
tmp_dir = Path('/tmp')
visdrone_dir = tmp_dir / Path('visdrone_2019')
visdrone_dir.mkdir(exist_ok=True)

visdrone_val_zip = visdrone_dir / Path('VisDrone2019-DET-val.zip')
```
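
The archive itself can be fetched in any convenient way; one option, shown here purely as an illustration, is the `gdown` package, which downloads Google Drive files by ID (the ID below is taken from the share link above):

```python
import gdown

# download the archive only if it is not already present
if not visdrone_val_zip.exists():
    gdown.download(id='1bxK5zgLn0_L8x276eKkuYA_FzwCIjb59', output=str(visdrone_val_zip))
```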
The VisDrone 2019 Task 1 dataset is organized as parallel folders of images and annotations, containing paired image and annotation files.
We then need to designate the object categories and name the fields in the annotation files.
```python
CATEGORIES = [
    'ignored',
    # ...
]
```
