Update new_dataset_to_armory.md
Signed-off-by: Etienne Deprit <etienne.deprit@twosixtech.com>
deprit authored Nov 19, 2024
1 parent 75b32ee commit 58392d1

In this guide, we show two examples of how to add a dataset to armory-library.

## Torchvision

For a Torchvision-style dataset, we load the images with the Hugging Face `imagefolder` dataset builder, which automatically infers the class labels from the directory names.
```python
# Assumes `sample_dir` points at the root of the extracted MSTAR archive
data_dir = sample_dir / Path("png_images", "qpm", "real")
raw_dataset = datasets.load_dataset('imagefolder', data_dir=data_dir)
```
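The `imagefolder` builder assigns each image the name of its parent directory as its class label. A minimal stdlib sketch of that inference (the directory names here are hypothetical, not the MSTAR layout):

```python
import tempfile
from pathlib import Path

# Build a toy imagefolder-style layout: one subdirectory per class.
root = Path(tempfile.mkdtemp())
for cls in ("tank", "truck"):  # hypothetical class names
    class_dir = root / cls
    class_dir.mkdir()
    (class_dir / "sample_0.png").touch()

# The label `imagefolder` assigns is simply the parent directory name.
examples = sorted((str(p), p.parent.name) for p in root.rglob("*.png"))
for path, label in examples:
    print(label)
```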

Next, we define stratified train, validation, and test splits (70%, 10%, and 20% of the data, respectively).
```python
train_dataset = raw_dataset['train'].train_test_split(
    test_size=3/10,
    stratify_by_column='label'
)

test_dataset = train_dataset['test'].train_test_split(
    test_size=2/3,
    stratify_by_column='label'
)

mstar_dataset = datasets.DatasetDict(
    {
        'train': train_dataset['train'],
        'valid': test_dataset['train'],
        'test': test_dataset['test']
    }
)
```
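The two-stage split works out to 70/10/20 proportions; a quick stdlib check of the arithmetic on toy indices (no `datasets` required):

```python
# Toy check of the split arithmetic: 100 examples end up 70/10/20.
n = 100
indices = list(range(n))

# Stage 1: hold out 3/10 of the data.
n_held_out = n * 3 // 10              # 30
train = indices[:n - n_held_out]      # 70 examples
held_out = indices[n - n_held_out:]

# Stage 2: 2/3 of the held-out portion becomes test, the rest validation.
n_test = n_held_out * 2 // 3              # 20
valid = held_out[:n_held_out - n_test]    # 10 examples
test = held_out[n_held_out - n_test:]     # 20 examples

print(len(train), len(valid), len(test))
```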

Finally, we integrate the dataset into Armory by wrapping it in an image-classification dataloader.
```python
batch_size = 16
shuffle = False

unnormalized_scale = armory.data.Scale(
    dtype=armory.data.DataType.UINT8,
    max=255,
)

mstar_dataloader = armory.dataset.ImageClassificationDataLoader(
    mstar_dataset['train'],
    dim=armory.data.ImageDimensions.CHW,
    scale=unnormalized_scale,
    image_key="image",
    label_key="label",
    batch_size=batch_size,
    shuffle=shuffle,
)

armory_dataset = armory.evaluation.Dataset(
    name="MSTAR-qpm-real",
    dataloader=mstar_dataloader,
)
```

## Hugging Face

To demonstrate adding a new Hugging Face dataset, we load the [VisDrone2019](https://github.com/VisDrone/VisDrone-Dataset) object detection dataset. Created by the AISKYEYE team at Tianjin University, China, VisDrone2019 comprises 288 video clips and 10,209 static images captured by various drone platforms, providing a comprehensive benchmark with over 2.6 million manually annotated bounding boxes for objects such as pedestrians and vehicles across diverse conditions and locations.

As a first step, we download the [validation split](https://drive.google.com/file/d/1bxK5zgLn0_L8x276eKkuYA_FzwCIjb59/view?usp=sharing) to a temporary directory. Note that we do not need to unzip the archive for processing as a Hugging Face dataset.
```python
tmp_dir = Path('/tmp')
visdrone_dir = tmp_dir / Path('visdrone_2019')
visdrone_dir.mkdir(exist_ok=True)

# Expected location of the manually downloaded archive
visdrone_val_zip = visdrone_dir / Path('VisDrone2019-DET-val.zip')
```
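Hugging Face's `DownloadManager.iter_archive` streams members straight out of the zip; the same no-extraction pattern can be sketched with the stdlib `zipfile` module (toy archive built in memory, with hypothetical member names standing in for the VisDrone layout):

```python
import io
import zipfile

# Build a small in-memory zip standing in for VisDrone2019-DET-val.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/0000001.jpg", b"\xff\xd8fake-jpeg-bytes")
    zf.writestr("annotations/0000001.txt", b"0,0,10,10,1,1,0,0\n")

# Iterate (name, bytes) pairs without unpacking anything to disk,
# analogous to DownloadManager().iter_archive(visdrone_val_zip).
buf.seek(0)
members = []
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            members.append((name, f.read()))

print([name for name, _ in members])
```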

The VisDrone2019 Task 1 dataset is organized as parallel folders of images and annotations, holding pairs of image and annotation files, respectively. We then designate the object categories and name the fields in the annotation files.
```python
CATEGORIES = [
    'ignored',
    # ...
]

ANNOTATION_FIELDS = [
    # ...
]
```
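Each annotation file is plain CSV with one object per line. A sketch of parsing a line the way `load_annotations` does with the stdlib `csv` module (the field names below are illustrative assumptions, since the full `ANNOTATION_FIELDS` list is elided in this excerpt):

```python
import csv
import io

# Assumed field names for illustration only.
ANNOTATION_FIELDS = [
    'bbox_left', 'bbox_top', 'bbox_width', 'bbox_height',
    'score', 'category', 'truncation', 'occlusion',
]

# One raw annotation line as it would be read from the archive.
raw = b"684,8,273,116,0,0,0,0\n"

reader = csv.DictReader(
    io.StringIO(raw.decode('utf-8')), fieldnames=ANNOTATION_FIELDS
)
# Each row becomes one object dictionary with integer fields.
objects = [{k: int(v) for k, v in row.items()} for row in reader]

print(objects[0]['bbox_left'])
```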

Next, we define the hierarchical features of the dataset by instantiating a [`datasets.Features`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Features) object; each feature is named and given a Hugging Face data type.
```python
features = datasets.Features(
    {
        # ...
    }
)
```

We additionally need to define two helper functions, `load_annotations` and `generate_examples`. The `load_annotations` function takes a reader for an annotation file, parses each object description into a dictionary, and returns the list of objects. The `generate_examples` generator function uses the supplied file reader to iterate over the images in the dataset archive; for each image, it reads the image file bytes and parses the associated annotation.

```python
def load_annotations(f: io.BufferedReader) -> List[Dict]:
    reader = csv.DictReader(io.StringIO(f.read().decode('utf-8')), fieldnames=ANNOTATION_FIELDS)
    # ...

def generate_examples(files: Iterator[Tuple[str, io.BufferedReader]], annotation...):
    # ...
    yield example
```
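The pairing that `generate_examples` performs can be sketched without any dependencies: walk the archive members, read each image's bytes, and attach the annotation parsed from the matching file (the function and key names here are illustrative, not Armory's or the dataset script's exact API):

```python
import io
from typing import Dict, Iterator, List, Tuple

def generate_examples_sketch(
    files: Iterator[Tuple[str, io.BufferedReader]],
    annotations: Dict[str, List[Dict]],
):
    # For each image member, read its bytes and attach the objects
    # parsed from the annotation file sharing the same stem.
    for idx, (path, reader) in enumerate(files):
        stem = path.rsplit('/', 1)[-1].rsplit('.', 1)[0]
        yield {
            'id': idx,
            'image': {'path': path, 'bytes': reader.read()},
            'objects': annotations.get(stem, []),
        }

# Usage with toy data:
files = iter([('images/0000001.jpg', io.BytesIO(b'fake-jpeg'))])
annotations = {'0000001': [{'category': 4}]}
examples = list(generate_examples_sketch(files, annotations))
print(examples[0]['objects'])
```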

We can now create the validation dataset by calling [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset.from_generator) with the generator above.
```python
visdrone_val_files = datasets.DownloadManager().iter_archive(visdrone_val_zip)

visdrone_dataset = datasets.Dataset.from_generator(
    # ...
)
```
