Update new_dataset_to_armory.md
Signed-off-by: Etienne Deprit <etienne.deprit@twosixtech.com>
deprit authored Nov 19, 2024
1 parent 75b32ee commit 58392d1

In this guide, we show two examples of how to add a dataset to armory-library.

## Torchvision

For a Torchvision-style dataset, we load the images with the Hugging Face `imagefolder` dataset builder, which automatically infers the class labels from the directory names.
```python
# Assumes `sample_dir` points at the root of the extracted MSTAR archive
data_dir = sample_dir / Path("png_images", "qpm", "real")
raw_dataset = datasets.load_dataset('imagefolder', data_dir=data_dir)
```
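The `imagefolder` builder assigns each image the name of its parent directory as its class label. A minimal stdlib sketch of that inference (the directory names here are hypothetical, not the MSTAR layout):

```python
import tempfile
from pathlib import Path

# Build a toy imagefolder-style layout: one subdirectory per class.
root = Path(tempfile.mkdtemp())
for cls in ("tank", "truck"):  # hypothetical class names
    class_dir = root / cls
    class_dir.mkdir()
    (class_dir / "sample_0.png").touch()

# The label `imagefolder` assigns is simply the parent directory name.
examples = sorted((str(p), p.parent.name) for p in root.rglob("*.png"))
for path, label in examples:
    print(label)
```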

Next, we define stratified train, validation, and test splits (70%, 10%, and 20% of the data, respectively).
```python
train_dataset = raw_dataset['train'].train_test_split(
    test_size=3/10,
    stratify_by_column='label'
)

test_dataset = train_dataset['test'].train_test_split(
    test_size=2/3,
    stratify_by_column='label'
)

mstar_dataset = datasets.DatasetDict(
    {
        'train': train_dataset['train'],
        'valid': test_dataset['train'],
        'test': test_dataset['test']
    }
)
```
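The two-stage split works out to 70/10/20 proportions; a quick stdlib check of the arithmetic on toy indices (no `datasets` required):

```python
# Toy check of the split arithmetic: 100 examples end up 70/10/20.
n = 100
indices = list(range(n))

# Stage 1: hold out 3/10 of the data.
n_held_out = n * 3 // 10              # 30
train = indices[:n - n_held_out]      # 70 examples
held_out = indices[n - n_held_out:]

# Stage 2: 2/3 of the held-out portion becomes test, the rest validation.
n_test = n_held_out * 2 // 3              # 20
valid = held_out[:n_held_out - n_test]    # 10 examples
test = held_out[n_held_out - n_test:]     # 20 examples

print(len(train), len(valid), len(test))
```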

Finally, we integrate the dataset into Armory by wrapping it in an image-classification dataloader.
```python
batch_size = 16
shuffle = False

unnormalized_scale = armory.data.Scale(
    dtype=armory.data.DataType.UINT8,
    max=255,
)

mstar_dataloader = armory.dataset.ImageClassificationDataLoader(
    mstar_dataset['train'],
    dim=armory.data.ImageDimensions.CHW,
    scale=unnormalized_scale,
    image_key="image",
    label_key="label",
    batch_size=batch_size,
    shuffle=shuffle,
)

armory_dataset = armory.evaluation.Dataset(
    name="MSTAR-qpm-real",
    dataloader=mstar_dataloader,
)
```

## Hugging Face

To demonstrate adding a new Hugging Face dataset, we load the [VisDrone2019](https://github.com/VisDrone/VisDrone-Dataset) object detection dataset. Created by the AISKYEYE team at Tianjin University, China, VisDrone2019 comprises 288 video clips and 10,209 static images captured by various drone platforms, providing a comprehensive benchmark with over 2.6 million manually annotated bounding boxes for objects such as pedestrians and vehicles across diverse conditions and locations.

As a first step, we download the [validation split](https://drive.google.com/file/d/1bxK5zgLn0_L8x276eKkuYA_FzwCIjb59/view?usp=sharing) to a temporary directory. Note that we do not need to unzip the archive for processing as a Hugging Face dataset.
```python
tmp_dir = Path('/tmp')
visdrone_dir = tmp_dir / Path('visdrone_2019')
visdrone_dir.mkdir(exist_ok=True)

# Expected location of the manually downloaded archive
visdrone_val_zip = visdrone_dir / Path('VisDrone2019-DET-val.zip')
```
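Hugging Face's `DownloadManager.iter_archive` streams members straight out of the zip; the same no-extraction pattern can be sketched with the stdlib `zipfile` module (toy archive built in memory, with hypothetical member names standing in for the VisDrone layout):

```python
import io
import zipfile

# Build a small in-memory zip standing in for VisDrone2019-DET-val.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/0000001.jpg", b"\xff\xd8fake-jpeg-bytes")
    zf.writestr("annotations/0000001.txt", b"0,0,10,10,1,1,0,0\n")

# Iterate (name, bytes) pairs without unpacking anything to disk,
# analogous to DownloadManager().iter_archive(visdrone_val_zip).
buf.seek(0)
members = []
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        with zf.open(name) as f:
            members.append((name, f.read()))

print([name for name, _ in members])
```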

The VisDrone2019 Task 1 dataset is organized as parallel folders of images and annotations, holding pairs of image and annotation files, respectively. We then designate the object categories and name the fields in the annotation files.
```python
CATEGORIES = [
    'ignored',
    # ...
]

ANNOTATION_FIELDS = [
    # ...
]
```
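Each annotation file is plain CSV with one object per line. A sketch of parsing a line the way `load_annotations` does with the stdlib `csv` module (the field names below are illustrative assumptions, since the full `ANNOTATION_FIELDS` list is elided in this excerpt):

```python
import csv
import io

# Assumed field names for illustration only.
ANNOTATION_FIELDS = [
    'bbox_left', 'bbox_top', 'bbox_width', 'bbox_height',
    'score', 'category', 'truncation', 'occlusion',
]

# One raw annotation line as it would be read from the archive.
raw = b"684,8,273,116,0,0,0,0\n"

reader = csv.DictReader(
    io.StringIO(raw.decode('utf-8')), fieldnames=ANNOTATION_FIELDS
)
# Each row becomes one object dictionary with integer fields.
objects = [{k: int(v) for k, v in row.items()} for row in reader]

print(objects[0]['bbox_left'])
```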

Next, we define the hierarchical features of the dataset by instantiating a [`datasets.Features`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Features) object; each feature is named and given a Hugging Face data type.
```python
features = datasets.Features(
    {
        # ...
    }
)
```

We additionally need to define two helper functions, `load_annotations` and `generate_examples`. The `load_annotations` function takes a reader for an annotation file, parses each object description into a dictionary, and returns the list of objects. The `generate_examples` generator function uses the supplied file reader to iterate over the images in the dataset archive; for each image, it reads the image file bytes and parses the associated annotation.

```python
def load_annotations(f: io.BufferedReader) -> List[Dict]:
    reader = csv.DictReader(io.StringIO(f.read().decode('utf-8')), fieldnames=ANNOTATION_FIELDS)
    # ...

def generate_examples(files: Iterator[Tuple[str, io.BufferedReader]], annotation...):
    # ...
    yield example
```
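The pairing that `generate_examples` performs can be sketched without any dependencies: walk the archive members, read each image's bytes, and attach the annotation parsed from the matching file (the function and key names here are illustrative, not Armory's or the dataset script's exact API):

```python
import io
from typing import Dict, Iterator, List, Tuple

def generate_examples_sketch(
    files: Iterator[Tuple[str, io.BufferedReader]],
    annotations: Dict[str, List[Dict]],
):
    # For each image member, read its bytes and attach the objects
    # parsed from the annotation file sharing the same stem.
    for idx, (path, reader) in enumerate(files):
        stem = path.rsplit('/', 1)[-1].rsplit('.', 1)[0]
        yield {
            'id': idx,
            'image': {'path': path, 'bytes': reader.read()},
            'objects': annotations.get(stem, []),
        }

# Usage with toy data:
files = iter([('images/0000001.jpg', io.BytesIO(b'fake-jpeg'))])
annotations = {'0000001': [{'category': 4}]}
examples = list(generate_examples_sketch(files, annotations))
print(examples[0]['objects'])
```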

We can now create the validation dataset by calling [`datasets.Dataset.from_generator`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Dataset.from_generator) with the generator above.
```python
visdrone_val_files = datasets.DownloadManager().iter_archive(visdrone_val_zip)

visdrone_dataset = datasets.Dataset.from_generator(
    # ...
)
```
