
Standardization of the datasets #1080

Open
@pmeier

Description


This is a discussion issue which was kicked off by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the consensus reached during the discussion.

Disclaimer: I have never worked with segmentation or detection datasets. If I make some wrong assumptions about them, feel free to correct me. Furthermore, please help me fill in the gaps.


Proposed Structure

This issue presents the idea to standardize the torchvision.datasets. This could be done by adding parameters to the VisionDataset (split) or by subclassing it and adding task-specific parameters (classes or class_to_idx) to the new classes. I imagine it something like this:

import torch.utils.data as data

class VisionDataset(data.Dataset):
    pass

class ClassificationDataset(VisionDataset):
    pass

class SegmentationDataset(VisionDataset):
    pass

class DetectionDataset(VisionDataset):
    pass

For our tests we could then have a generic_*_dataset_test as is already implemented for ClassificationDatasets.
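To illustrate, such a helper might look like the hypothetical sketch below. It assumes datasets index to (image, int) pairs as proposed further down; the function name only mirrors the generic_*_dataset_test pattern and is not the actual torchvision test code.

```python
def generic_classification_dataset_test(dataset, num_samples=1):
    # Check the proposed contract: indexing yields a two-element
    # (image, target) tuple with an int target.
    for i in range(num_samples):
        sample = dataset[i]
        assert isinstance(sample, tuple) and len(sample) == 2
        _, target = sample
        # a full test would also verify that the image is a PIL.Image
        assert isinstance(target, int)
```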


VisionDataset

  • As discussed in Standardisation of Dataset API split argument name #1067, we could unify the argument that selects different parts of the dataset. IMO split as a str is the most general, yet still clear, term for this. I would implement it as a positional argument in the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are the Fakedata and ImageFolder datasets, which will be discussed separately.

  • IMO every dataset should have a _download method in order to be useful for every user of this package. We could give the constructor a download=True keyword argument and call the download method from within it. As above, the Fakedata and ImageFolder datasets will be discussed below.
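Taken together, the two points above could be sketched as follows. This is a dependency-free sketch of the proposed constructor signature: a plain class stands in for torch.utils.data.Dataset, and the split values and _download hook are assumptions for illustration.

```python
class VisionDataset:  # stand-in for torch.utils.data.Dataset
    def __init__(self, root, split="train", download=False):
        # `split` selects the part of the dataset; the set of valid
        # values here is only illustrative.
        if split not in ("train", "test"):
            raise ValueError(f"unknown split {split!r}")
        self.root = root
        self.split = split
        if download:
            self._download()

    def _download(self):
        # Subclasses would fetch and extract their archives here.
        raise NotImplementedError
```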


Fakedata and ImageFolder

What makes these two datasets special is that there is nothing to download and they are not split in any way. IMO they are not special enough to exempt them from the generalised VisionDataset described above. I propose that we simply remove the split and download arguments from their constructors and raise an exception if someone calls the download method.
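A minimal sketch of that behaviour, with illustrative names and parameters (the actual constructor arguments of the fake-data dataset differ):

```python
class FakeData:
    def __init__(self, size=1000, image_size=(3, 224, 224)):
        # No split and no download keyword: the data is generated
        # on the fly, so neither concept applies.
        self.size = size
        self.image_size = image_size

    def download(self):
        # Per the proposal, calling download on such a dataset raises.
        raise RuntimeError(
            "FakeData is generated on the fly; there is nothing to download"
        )
```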

Furthermore, the Fakedata dataset is currently a ClassificationDataset. We should also create FakeSegmentationData and FakeDetectionData datasets.


ClassificationDataset

The following datasets belong to this category: CIFAR*, ImageNet, *MNIST, SVHN, LSUN, SEMEION, STL10, USPS, Caltech*

  • Each dataset should return a (PIL.Image, int) tuple when indexed
  • Each dataset should have a classes parameter, which is a tuple of all available classes in human-readable form
  • Currently, some datasets have a class_to_idx parameter, which is a dictionary that maps each human-readable class to the index used as its target. I propose to reverse the direction, i.e. provide an idx_to_class parameter, since IMO this is the far more common transformation.
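The last two points can be sketched concretely. The class names below are made up for illustration; the point is that idx_to_class follows directly from classes, and the old class_to_idx direction stays cheap to derive for anyone who still needs it.

```python
# Proposed metadata: a human-readable tuple of classes...
classes = ("airplane", "automobile", "bird")

# ...and an index-to-name mapping (the reverse of today's class_to_idx).
idx_to_class = dict(enumerate(classes))

# The old direction is a one-liner when needed:
class_to_idx = {name: idx for idx, name in idx_to_class.items()}
```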

SegmentationDataset

The following datasets belong to this category: VOCSegmentation


DetectionDataset

The following datasets belong to this category: CocoDetection, VOCDetection


ToDo

  • The following datasets need sorting into the three categories: Caltech101, Caltech256,
    CelebA, CityScapes, Cococaptions, Flickr8k, Flickr30k, LSUN, Omniglot, PhotoTour, SBDataset (shouldn't this be just called SBD?), SBU, SEMEION, STL10, and USPS
  • Add some common arguments / parameters for the SegmentationDataset and DetectionDataset

Thoughts and suggestions?
