Standardization of the datasets #1080
Thanks a lot @pmeier for starting this discussion! Here are my general thoughts:
Some more specific thoughts:
Agreed.
Question(s): What is the standard way of working with Or is this still in challenge mode? In that case we need to decide if we want to have datasets with a single
What do you mean by that? I would define a
Can you think of a dataset besides
I like the idea of
As mentioned in #1086, should we protect all internal parameters (e.g.
I would like to be very careful about adding some functions in VisionDataset, though TorchText has split/download/splits functions in the abstract dataset (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L16). IMO, a simple abstract dataset will give you more flexibility. We are now testing a new pattern for TorchText: instead of using the abstract dataset (mentioned above), I move some functions to the utils and keep them orthogonal, for example the TorchText.utils.download function. We expect some feedback on the new pattern before we make any significant changes to the abstract dataset.
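To illustrate the orthogonal-utilities pattern described here, a minimal sketch of a standalone download helper that no dataset class depends on (the name and signature are illustrative, not the actual torchtext API):

```python
import os
import urllib.request

def download(url, root, filename=None):
    """Fetch url into root and return the local path; skip if already present."""
    filename = filename or os.path.basename(url)
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, filename)
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path
```

Keeping helpers like this out of the abstract class means a dataset that cannot be downloaded simply never calls them.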
Each COCO has a different annotation file for
Some datasets do not define a train / test split. So it's up to the user to decide how to split it, for example using
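One common way to make such a split, sketched with `torch.utils.data.random_split` on a toy `TensorDataset` (this particular utility is an assumption on my part, not necessarily the one meant above):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A stand-in for a dataset that ships without a predefined train / test split.
full = TensorDataset(torch.randn(100, 3, 32, 32), torch.randint(0, 10, (100,)))

# The user decides the proportions of the split.
train_set, test_set = random_split(full, [80, 20])
print(len(train_set), len(test_set))  # 80 20
```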
I mean, sometimes the user needs to log in / authenticate to download certain datasets (see #1042 (comment) for an example), so we wouldn't be able to provide a
Maybe, but I still want to double-check a bit first. cc @aadcock what are your thoughts on this in general?
IMO, in the end it comes down to two philosophies since there are so many different datasets that we cannot foresee every possible option:
I would argue that datasets without a download option should still define

```python
def _download(self):
    raise RuntimeError("Dataset cannot be downloaded")
```

To cite the Python Zen: "Special cases aren't special enough to break the rules."
Right now we roll with option 1, and as a result of the heterogeneity we started this discussion about standardisation.
@pmeier one core principle of torchvision (and PyTorch as well) is that most things should be easily understandable. So if a user wants to create / modify a new dataset, it is very easy for them to do so by reading the source code, without having to jump over a lot of abstractions.

I do think that having the right level of abstraction is a desirable thing, but we should carefully design it so that it is not a blocker for us in the future. And it's easier to add something than to remove it, which is worth keeping in mind before making all datasets have a `download` method.

One could say that ideally, we would want to have a common API like

```python
d1 = torchvision.datasets.get('coco', 'train2017', transform, download=True)
d2 = torchvision.datasets.get('voc-detection', 'train', transform)
d3 = torchvision.datasets.get('imagenet', 'train2012', transform)
d4 = torchvision.datasets.get('mnist', 'train', transform)
...
```

all with the same high-level API. But this makes some assumptions, and maybe we might want to enforce those assumptions/limitations at a higher level than the

Sure, this doesn't address what attributes each dataset should have, and that's one of the points that I want to discuss in this issue.
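For illustration only, such a `get` entry point could be backed by a simple registry; this is a hypothetical sketch, not an existing torchvision API:

```python
_DATASETS = {}

def register(name):
    """Class decorator that records a dataset under a string key."""
    def wrapper(cls):
        _DATASETS[name] = cls
        return cls
    return wrapper

def get(name, split, transform=None, **kwargs):
    """Single high-level constructor dispatching on the dataset name."""
    try:
        cls = _DATASETS[name]
    except KeyError:
        raise ValueError(f"unknown dataset {name!r}") from None
    return cls(split, transform=transform, **kwargs)
```

The registry only works if every registered class accepts the same `(split, transform, ...)` signature, which is exactly the assumption being questioned above.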
@fmassa IMO, the abstract dataset could include at least data and transforms (field for torchtext) and a full implementation of `__getitem__` and `__len__`. Those components are highly repetitive across most datasets (if my understanding is correct). Something like this:

We could carefully add more stuff later on as the abstract class evolves.
@zhangguanheng66 Something like this?

```python
class VisionDataset(...):
    ...

    def __getitem__(self, index):
        image, target = self._images[index], self._targets[index]
        if self.transform is not None:
            image, target = self.transform(image, target)
        return image, target

    def __len__(self):
        return len(self._images)

    @property
    def _images(self):
        raise NotImplementedError

    @property
    def _targets(self):
        raise NotImplementedError
```
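To make the contract concrete, a toy subclass might look as follows (hypothetical; it assumes the `VisionDataset` sketch above is given a real base class): only the two properties have to be provided, and `__getitem__` / `__len__` come for free.

```python
from PIL import Image

class FakeImages(VisionDataset):
    """Toy dataset: five blank 32x32 images with alternating labels."""

    def __init__(self, transform=None):
        self.transform = transform

    @property
    def _images(self):
        # In a real dataset these would be read from disk lazily.
        return [Image.new("RGB", (32, 32)) for _ in range(5)]

    @property
    def _targets(self):
        return [0, 1, 0, 1, 0]
```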
@pmeier Sure, I agree. I'm just wondering about the pros and cons of including the data (e.g. images, targets) in the abstract dataset class. I guess it won't hurt us, right?
I'm late to the party here and have comments on a few different pieces of this thread. Here are a few thoughts I've collected about things that I see commonly requested:
Since there has been no more input for some days, can I regard the proposal of @zhangguanheng66 as consensus and thus add it to the top post?
@pmeier we are still discussing this internally, and we still seem to have no consensus on how to structure it yet.
How does this approach scale to different tasks/datasets with multiple tasks? On COCO, you can do detection, human keypoint detection and segmentation (of either objects, stuff with CocoStuff, or everything together with panoptic segmentation).

Another part of "the Python way" is duck typing: "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck". No need to have a common superclass to implement an interface, as long as the required parts are there.
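To illustrate the duck-typing point: a map-style object needs no common superclass, implementing the protocol is enough for the existing tooling (a toy sketch):

```python
from torch.utils.data import DataLoader

class DuckDataset:
    """Not a torch.utils.data.Dataset subclass, but quacks like one."""

    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return idx, idx ** 2

# DataLoader only requires __getitem__ and __len__ for map-style datasets.
for batch in DataLoader(DuckDataset(), batch_size=2):
    print(batch)  # [tensor([0, 1]), tensor([0, 1])], then [tensor([2, 3]), tensor([4, 9])]
```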
Adding to the conversation: I have also been trying to come up with a solution such that I can use a similar interface to load different types of data. The structure that I have for now is https://gist.github.com/ahmed-shariff/dbf3bee6fa2bf4924f95ac2b65620386, which I am trying to include in my pipeline. The idea is that I am trying to go a step higher. Based on the dataset and how it's used, I'd implement the
This standardization will make writing code so much cleaner; right now every constructor of every dataset has its own params, which makes things inflexible during development. For instance cifar10 has the `train` param indicating which subset is used, while svhn has `split`. My two cents are to draw inspiration from the flexibility and modularity of the Linux system. For instance, if for some datasets no download is possible due to login or other things, then allow the user to write their own download method and hook that into the DataLoader. But the naming convention should be the same across all of them, either that be it
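Sketching what such a user-supplied download hook could look like (every name here is made up, and the hook is attached to the dataset rather than the DataLoader):

```python
class HookableDataset:
    """A dataset that accepts a user-supplied download callable, so data
    behind a login can still follow the common constructor interface."""

    def __init__(self, root, split, download_fn=None):
        self.root = root
        self.split = split
        if download_fn is not None:
            download_fn(root)  # e.g. authenticates first, then fetches the files

def my_download(root):
    print(f"logging in and downloading into {root}")

dataset = HookableDataset("data/", "train", download_fn=my_download)
```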
Any progress on this topic?
@wolterlw there has been no further progress on this topic for now. As you point out, each task has its own particularities and there is no single abstraction (more specific than just a plain `Dataset`) that covers them all.

My current thinking is that there should not be much structure imposed by default -- the more structure we add, the less flexible we will be. Worse still, trying to add elaborate abstractions to cover many cases will just make the entry cost of the library higher.

One option that I considered at some point was to just let the datasets from torchvision return the raw data as they are stored originally. This gives full flexibility in how the user treats it. Then, each different pipeline / application can have a specific transform / formatter.

Something along the lines of

```python
class MyDataset:
    def __getitem__(self, idx):
        ...
        return dict(image=image, image_id=image_id, params=params)


class MyDatasetFormatter:
    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, idx):
        data = self.dataset[idx]
        return data['image'], data['params']
```

One reason to do so is that the output format of the dataset is tightly coupled with the training loop that you have. And because torchvision doesn't provide training loops in the library itself (but only reference training scripts), I haven't moved forward with this.

Let me know what you think
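As a concrete (hypothetical) instance of this pattern on top of today's `CocoDetection`, whose raw output is a PIL image plus the list of COCO annotation dicts:

```python
import torch
from torchvision.datasets import CocoDetection

class CocoDetectionFormatter:
    """Converts CocoDetection's raw output into the (image, boxes, labels)
    format a particular detection training loop expects."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, idx):
        image, anns = self.dataset[idx]
        # COCO boxes are [x, y, width, height]; kept as-is here.
        boxes = torch.tensor([ann["bbox"] for ann in anns], dtype=torch.float32)
        labels = torch.tensor([ann["category_id"] for ann in anns], dtype=torch.int64)
        return image, boxes, labels

    def __len__(self):
        return len(self.dataset)
```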
@fmassa I agree with that. The code I shared before was actually built around that idea (in retrospect, that code is a little wonky because I am trying to do a lot of things at the same time). How the data is loaded needs to be decoupled from how the data is used by a pipeline.
Does something along these lines make sense:

```python
dataset_foo = DatasetFoo()
foobar_pipeline_train_dataset = BarPipelineDataFormatter(dataset_foo.train_data(), mode=TRAIN)
foobar_pipeline_test_dataset = BarPipelineDataFormatter(dataset_foo.test_data(), mode=TEST)
```
@fmassa I've mentioned in #963 that using a Dataset subclass to simply index all the files and offload the rest of the work to transforms seems to be reasonably flexible. It's unreasonable to spend this much time rewriting pretty much the same functionality. I've started a discussion on the PyTorch forum, but it hasn't really led anywhere yet.
Each dataset's README would tell you:
Point being, there should be a single place where people can look for dataset wrappers that are ready to be used, because in my experience it takes a lot of time to build those for yourself. Let me know what you think
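A minimal sketch of the index-only pattern described above (names and the glob pattern are illustrative): the dataset only indexes paths, and all decoding / preprocessing lives in the transform chain.

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class FileIndexDataset(Dataset):
    """Only indexes files; everything else is offloaded to transforms."""

    def __init__(self, root, transform, pattern="**/*.jpg"):
        self.paths = sorted(Path(root).glob(pattern))
        self.transform = transform

    def __getitem__(self, idx):
        # The transform chain starts from the raw path.
        return self.transform(self.paths[idx])

    def __len__(self):
        return len(self.paths)

# The first "transform" decodes the file; the rest is standard preprocessing.
pipeline = transforms.Compose([
    lambda path: Image.open(path).convert("RGB"),
    transforms.Resize(256),
    transforms.ToTensor(),
])
dataset = FileIndexDataset("data/images", pipeline)
```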
I also recently came across an inconsistency between datasets which is slightly annoying, and figured I'd mention it here instead of in a new issue (but happy to do so if it's easier). Currently, the targets of CIFAR-10 and CIFAR-100 are of type `list`, while the targets of MNIST are of type `torch.Tensor` (compare the CIFAR code with the MNIST code). It would be good to standardize the types, all as `torch.Tensor`.
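The inconsistency is easy to see directly (assuming the data is downloaded to `./data`):

```python
from torchvision import datasets

cifar = datasets.CIFAR10("./data", download=True)
mnist = datasets.MNIST("./data", download=True)
print(type(cifar.targets))  # <class 'list'>
print(type(mnist.targets))  # <class 'torch.Tensor'>
```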
@tyleryzhu You are right that this is inconsistent, but we never aimed for consistency at the attribute level. The
This is a discussion issue which was kicked off by #1067. Some PRs that contain ideas are #1015 and #1025. I will update this comment regularly with the consensus achieved during the discussion.
Disclaimer: I have never worked with segmentation or detection datasets. If I make some wrong assumptions regarding them, feel free to correct me. Furthermore, please help me fill in the gaps.
Proposed Structure
This issue presents the idea to standardize the `torchvision.datasets`. This could be done by adding parameters to the `VisionDataset` (`split`) or by subclassing it and adding task-specific parameters (`classes` or `class_to_idx`) to the new classes. I imagine it something like this:
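(A rough, hypothetical sketch of such a hierarchy based on the names in this post; the constructor signature is illustrative:)

```python
class VisionDataset:
    def __init__(self, root, split, download=True, transform=None):
        ...

class ClassificationDataset(VisionDataset):
    # adds classes and class_to_idx / idx_to_class
    ...

class SegmentationDataset(VisionDataset):
    ...

class DetectionDataset(VisionDataset):
    ...
```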
For our tests we could then have a `generic_*_dataset_test` as is already implemented for `ClassificationDataset`s.

VisionDataset
As discussed in Standardisation of Dataset API split argument name #1067 we could unify the argument that selects different parts of the dataset. IMO `split` as a `str` is the most general, but still clear, term for this. I would implement this as a positional argument within the constructor. This should work for all datasets, since in order to be useful each dataset should have at least a training and a test split. Exceptions to this are the `Fakedata` and `ImageFolder` datasets, which will be discussed separately.

IMO every dataset should have a `_download` method in order to be useful for every user of this package. We could have the constructor take `download=True` as a keyword argument and call the `download` method within it. As above, the `Fakedata` and `ImageFolder` datasets will be discussed below.

Fakedata and ImageFolder

What makes these two datasets special is that there is nothing to download and they are not split in any way. IMO they are not special enough to not generalise the `VisionDataset` as stated above. I propose that we simply remove the `split` and `download` arguments from their constructors and raise an exception if someone calls the `download` method.

Furthermore the `Fakedata` dataset is currently a `ClassificationDataset`. We should also create a `FakeSegmentationData` and a `FakeDetectionData` dataset.

ClassificationDataset
The following datasets belong to this category: `CIFAR*`, `ImageNet`, `*MNIST`, `SVHN`, `LSUN`, `SEMEION`, `STL10`, `USPS`, `Caltech*`

- returns `PIL.Image, int` if indexed
- `classes` parameter, which is a `tuple` with all available classes in human-readable form
- `class_to_idx` parameter, which is a dictionary that maps the human-readable class to its index used as target. I propose to change the direction, i.e. create an `idx_to_class` parameter, since IMO this is the far more common transformation.

SegmentationDataset
The following datasets belong to this category: `VOCSegmentation`

DetectionDataset

The following datasets belong to this category: `CocoDetection`, `VOCDetection`
ToDo
- Assign `Caltech101`, `Caltech256`, `CelebA`, `CityScapes`, `Cococaptions`, `Flickr8k`, `Flickr30k`, `LSUN`, `Omniglot`, `PhotoTour`, `SBDataset` (shouldn't this be just called `SBD`?), `SBU`, `SEMEION`, `STL10`, and `USPS` to one of the categories above
- Add some common arguments / parameters for the `SegmentationDataset` and `DetectionDataset`
Thoughts and suggestions?