compatibility layer between stable datasets and prototype transforms? #6662

Closed
pmeier opened this issue Sep 28, 2022 · 8 comments · Fixed by #6663
@pmeier (Collaborator) commented Sep 28, 2022:

The original plan was to roll out the datasets and transforms revamp at the same time, since they somewhat depend on each other. However, it is becoming more and more likely that the prototype transforms will be finished sooner. Thus, we need some compatibility layer in the meantime. This issue explains how transforms are currently used with the datasets, what will or will not work without a compatibility layer, and what such a compatibility layer might look like.

Status quo

Most of our datasets support the transform and target_transform idiom. These transformations are applied separately to the first and second item of the raw sample returned by the dataset. For classification tasks this is usually sufficient, although I've never seen a practical use for target_transform:

dataset = torchvision.datasets.ImageFolder(
    traindir,
    presets.ClassificationPresetTrain(
        crop_size=train_crop_size,
        interpolation=interpolation,
        auto_augment_policy=auto_augment_policy,
        random_erase_prob=random_erase_prob,
    ),
)

However, the separation of the transforms breaks down when image and label need to be transformed at the same time, e.g. for CutMix or MixUp. These are currently applied through a custom collation function for the dataloader:

# mixup_transforms is a list of the RandomMixup / RandomCutmix transforms defined in the reference scripts
mixupcutmix = torchvision.transforms.RandomChoice(mixup_transforms)
collate_fn = lambda batch: mixupcutmix(*default_collate(batch))  # noqa: E731
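
For context, that collate_fn is then passed to the DataLoader, so MixUp / CutMix run on the already collated batch. A minimal sketch of the wiring (the batch size and the other loader arguments here are placeholders):

from torch.utils.data import DataLoader

data_loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    collate_fn=collate_fn,  # applies MixUp / CutMix to each collated batch
)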

Since these transforms do not work with the standard idioms, they never made it out of our references into the library.

For other tasks such as segmentation or detection, transforming input and target at the same time is not a special case but the norm. Datasets for these tasks support the transforms parameter, which is called with the complete sample and is thus able to support all use cases.

Since even datasets for the same task have very diverse outputs, there were only two options short of revamping the APIs completely:

  1. Unify the datasets' outputs on the dataset itself.
  2. Unify the datasets' outputs through a compatibility layer.

When this first came up in the past, we went with option 2: in our references, we unified the output of a few select datasets for a specific task, so that we could apply custom joint transformations to them. Since we didn't want to commit to the interface, neither the minimal compatibility layer nor the transformations made it into the library. Thus, although some of our datasets in theory support joint transformations, users have to implement them themselves.
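
To illustrate what users currently have to write themselves, here is a rough sketch of a hand-written joint transform for CocoDetection (hypothetical code, not taken from our references): it flips the image and mirrors the XYWH boxes so that both stay in sync.

import PIL.Image
import torchvision

def joint_transform(image, target):
    # flip the image and mirror the XYWH bounding boxes so that image and target stay in sync
    width, _ = image.size
    image = image.transpose(PIL.Image.FLIP_LEFT_RIGHT)
    for annotation in target:
        x, y, w, h = annotation["bbox"]
        annotation["bbox"] = [width - x - w, y, w, h]
    return image, target

dataset = torchvision.datasets.CocoDetection(..., transforms=joint_transform)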

Do we need a compatibility layer?

The new transformations support the joint use case out of the box, meaning all the custom transformations from our references are now part of the library. Plus, all transformations that previously only supported images, e.g. resizing or padding, now also support bounding boxes, masks, and so on.

The information about which part of the sample has which type is not communicated through the sample structure, i.e. "the first element is an image and the second one is a mask", but rather through the actual type of the object. We introduced several tensor subclasses that will be rolled out together with the transforms.
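
As a rough illustration of what these subclasses look like in use (the exact constructor arguments of the prototype features are assumptions on my part and may still change):

import torch
from torchvision.prototype import features

# plain data becomes recognizable to the new transforms once it is wrapped
image = features.Image(torch.rand(3, 512, 512))
boxes = features.BoundingBox(
    torch.rand(8, 4),
    format=features.BoundingBoxFormat.XYXY,
    image_size=(512, 512),
)
mask = features.Mask(torch.zeros(8, 512, 512, dtype=torch.uint8))
label = features.Label(torch.randint(0, 10, (8,)))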

By treating simple tensors, i.e. tensors that are not one of the new subclasses, as images, the new transformations are fully BC1. Thus, if you previously only used the separate transform and target_transform idiom, you can continue to do so and the new transforms will not get in your way:

import torch
from torchvision import datasets
from torchvision.prototype import transforms

transform = transforms.Compose(
    [
        transforms.PILToTensor(),
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.ImageNet(..., transform=transform)

image, label = dataset[0]
assert isinstance(image, torch.Tensor)
assert image.shape[-2:] == (224, 224)
assert isinstance(label, int)

The transforms also work out of the box if you want to stick to PIL images:

import PIL.Image
from torchvision import datasets
from torchvision.prototype import transforms

transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.ImageNet(..., transform=transform)

image, label = dataset[0]
assert isinstance(image, PIL.Image.Image)
assert image.size == (224, 224)
assert isinstance(label, int)

Although it seems like the new transforms could also be used out of the box if the dataset supports the transforms parameter, this is unfortunately not the case. While the new datasets will provide the sample parts wrapped into the new tensor subclasses, the old datasets, i.e. the only ones available during the roll-out of the new transforms, do not.

Without the wrapping, the transform does not pick up on bounding boxes and subsequently does not transform them:

import torch
import PIL.Image
from torchvision import datasets
from torchvision.prototype import transforms

transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.CocoDetection(..., transforms=transform)

image, target = dataset[0]
assert isinstance(image, PIL.Image.Image)
assert image.size == (224, 224)

assert len(target) == 8

bbox = target[2]["bbox"]
# the bounding boxes were not downsized and are thus now out of sync with the image
torch.testing.assert_close([int(coord) for coord in bbox], [249, 229, 316, 245])

segmentation = target[2]["segmentation"]
# the masks were not downsized either and are also out of sync with the image. Plus, they are still encoded and
# users have to decode them themselves
assert isinstance(segmentation, list) and all(isinstance(item, (int, float)) for item in segmentation)

Masks will be transformed, but without wrapping they are treated as regular images. This means that by default InterpolationMode.BILINEAR is used for interpolation, which corrupts the information:

import torch
from torchvision import datasets
from torchvision.prototype import transforms

transform = transforms.Compose(
    [
        transforms.PILToTensor(),
        # we convert to float here to make the bilinear interpolation visible
        transforms.ConvertImageDtype(torch.float64),
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.VOCSegmentation(..., transforms=transform)

image, mask = dataset[0]
assert isinstance(image, torch.Tensor)
assert image.shape[-2:] == (224, 224)
assert isinstance(mask, torch.Tensor)
assert mask.shape[-2:] == (224, 224)
# If the mask had been interpolated with InterpolationMode.NEAREST, mask * 255 would only contain integer values in the range [0, 255]
assert torch.any(torch.fmod(mask * 255, 1) > 0)

Thus, unless we provide a compatibility layer until our datasets do this wrapping automatically, the prototype transforms bring little real benefit to users of our datasets.

Proposal

I propose to provide a thin wrapper for the datasets that does nothing other than wrap the returned samples into the new tensor subclasses. The new object behaves exactly like the dataset did before, but upon accessing an element, i.e. calling __getitem__, the sample is wrapped before it is passed into the transforms.

from torchvision import datasets
from torchvision.prototype import transforms, features

transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.ImageNet(..., transform=transform)
dataset = features.VisionDatasetFeatureWrapper.from_torchvision_dataset(dataset)

image, label = dataset[0]
assert isinstance(image, features.Image)
assert image.image_size == (224, 224)
assert isinstance(label, features.Label)
assert label.to_categories() == "tench, Tinca tinca"
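
Conceptually, such a wrapper needs little more than the following sketch. This is a simplified illustration, not the actual VisionDatasetFeatureWrapper: here the dataset is assumed to be constructed without transforms and the wrapper applies them itself after wrapping, whereas the PoC in #6663 works with the transforms that were passed to the dataset.

from torch.utils.data import Dataset

class DatasetFeatureWrapperSketch(Dataset):
    def __init__(self, dataset, wrap_sample, transforms=None):
        self.dataset = dataset            # constructed without any transforms
        self.wrap_sample = wrap_sample    # callable: raw sample -> sample wrapped into the new subclasses
        self.transforms = transforms      # joint transform applied after wrapping

    def __getitem__(self, index):
        sample = self.wrap_sample(self.dataset[index])
        if self.transforms is not None:
            sample = self.transforms(sample)
        return sample

    def __len__(self):
        return len(self.dataset)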

Going back to the segmentation example from above, with the wrapper in place the segmentation mask is now correctly
interpolated with InterpolationMode.NEAREST:

import torch
from torchvision import datasets
from torchvision.prototype import transforms, features

transform = transforms.Compose(
    [
        # we convert to float here to make the bilinear interpolation visible
        transforms.ToDtype(torch.float64, features.Mask),
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.VOCSegmentation(..., transforms=transform)
dataset = features.VisionDatasetFeatureWrapper.from_torchvision_dataset(dataset)

image, mask = dataset[0]
assert isinstance(mask, torch.Tensor)
assert mask.shape[-2:] == (224, 224)
assert not torch.any(torch.fmod(mask * 255, 1) > 0)

In general, the wrapper should not change the structure of the sample unless that is necessary to properly use the new transformations. For example, the target of CocoDetection is a list of dictionaries, in which each dictionary holds the information for one object. Our models, however, require a dictionary where the value of the bounding box key is an (N, 4) tensor, where N is the number of objects. Furthermore, while our basic transforms can work with individual bounding boxes, more elaborate ones that we ported from the reference scripts also require this format.

Thus, if needed, we also perform this collation inside the wrapper:

import torch
from torchvision import datasets
from torchvision.prototype import transforms, features

transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
    ]
)
dataset = datasets.CocoDetection(..., transforms=transform)
dataset = features.VisionDatasetFeatureWrapper.from_torchvision_dataset(dataset)

image, target = dataset[0]

assert isinstance(image, features.Image)
assert image.shape[-2:] == (224, 224)

bbox = target["bbox"]
assert isinstance(bbox, features.BoundingBox)
assert bbox.shape == (8, 4)
torch.testing.assert_close(bbox[2].int().tolist(), [116, 106, 152, 114])

Furthermore, if the data is in an encoded state, like the masks that CocoDetection provides, it will be decoded so that it can be used directly by the transforms and models:

segmentation = target["segmentation"]
assert isinstance(segmentation, features.Mask)
assert segmentation.shape == (8, 224, 224)
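
Schematically, the collation and wrapping for a CocoDetection target could look like the following sketch. The helper name and the exact feature constructor arguments are assumptions on my part, and decoding the segmentations into a features.Mask is omitted since it additionally requires pycocotools; the actual logic lives in #6663.

from torchvision.prototype import features

def collate_coco_target(annotations, image_size):
    # stack the per-object XYWH boxes into a single (N, 4) tensor and wrap it
    boxes = features.BoundingBox(
        [annotation["bbox"] for annotation in annotations],
        format=features.BoundingBoxFormat.XYWH,
        image_size=image_size,
    )
    labels = features.Label([annotation["category_id"] for annotation in annotations])
    # decoding annotation["segmentation"] into a (N, H, W) features.Mask is omitted here
    return dict(bbox=boxes, labels=labels)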

The VisionDatasetFeatureWrapper class in the examples above is implemented as a proof of concept in #6663.

Conclusion

If we don't roll out the new datasets at the same time as the new transformations, the transformations on their own will bring little value to the user. Their whole power can only be unleashed if we add a thin compatibility layer between them and the "old" datasets. I've proposed what is, IMO, a clean implementation for such a compatibility layer.

cc @vfdev-5 @datumbox @bjuncek

Footnotes

  1. Fully BC for what is discussed here. The only thing that will be BC breaking is that the new transforms will no longer be torch.jit.script'able whereas they were before.

@NicolasHug (Member):

Thanks for the issue Philip. What is the timeline here? Do we expect the V2 transforms to be out of prototype by 0.14? Because if not, there's a chance that (at least some) new datasets will be ready by 0.15. If we're sure we'll move the transforms away from prototype before the datasets, then we should also be thinking about

a) what users will need to do when the new datasets become available. Do they remove the wrapper? Ideally they would only need to change their code once, not twice.
b) what happens if we never end up releasing the new datasets?

@pmeier (Collaborator, Author) commented Sep 29, 2022:

> What is the timeline here? Do we expect the V2 transforms to be out of prototype by 0.14? Because if not, there's a chance that (at least some) new datasets will be ready by 0.15.

I think 0.14 is unlikely given that it is right around the corner, so my guess is 0.15. But I'll let @datumbox comment on that. And indeed, if we roll out together, this discussion is moot.

> a) what users will need to do when the new datasets become available. Do they remove the wrapper? Ideally they would only need to change their code once, not twice.

My points below assume that users actually want to use the features of transforms V2. As explained in my top comment, the new transforms are BC, so users who just want to continue doing what they were doing before don't have to use the proposed compatibility layer and don't have to change anything.

That depends on how we release the datasets V2:

  • The original plan was to load them through a function by their name. This would allow us to keep the classes that build the datasets private, and in turn allow the V1 and V2 APIs to exist in the same namespace, thus keeping BC. If we go that route, users have to change their code once to use the wrappers and once again when the datasets V1 are removed.
  • Some time ago we changed this plan to also make the new classes public. This means they will replace the old classes, which will be a hard BC break. Thus, users will have to change their code once to use the wrappers and one more time to use the new datasets.

If we don't roll out at the same time, but want to actually push the new transforms from the moment they are no longer prototypes, users will probably have to change their code twice. Depending on whether we want to deprecate / remove datasets V1 at all (I remember there was some offline discussion to just keep them around, but no longer maintain them), users could also get away with one change if they just don't use the datasets V2.

> b) what happens if we never end up releasing the new datasets?

I think what we currently call datasets V2 bundles multiple things:

  1. Switching from map-style to iter-style datasets using torchdata
  2. Changing the return type from tuples to dictionaries while also returning more than just the bare minimum
  3. Wrapping the returned data into the new tensor subclasses

Each of these points can somewhat stand on its own. Still, each point is BC breaking, which is why we wanted to release them at once to avoid multiple BC breaks in subsequent versions.

If we decide to walk back on datasets V2 in its current state, we need to decide whether we keep parts of it. In some form we need 3. to unleash the power of transforms V2. We could

  1. permanently use a compatibility layer as proposed in this issue. This would keep full BC for datasets V1, and users can opt in if they want to use them with transforms V2. Of course this means a worse UX, since users now need to wrap the dataset instead of that happening automatically.
  2. BC-break the datasets V1 and wrap the output types in the new tensor subclasses. Note that we don't need to go for the dictionary output (2. from above), so this would not be as hard as going for datasets V2 completely.

@datumbox (Contributor):

> I think 0.14 is unlikely given that it is right around the corner, so my guess is 0.15. But I'll let @datumbox comment on that. And indeed, if we roll out together, this discussion is moot.

I can confirm that there is no plan to release Transforms V2 in 4 weeks. We are pretty much in active development and benchmarking. The API will remain in prototype and we can explore a path to release in Q1. Some parts of the API, such as the functionals, could be released first as they are now fully BC & JIT-scriptable, but the classes aren't, so we need to be very careful about how we roll them out.

@datumbox (Contributor):

I think the option that @pmeier mentions is viable. Whether or not we will implement it will require a lot of discussion, because ideally we would like the new Datasets and Transforms to be rolled out together. Any move that doesn't do that will hinder the adoption of both solutions and in my eyes is more of a nuclear option.

One alternative workaround for unleashing the power of Transforms if the Datasets aren't ready, but without massive BC issues, is the following. We could create a new FeatureWrapper Transform class that can be configured on the constructor with a dictionary that describes how to grab the input from the dataset (name or location in the input etc.) and map it to the appropriate _Feature type. This is not perfect, as we miss out on metadata such as the colour space, the label categories etc. But it is also self-contained within the new Transforms V2 and is pretty much just a generic solution for what we already do at #6433 to test the transforms.
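
To make the idea more concrete, such a transform could look roughly like the following sketch. This is entirely hypothetical: the spec maps keys of the target dictionary to callables that wrap the corresponding value, and the image is assumed to already be a plain tensor.

from torchvision.prototype import features

class FeatureWrapperSketch:
    def __init__(self, spec):
        # e.g. spec = {"labels": features.Label, "boxes": <callable building a features.BoundingBox>}
        self.spec = spec

    def __call__(self, image, target):
        image = features.Image(image)  # assumes a plain tensor input
        target = {key: self.spec.get(key, lambda value: value)(value) for key, value in target.items()}
        return image, target

As noted, metadata such as the image size needed for bounding boxes is not available to such a per-key wrapper, which is one of the complications discussed further down in this thread.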

@pmeier (Collaborator, Author) commented Sep 29, 2022:

> We could create a new FeatureWrapper Transform class that can be configured on the constructor to receive a dictionary that describes how to grab the input from the dataset (name or location in the input etc.) and map it to the appropriate _Feature type.

This was my first thought as well, but thinking about it more, this is more complicated than wrapping the dataset:

  1. To provide a convenient interface, users need to be able to get the right wrapper transform without manually specifying how the sample needs to be wrapped. Without this, more complicated datasets like COCO will be a pain for users to configure. Have a look at #6663 and how much dense logic is needed to wrap the CocoDetection samples.

     To provide this wrapping for the user, we need to know which dataset they are using. However, the transform needs to be created before the dataset, since it is passed to the dataset's constructor, resulting in a chicken-and-egg problem. We also can't derive the correct wrapping transform from the dataset class alone, since some datasets change their output type based on some input arguments. Thus, in general, we would need the dataset class as well as all its parameters to perform the correct wrapping.

  2. The datasets take either transform, target_transform, or transforms, so we would have to provide three different wrapping transforms.

@pmeier (Collaborator, Author) commented Oct 12, 2022:

After a longer offline discussion with @datumbox, we agreed it would be beneficial to add a wrapper transform that needs to be manually specified, in addition to the dataset wrapper proposed here. That can help in the following two use cases:

  1. Users have defined their own datasets, but still want to conveniently use the new transforms.
  2. Users use our datasets, but already have some logic in place that brings the data into the right shape and thus only need to wrap the plain tensors into the new subclasses.

The VisionDatasetFeatureWrapper from the PoC implementation in #6663 already supports this, but it would still wrap a dataset. Since we need to specify the wrapping manually anyway, the wrapping can also happen on the transform level, thus not touching the datasets at all.

There are a few new questions that we need to answer now. For illustration purposes, I'm going to use the following detection sample:

sample = (
    torch.rand(3, 512, 512),
    dict(
        area=0.0,
        labels=torch.randint(0, 10, (8,)),
        boxes=torch.rand(8, 4),
    ),
)

  1. How should users specify how the wrapping should take place? I came up with two possible variants:

    i. Mirror the sample structure with the wrapper definition:
    wrappers = ( 
        image_wrapper,
        dict(
            labels=label_wrapper,
            boxes=bounding_box_wrapper,
        ),
    )

    This is what the PoC implementation in #6663 does for now.

    ii. Specify the indices the wrappers should be applied to:
    wrappers = (
        (0, image_wrapper),
        ((1, "labels"), label_wrapper),
        ((1, "boxes"), bounding_box_wrapper),
    )
  2. Do users always need to provide a complete wrapper specification, or should we assume that everything not specified will not be wrapped? In 1. above I made that assumption, which is why area is not handled. If we decide to make this assumption, I would prefer variant ii. from above, since partially mirroring the sample structure might not always be possible. Plus, not wrapping a tensor in the input might lead to the transforms mishandling it, since it would be treated as an image. One option is to wrap it into a no-op feature, and this is what the PoC implementation in #6663 already does.

  3. How do we want to handle dependent items inside the sample? For example, the bounding_box_wrapper from above needs to know the image size, but with the setup proposed above it does not have access to it. One way to achieve this is to write a wrapper for the whole sample:

    def sample_wrapper(sample):
        image, target = sample
    
        wrapped_image = image_wrapper(image)
        image_size = wrapped_image.shape[-2:]
    
        target["labels"] = label_wrapper(target["labels"])
        target["boxes"] = bounding_box_wrapper(target["boxes"], image_size=image_size)
    
        return wrapped_image, target
    
    # Variant i.
    wrappers_mirror = sample_wrapper
    
    # Variant ii.
    wrappers_indices = [((), sample_wrapper)]

    This means users will need to write the wrapper manually, and thus cannot use the "syntax sugar" introduced in 1. to specify the transformation. Unfortunately, the datasets that benefit the most from transforms V2 fall into this category. Since users still have access to the building blocks (image_wrapper, ...), this is of course still easier than writing the wrapper from scratch.

@datumbox (Contributor):

Good call-out for the bounding_box_wrapper use case, thanks for raising it. Isn't this counter-example a deal breaker? I mean, one can definitely provide a sample_wrapper and an individual *_wrapper for each type, but what's the benefit of doing that versus just writing their own transform? The whole idea was to provide something fast and easy that reduces the amount of code they have to write for the two use cases you described. But if they need to write a custom implementation anyway, then it seems to me that the Lambda transform can do exactly that. Am I missing something?

@NicolasHug (Member):

Some quick updates on that after syncing with @pmeier:
