
[Discussion] How do we want to handle torchvision.prototype.features.Feature's? #5045

pmeier (Collaborator) opened this issue Dec 7, 2021 · 19 comments

This issue should spark a discussion about how we want to handle Feature's in the future. There are a lot of open questions that I'll try to summarize, and I'll give my opinion on each of them. You can find the current implementation under torchvision.prototype.features.

What are Feature's?

Feature's are subclasses of torch.Tensor and their purpose is threefold:

  1. With their type, e.g. Image, they convey information about the data they carry. The prototype transformations (torchvision.prototype.transforms) use this information to automatically dispatch an input to the correct kernel.
  2. They can optionally carry additional meta data that might be needed for transforming the feature. For example, most geometric transformations can only be performed on bounding boxes if the size of the corresponding image is known.
  3. They provide a convenient interface for feature specific functionality, for example transforming the format of a bounding box.
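
As an illustration only (the actual prototype lives under torchvision.prototype.features and differs in detail), the first two roles can be sketched with a plain tensor subclass. The Feature, Image, and BoundingBox classes and the wrap helper below are stand-ins, not the real API:

```python
import torch

class Feature(torch.Tensor):
    """Illustrative stand-in: a Tensor subclass that can carry meta data."""

    @classmethod
    def wrap(cls, data, **meta):
        self = torch.as_tensor(data).as_subclass(cls)
        for name, value in meta.items():
            setattr(self, name, value)  # e.g. image_size for bounding boxes
        return self

class Image(Feature):
    pass

class BoundingBox(Feature):
    pass  # expected meta data: image_size=(height, width)

def horizontal_flip(inpt):
    # Role 1: the type drives dispatch to the right kernel.
    if isinstance(inpt, BoundingBox):
        # Role 2: the transform reads the meta data it needs.
        x1, y1, x2, y2 = inpt.unbind(-1)
        width = inpt.image_size[1]
        return torch.stack([width - x2, y1, width - x1, y2], dim=-1)
    return torch.flip(inpt, dims=(-1,))  # images flip along the last axis
```

With this, a single horizontal_flip entry point handles both feature types, which is the kind of automatic dispatch the prototype transforms aim for.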

There are currently three Feature's implemented:

  • Image,
  • BoundingBox, and
  • Label,

but in the future we should add at least three more:

  • SemanticSegmentationMask,
  • InstanceSegmentationMask, and
  • Video.

What is the policy of adding new Feature's?

We could allow subclassing of Feature's. On the one hand, this would make it easier for datasets to conveniently bundle meta data. For example, the COCO dataset could return a CocoLabel, which in addition to the default Label.category could also have the super_category field. On the other hand, this would also mean that the transforms need to handle subclasses of features well, for example a CocoLabel could be treated the same as a Label.

I see two downsides with that:

  1. What if a transform needs the additional meta data carried by a feature subclass? Imagine I've added a special transformation that needs CocoLabel.super_category. Although on the surface it now supports plain Label's, it will fail at runtime.
  2. Documenting custom features is more complicated than documenting a separate field in the sample dictionary of a dataset.

Thus, I'm leaning towards only having a few base classes.

From what data should a Feature be instantiable?

Some of the features like Image or Video have non-tensor objects that carry the data. Should these features know how to handle them? For example should something like Image(PIL.Image.open(...)) work?

My vote is out for yes. IMO this is very convenient and also not an unexpected semantic compared to passing the data directly, e.g. Image(torch.rand(3, 256, 256)).
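
A hedged sketch of what that could look like: pil_to_tensor is the existing torchvision helper, while this Image class is just an illustration of the constructor behavior under discussion:

```python
import torch

class Image(torch.Tensor):
    """Illustrative sketch: construct from a raw tensor or a PIL image."""

    @staticmethod
    def __new__(cls, data):
        if not isinstance(data, torch.Tensor):
            # Assumed conversion path for PIL images (requires torchvision).
            from torchvision.transforms.functional import pil_to_tensor
            data = pil_to_tensor(data)
        return data.as_subclass(cls)
```

With this, Image(torch.rand(3, 256, 256)) and Image(PIL.Image.open(...)) go through the same constructor.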

Should Feature's have a fixed shape?

Consider the following table:

| Feature                  | .shape                     |
| ------------------------ | -------------------------- |
| Image                    | (*, C, H, W)               |
| Label                    | (*)                        |
| BoundingBox              | (*, 4)                     |
| SemanticSegmentationMask | (*, H, W) or (*, C, H, W)  |
| InstanceSegmentationMask | (*, N, H, W)               |
| Video                    | (*, T, C, H, W)            |

(For SemanticSegmentationMask I'm not sure about the shape yet. Having an extra channel dimension makes the tensor unnecessarily large, but it aligns well with segmentation image files, which are usually stored as RGB)

Should we fix the shape to a single feature, i.e. remove the * from the table above, or should we only care about the shape in the last dimensions to be correct?

My vote is out for having a flexible shape, since otherwise batching is not possible. For example, if we fix bounding boxes to shape (4,) a transformation would need to transform N bounding boxes individually, while for shape (N, 4) it could make use of parallelism.
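
To illustrate the batching argument, here is a hypothetical resize kernel written against the last dimension only; broadcasting lets the same code handle a single (4,) box and a batch of shape (N, 4):

```python
import torch

def resize_boxes(boxes, old_size, new_size):
    """Scale XYXY boxes from old_size to new_size (both (height, width)).

    Works for shape (4,) as well as (*, 4): the scale factors broadcast
    over any leading batch dimensions, so a batch is one vectorized op
    instead of N individual transforms.
    """
    (old_h, old_w), (new_h, new_w) = old_size, new_size
    scale = boxes.new_tensor([new_w / old_w, new_h / old_h,
                              new_w / old_w, new_h / old_h])
    return boxes * scale
```

The function name and signature are made up for this sketch; the point is only that a flexible trailing-dimension contract makes the batched case free.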

On the same note, if we go for the flexible shape, do we keep the singular name of the feature? For example, do we regard a batch of images with shape (B, C, H, W) still as Image or should we go for the plural Images in general? My vote is out for always keeping the singular, since I've often seen something like:

for image, target in data_loader(dataset, batch_size=4):
    ...

Should Feature's have a fixed dtype?

This makes sense for InstanceSegmentationMask, which should always be torch.bool. For all the other features I'm unsure. My gut says to use a default dtype, but also allow other dtypes.
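
The two policies could coexist; an illustrative sketch (not the prototype's actual behavior, and the uint8 default is an assumption for the example):

```python
import torch

class InstanceSegmentationMask(torch.Tensor):
    """Sketch: a feature with a fixed dtype; data is always coerced to bool."""

    @staticmethod
    def __new__(cls, data):
        return torch.as_tensor(data, dtype=torch.bool).as_subclass(cls)

class Image(torch.Tensor):
    """Sketch: a default dtype for integer inputs, but other dtypes allowed."""

    @staticmethod
    def __new__(cls, data, dtype=None):
        tensor = torch.as_tensor(data)
        if dtype is None and not tensor.is_floating_point():
            dtype = torch.uint8  # assumed default for integer image data
        if dtype is not None:
            tensor = tensor.to(dtype)
        return tensor.as_subclass(cls)
```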

What meta data should Feature's carry?

IMO, this really depends on the decision above about fixed / flexible shapes. If we go for fixed shapes, a feature can basically carry any information. If we go for flexible shapes instead, we should only keep meta data that is the same across batched features. For example, BoundingBox.image_size is fine, but Label.category is not.

What methods should Feature's provide?

For now I've only included typical conversion methods, but of course this is not exhaustive.

| Feature                  | method(s)                     |
| ------------------------ | ----------------------------- |
| Image                    | .to_dtype(), .to_colorspace() |
| Label                    | .to_str()                     |
| BoundingBox              | .to_format()                  |
| InstanceSegmentationMask | .to_semantic()                |

cc @bjuncek

vadimkantorov commented Dec 7, 2021

about naming: features and Feature is a very loaded term and will also be hard to google

my only comment on the substance (which I had already voiced in other issues, and which seems to go against the decisions already made): please also create "plumbing" functions that accept raw tensors. This will benefit all users even if the abstractions aren't right or aren't flexible enough (a similar problem of not getting the abstractions right happened with lr schedulers, pytorch/pytorch#68332) and will help in users' existing codebases

pmeier (Collaborator, Author) commented Dec 8, 2021

@vadimkantorov

about naming: features and Feature is a very loaded term and will also be hard to google

I agree 100%, but so far no one (including me) was able to come up with a better name. Do you have one?

please also create "plumbing" functions that accept raw tensors.

Could you elaborate on that? What functions are you talking about? The prototype transforms will accept vanilla tensors as inputs and treat them as Image's for BC.

vadimkantorov commented Dec 8, 2021

Any name, even a more obscure one, would be better than "feature": even VisionFeature in one word, VisionTrait, VisionModality, anything is better than features.Feature.

I mean that to_colorspace and functions for working with masks, bboxes etc should also exist as free functions taking raw tensors.

pmeier (Collaborator, Author) commented Dec 8, 2021

@vadimkantorov

I mean that to_colorspace and functions for working with masks, bboxes etc should also exist as free functions taking raw tensors.

In the current design these are not functions, but rather methods on the new classes. Since a method has access to the meta data, there is no need to pass it as an additional keyword argument.

One solution to support this use case is to always have a method / function pair. For example:

from __future__ import annotations

import torch

class Image(torch.Tensor):
    ...

    @staticmethod
    def convert_colorspace(image: torch.Tensor, *, old_color_space, new_color_space) -> torch.Tensor:
        ...

    def to_colorspace(self, new_color_space) -> Image:
        return Image(
            self.convert_colorspace(self, old_color_space=self.color_space, new_color_space=new_color_space),
            like=self,
            color_space=new_color_space,
        )

With this both variants are supported:

data = torch.rand(3, 256, 256)
Image.convert_colorspace(data, old_color_space="rgb", new_color_space="hsv")
Image(data, color_space="rgb").to_colorspace("hsv")

vadimkantorov commented Dec 8, 2021

Yep, in my opinion it's best to always have a free function, even as torchvision.functional.images.convert_colorspace; same for bboxes and masks. IMO the field is still moving fast and there are many representations for the same concepts that would break the design. Also, nested tensors, masked tensors, or any other quasi-type extensions of core PyTorch tensors that carry meta-information hints along with a tensor could change the design (tensor subclass preservation is still limited in core PyTorch), and the wrapper objects require constant wrapping / unwrapping for interop with existing models. On the other hand, free functions accepting raw tensors and arguments will always be helpful and reusable regardless of future refactorings.

oke-aditya (Contributor) commented
My thought was to name Feature as Structure. That way we would match the detectron2 names: https://github.com/facebookresearch/detectron2/tree/main/detectron2/structures

fmassa (Member) commented Dec 14, 2021

Thanks for raising this discussion.

Here are my thoughts on the questions:

From what data should a Feature be instantiable?

I wouldn't overload the constructor to support things like Image(PIL.Image(...)). At best I would provide a classmethod Image.from_pil_image, but this could just as well be a free function.

Let it be free functions (from @vadimkantorov )

We are considering redesigning the transforms to use the dispatch from PyTorch, and to support torchscript we would need to support free functions with raw tensors. So @vadimkantorov proposal would be covered in that scenario.

What methods should Feature's provide?

Maybe instead of providing methods we should only have free functions that take custom tensors as input. This way we don't stray too far from what raw Tensors already are. So point 3 from the section What are Feature's? would maybe be removed.

What is the policy of adding new Feature's?

I think each Feature instantiation should have clear semantics wrt the functions it will support.
IMO we shouldn't think of a Feature as just a tensor with (random) attributes, but instead as an entity for which each operation has a well-defined meaning.

What meta data should Feature's carry?

My first thought would be to carry as little extra metadata as possible. In the end, each Feature is just a Tensor (and should be able to be used as a regular Tensor!), but with some little enhancements for better usability.

Should Feature's have a fixed shape?

Let's think of this wrt how functions should behave on Features. If our functions support multiple shapes (e.g., for batches), and only in those cases, then I would say it makes sense to impose some constraints on the dimensionality.

Should Feature's have a fixed dtype?

Most probably not? Features are Tensors. The users can decide what they want to do with them; the difference is that some operations behave differently on them compared to normal Tensors. So for SegmentationMask, I would most probably only enable nearest interpolation for the functions acting on it, otherwise all the rest would be the same.
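
For instance, a mask resize could hard-wire nearest interpolation so class ids are never blended. A sketch (the function name is made up; the prototype's actual kernel may differ):

```python
import torch
import torch.nn.functional as F

def resize_segmentation_mask(mask, size):
    """Resize an integer (H, W) class mask to `size` (height, width).

    Nearest interpolation is the only mode offered: bilinear would blend
    neighboring class ids into meaningless in-between values.
    """
    # interpolate expects a (N, C, H, W) float input, so wrap and unwrap.
    out = F.interpolate(mask[None, None].float(), size=size, mode="nearest")
    return out[0, 0].to(mask.dtype)
```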

For example, should images be uint8 or float? It might depend on where we are in the pipeline, but it should still be an image, right?

vadimkantorov commented Dec 14, 2021

Two other arguments to consider: there are already similar abstractions in mmdetection, detectron2 and other frameworks. They're unlikely to migrate to new abstractions (or at least not quickly or completely), but they can probably still reuse and call the "plumbing" functions accepting raw tensors with some format arguments, thus moving more common / utility functions into torchvision.

For masks etc. there may be many representations, e.g. bit tensors if they get implemented, or primitive-based ones that require some rendering: RLE-based (already a prominent format for large datasets), polygon-rendering-based, or DCT-basis-based. So if you end up introducing some classes, please name them specifically rather than generically, as a more efficient representation may appear and there would later be a naming issue. This applies to both classes and "mode" / "format" arguments. Also, a replacement for abstractions could be very specific / standard argument names for raw tensors, e.g. model.forward(float_image_01=my_img). You can even make many arguments keyword-only, so the name always has to be retyped. This will appear in IDE tooltips and will stick in people's minds.

For example, should images be uint8 or float? It might depend on where we are in the pipeline, but it should still be an image, right?

Because of problems like this, IMO an image class abstraction / feature is too leaky and not very useful :(

nairbv (Contributor) commented Feb 9, 2022

re the name "Feature," aside from being an overloaded term, is it also incorrect? Usually a label is not a feature -- I would think a Label subclassing Feature would imply self-supervision.

I want to suggest "example," but aside from also being overloaded, I think that only works if it's a feature and label. Here I think we need a term that can refer to either a feature or label, but not both.

nairbv (Contributor) commented Feb 10, 2022

some other thoughts:

  • IIUC the proposed class hierarchy is Image/Label/BoundingBox/Video/etc. extend Feature, and Feature extends Tensor. I'm having trouble understanding what Feature adds or what functionality it generically provides vs. each type directly extending Tensor.
  • If other domains try to solve similar issues in similar ways (e.g. an audio tensor), what are the implications of having multiple Feature classes from multiple domains, particularly when mixed in multi-modal problems? Is this something we should try to coordinate across domains? E.g., if torchvision demuxed an audio stream from a video, it would be awkward if we later wrapped a torchvision.AudioTensor in a torchaudio.AudioTensor, or if some multi-modal code needed to check isinstance Feature but needed fully-qualified domain-specific Feature classes.
  • The singular Image suggests that there could be per-image metadata. Since there are batch dimensions, should class names be plural (Images)?
  • Are tensor subclasses scriptable? Assuming not, what limitations does that impose?

datumbox (Contributor) commented Feb 10, 2022

@nairbv Thanks for the feedback.

I'll only touch a few points from what you discuss and leave the rest to @pmeier to provide his thoughts as he is more familiar with why some of the choices were originally made:

  • The code you see on the main branch is not the most up-to-date. The best branch for this work is available here. Phillip is working to merge it into main soon; we treat this as top priority to keep the main branch as clean as possible. So some of your concerns about its functionality might be addressed by looking at that specific branch.
  • The Feature class is an internal implementation detail for TorchVision. It is not exposed to the public users. Note that the name is a placeholder (we are trying to find a better name). Other subtypes of the class such as Images, BoundingBoxes etc are exposed but these are vision specific.
  • The Feature subclasses are not JIT-scriptable. To address this we have two types of kernels: the high-level ones that can make use of meta-data (not JIT-scriptable) and the low-level ones that actually perform the operations and receive the meta-data explicitly (JIT-scriptable). Moreover, to avoid pushing the majority of our JIT users to the low-level kernels, we offer presets (part of the Multi-weight API) which use the low-level kernels and guarantee JIT-scriptability. You can see more about how this works in the unmerged PR add automatic feature type dispatch to functional transforms #5323.

vadimkantorov commented Feb 10, 2022

Maybe you already did it, but it would be very nice to have a public workshop / seminar on vision framework design / architecture. It would be very interesting to hear what authors of mmdetection / detectron2 / kornia / albumentations are thinking about their architecture choices and standing problems (especially related to augmentations). Microsoft's .NET community is doing a lot of streaming of their design meetings.

I've already ranted about it, so please feel free to ignore the following paragraph. Currently I have the impression that we have a bunch of frameworks with similar goals, but the end result for the user is that it's quite cumbersome to mix them. The frameworks are becoming quite coupled; it's hard to use / copy only some parts of a framework for modification without buying into the whole thing. I think torchvision is uniquely positioned to drive simplification and reuse efforts, as opposed to being yet another framework shaped by the immediate goals of its authors (supporting a company's research-to-production cycle, an in-lab framework for PhD students to work on together, etc.)

datumbox (Contributor) commented Feb 10, 2022
@vadimkantorov Sounds like a good topic for the next PyTorch dev day. Note that we already invite lots of creators from the ecosystem, so there are opportunities for such workshops.

Concerning coordination across frameworks, we engage with creators from the PyTorch eco at least once each Half, and we try to improve the visibility of what we do with direct engagements, blog posts and feedback requests. Though I agree we are not at an optimum level yet and can still improve, over the last year we put a lot of effort into improving visibility to the community on what the plan is, engaging with them early, and involving people more actively in the development and design. We acknowledge that we can do better and we plan to, so thanks for the feedback. A final key thing to note here is that not all libraries have the same focus or optimize for the same things. Some explicitly say "we don't focus on production", some focus on giving "cutting-edge implementations of the latest models", and TorchVision focuses on canonical and extendable implementations. This means that even though we do want to align as much as possible with other libraries, and potentially upstream things that duplicate effort, sometimes the creators of each library just decide to follow the best approach for their project. This is part of the beauty of open source and gives end users the option to choose the library that better matches their needs.

vadimkantorov commented Feb 10, 2022

That's why I'm not very keen on the subclassing approach - especially in the core library, and especially wrt core stuff like images, videos, bboxes, masks. I think the upstreaming of plumbing primitives (as started by the box_convert function and transforms.functional) should be done first: it's easier to design (and to deprecate if needed), it's simpler to get feedback on purely functional components, and it will already help replace a lot of code in those other frameworks (regardless of each framework's stated goals).
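
For reference, torchvision.ops.box_convert already exists as such a primitive; a standalone sketch of the same idea (this exact helper name is illustrative, not torchvision's API):

```python
import torch

def box_xyxy_to_cxcywh(boxes):
    """Convert (*, 4) boxes from corners (x1, y1, x2, y2) to centers
    (cx, cy, w, h). A free function on raw tensors: no wrapper class,
    trivial to copy into any codebase, trivial to deprecate if needed.
    """
    x1, y1, x2, y2 = boxes.unbind(-1)
    return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1], dim=-1)
```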

nairbv (Contributor) commented Feb 10, 2022

it's quite cumbersome to mix these frameworks ... it's hard to use / copy only some parts of the framework for modification without buying into the whole thing ... I think torchvision is uniquely positioned to drive simplification and reuse efforts

I think your comment gets at the motivation behind some of the questions I ask. E.g. I think it's significant that we don't force users to extend some kind of TorchvisionModule or LightningModule (though of course you can use torchvision models in lightning). Many vision frameworks already extend and build on top of torchvision, and we wouldn't want to inhibit that, so we need to be careful about our abstractions.

That also has to be balanced with what functionality we want to offer, though; e.g. it would be more convenient to support non-RGB images (YUV from videos) without extra parameters on every image operator.

vadimkantorov commented Feb 10, 2022
I was talking about data abstractions in the first place: images, videos, bboxes.

Module abstractions are another thing. They're also quite nasty. If you look at the detectron2 / mmdet aug pipelines, there are a lot of levels of indirection and inheritance (composition is less nasty); from the outside it's hard to understand what's going on, and hard to copy only one piece of code.

But the nastiness of subclasses for data abstractions is much higher IMO (and even more so when you mix code using different frameworks). For conversions, subclasses aren't mandatory; clear naming, well-written examples and documentation are sufficient IMO.

pmeier (Collaborator, Author) commented Feb 10, 2022

@vadimkantorov

I think we already have what you want. Citing @datumbox from #5045 (comment)

The Feature subclasses are not JIT-scriptable. To address this we have two types of kernels: the high-level ones that can make use of meta-data (not JIT-scriptable) and the low-level ones that actually perform the operations and receive the meta-data explicitly (JIT-scriptable)

The low-level functions will work with vanilla tensors, so you don't have to use the new abstractions if you don't want to. The tentative plan is to expose them from torchvision.transforms.kernels. Have a look at #5323 or the underlying branch https://github.com/pmeier/vision/tree/transforms/dispatch/torchvision/prototype/transforms/kernels.

vadimkantorov commented
Also, some subjective naming feedback: namespace names like transforms / kernels / ops are quite vague (or even vacuous) :( I wonder if vision functions could be grouped into namespaces according to their functionality, or just pulled up to the top namespace, like we have for core PyTorch (the .functional namespace is probably a remnant of the tensor/variable separation that used to exist, so some functions like sigmoid are now pulled up to the main torch namespace).

datumbox (Contributor) commented Feb 11, 2022

@vadimkantorov Naming is HARD and we can use some good ideas. :)

We are trying to prototype and iterate as quickly as possible to get to an initial good proposal. Many of the things you mentioned here should be addressed in the final plan. Once we have a more complete solution, we will create a feedback issue and ask for input from the community. We've done this before (see the Multi-weights API feedback issue) and we will make the necessary changes based on proposals from the community. So keep an eye out for that, but I'm also happy to ping you once it's out.

Rolling out such changes will be done gradually and only after we ensure we have a good solution.
