[Discussion] How do we want to handle `torchvision.prototype.features.Feature`'s? #5045
About naming: my only comment on the substance (which I had already voiced in other issues, and which seems to go against the decisions already made): please also create "plumbing" functions that accept raw tensors. This will benefit all users even if the abstractions aren't right or aren't flexible enough (a similar problem of not getting the abstractions right happened with LR schedulers, pytorch/pytorch#68332), and it will help users in their existing codebases.
I agree 100%, but so far no one (including me) has been able to come up with a better name. Do you have one?
Could you elaborate on that? Which functions are you talking about? The prototype transforms will accept vanilla tensors as inputs and treat them as images.
Any name, even a more obscure one, would be better than "feature": even VisionFeature in one word, VisionTrait, VisionModality; anything is better than features.Feature. I mean that to_colorspace and the functions for working with masks, bounding boxes etc. should also exist as free functions taking raw tensors.
In the current design these are not functions, but rather methods on the new classes. Since the method has access to the meta data, there is no point in allowing it to be passed as an additional keyword argument. One solution to support this use case is to always have a method / function pair. For example:

```python
class Image(torch.Tensor):
    ...

    @staticmethod
    def convert_colorspace(image: torch.Tensor, *, old_color_space, new_color_space) -> torch.Tensor:
        ...

    def to_colorspace(self, new_color_space) -> "Image":
        return Image(
            self.convert_colorspace(self, old_color_space=self.color_space, new_color_space=new_color_space),
            like=self,
            color_space=new_color_space,
        )
```

With this, both variants are supported:

```python
data = torch.rand(3, 256, 256)

# free-function style on a raw tensor
Image.convert_colorspace(data, old_color_space="rgb", new_color_space="hsv")

# method style on the wrapped tensor
Image(data, color_space="rgb").to_colorspace("hsv")
```
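The `like=self` argument in the snippet above implies a metadata-copying constructor. A minimal pure-Python sketch of that pattern (the names and the plain class are illustrative stand-ins, not the actual prototype implementation):

```python
# Minimal stand-in for the metadata-carrying wrapper pattern: `like=` copies
# meta data (here: color_space) from an existing instance unless the caller
# overrides it explicitly.
class Image:
    def __init__(self, data, *, color_space=None, like=None):
        self.data = data
        if like is not None and color_space is None:
            color_space = like.color_space
        self.color_space = color_space

a = Image([0.1, 0.2], color_space="rgb")
b = Image([0.3, 0.4], like=a)                     # inherits "rgb"
c = Image([0.5, 0.6], like=a, color_space="hsv")  # explicit value wins
print(b.color_space, c.color_space)  # rgb hsv
```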
Yep, in my opinion it's best to always have a free function, even as a @staticmethod.
My thought was naming |
Thanks for raising this discussion. Here are my thoughts on the questions:
I wouldn't overload the constructor to support things like
We are considering redesigning the transforms to use the dispatch from PyTorch, and to support torchscript we would need to support free functions with raw tensors. So @vadimkantorov's proposal would be covered in that scenario.
Maybe instead of providing methods we should only have free functions that take custom tensors as input. This way we don't stray too far from what raw Tensors already are. In that case, point 3 from the section "What are Feature's?" would maybe be removed.
I think each
My first thought would be to carry as little extra metadata as possible. In the end, each Feature is just a Tensor (and should be usable as a regular Tensor!), but with some small enhancements for better usability.
Let's think of this with respect to how functions should behave on Features. If our functions support multiple shapes (e.g., for batches), and only in those cases, then I would say it makes sense to impose some constraints on the dimensionality.
Most probably not? Features are Tensors. Users can decide what they want to do with them; the difference is that some operations behave differently on them compared to normal Tensors.
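A small pure-Python illustration of why batch-friendly shapes matter for the point above (illustrative only; real kernels would operate on tensors, and `hflip_boxes` is a hypothetical name): a function written against `(N, 4)` boxes processes the whole batch in one call instead of looping per box.

```python
def hflip_boxes(boxes, image_width):
    # boxes: list of [x1, y1, x2, y2] in "xyxy" convention.
    # Flips all N boxes in one pass; with real tensors this single call
    # would map to one vectorized kernel instead of N Python-level calls.
    return [[image_width - x2, y1, image_width - x1, y2]
            for x1, y1, x2, y2 in boxes]

print(hflip_boxes([[10, 0, 30, 20], [40, 5, 60, 25]], 100))
# [[70, 0, 90, 20], [40, 5, 60, 25]]
```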
Two other arguments to consider: there are already similar abstractions in mmdetection, detectron2 and other frameworks. They're unlikely to migrate to new abstractions (or at least not quickly or completely), but they probably can still reuse and call the "plumbing" functions accepting raw tensors with some format arguments, thus moving more common / utility functions into torchvision. For masks etc. there may be many representations, e.g. if bit tensors are implemented, or primitive-based ones that require some rendering (RLE-based, which already seems a prominent format for large datasets, polygon-rendering-based, or DCT-basis-based). So if you end up introducing classes, please name them specifically, not with a generic name: a more efficient representation may appear later, and then there would be a naming issue. This applies to both classes and "mode" / "format" arguments. Also, one replacement for abstractions could be very specific / standard argument names for raw tensors, e.g. model.forward(float_image_01=my_img). You can even make many arguments keyword-only, so they always have to be typed out. This will appear in IDE tooltips and will stick in people's minds.
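A sketch of that last point (the names `float_image_01` and `normalize` are hypothetical, not actual torchvision APIs): keyword-only parameters force the caller to spell out what the raw tensor represents.

```python
# Hypothetical plumbing function: the argument name itself documents the
# expected representation (float image, values in [0, 1]); keyword-only
# arguments (everything after the bare *) force callers to type that name out.
def normalize(*, float_image_01, mean, std):
    # Works on a per-channel list of values as a stand-in for a raw tensor.
    return [(channel_value - m) / s
            for channel_value, m, s in zip(float_image_01, mean, std)]

out = normalize(float_image_01=[0.5, 0.5, 0.5], mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(out)  # [0.0, 0.0, 0.0]

# Positional calls are rejected, so a caller can't silently pass an
# arbitrary tensor: normalize([0.5], [0.5], [0.5]) raises TypeError.
```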
Because of problems like this, IMO the image class abstraction / Feature is too leaky and not very useful :(
Re the name "Feature": aside from being an overloaded term, is it also incorrect? Usually a label is not a feature; I would think a Label subclassing Feature would imply self-supervision. I want to suggest "example", but aside from also being overloaded, I think that only works for a feature and label together. Here I think we need a term that can refer to either a feature or a label, but not both.
some other thoughts:
@nairbv Thanks for the feedback. I'll only touch on a few points from what you discuss and leave the rest to @pmeier, who is more familiar with why some of the choices were originally made:
Maybe you already did it, but it would be very nice to have a public workshop / seminar on vision framework design / architecture. It would be very interesting to hear what the authors of mmdetection / detectron2 / kornia / albumentations think about their architecture choices and outstanding problems (especially related to augmentations). Microsoft's .NET community does a lot of streaming of their design meetings.

I've already ranted about this, so please feel free to ignore the following paragraph. Currently my impression is that we have a bunch of frameworks with similar goals, but the end result for the user is that it's quite cumbersome to mix them. The frameworks are becoming quite coupled; it's hard to use or copy only some parts of a framework for modification without buying into the whole thing. I think torchvision is uniquely positioned to drive simplification and reuse efforts, as opposed to being yet another framework shaped by the immediate goals of its authors (supporting a company's research-to-production cycle, an in-lab framework for PhD students to work on together, etc.).
@vadimkantorov Sounds like a good topic for the next PyTorch dev day. Note that we already invite lots of creators from the ecosystem, so there are opportunities for such workshops. Concerning coordination across frameworks, we engage with creators from the PyTorch ecosystem at least once every half, and we try to improve the visibility of what we do with direct engagements, blog posts and feedback requests. Though I agree we are not at an optimal level yet and can still improve, over the last year we have put a lot of effort into improving visibility into our plans, engaging with the community early, and involving people more actively in development and design. We acknowledge that we can do better and we plan to, so thanks for the feedback.

A final key thing to note is that not all libraries have the same focus or optimize for the same things. Some explicitly say "we don't focus on production"; some focus on providing cutting-edge implementations of the latest models; TorchVision focuses on canonical and extendable implementations. This means that even though we do want to align as much as possible with other libraries, and potentially upstream things that duplicate effort, sometimes the creators of each library just decide to follow the best approach for their project. This is part of the beauty of open source, and it gives end users the option to choose the library that best matches their needs.
That's why I'm not very keen on the subclassing approach, especially in the core library and especially for core data like images, videos, bounding boxes and masks. I think the upstreaming of plumbing primitives (as started by the box_convert function and transforms.functional) should be done first: it's easier to design (and to deprecate if needed), it's simpler to get feedback on purely functional components, and it would already help replace a lot of code in those other frameworks (regardless of each framework's stated goals).
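A minimal sketch of the "plumbing first" idea: a free conversion function that works on plain coordinate data and knows nothing about wrapper classes. torchvision's real `torchvision.ops.box_convert` plays this role for tensors; the list-based `convert_box` helper below is only an illustration.

```python
# Illustrative plumbing function: converts box coordinates between the
# "xyxy" (x1, y1, x2, y2) and "xywh" (x, y, width, height) conventions.
# It takes plain sequences, so any framework can call it without adopting
# a BoundingBox wrapper class.
def convert_box(box, in_fmt, out_fmt):
    if in_fmt == out_fmt:
        return list(box)
    if (in_fmt, out_fmt) == ("xyxy", "xywh"):
        x1, y1, x2, y2 = box
        return [x1, y1, x2 - x1, y2 - y1]
    if (in_fmt, out_fmt) == ("xywh", "xyxy"):
        x, y, w, h = box
        return [x, y, x + w, y + h]
    raise ValueError(f"unsupported conversion: {in_fmt} -> {out_fmt}")

print(convert_box([10, 20, 50, 80], "xyxy", "xywh"))  # [10, 20, 40, 60]
```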
I think your comment gets at the motivation behind some of the questions I asked. For example, I think it's significant that we don't force users to extend some kind of TorchvisionModule or LightningModule (though of course you can use torchvision models in Lightning). Many vision frameworks already extend and build on top of torchvision, and we wouldn't want to inhibit that, so we need to be careful about our abstractions. That has to be balanced against the functionality we want to offer, though; e.g. it would be more convenient to support non-RGB images (YUV from videos) without extra parameters on every image operator.
I was talking about data abstractions in the first place: images, videos, bounding boxes. Module abstractions are another thing, and also quite nasty. If you look at the detectron2 / mmdet augmentation pipelines, there are many levels of indirection and inheritance (composition is less nasty); from the outside it's hard to understand what's going on, and hard to copy just one piece of code. But the nastiness of subclasses for data abstractions is much higher IMO (and even more so once you mix code using different frameworks). For conversions, subclasses aren't mandatory; clear naming, well-written examples and documentation are sufficient IMO.
I think we already have what you want. Citing @datumbox from #5045 (comment):
The low-level functions will work with vanilla tensors, so you don't have to use the new abstractions if you don't want to. The tentative plan is to expose them from
Also, some subjective naming feedback: namespace names like transforms / kernels / ops are quite vague (or even vacuous) :( I wonder if vision functions could be grouped into namespaces according to their functionality, or just pulled up to the top namespace, like we have for core PyTorch.
@vadimkantorov Naming is HARD and we can use some good ideas. :) We are trying to prototype and iterate as quickly as possible to get to an initial good proposal. Many of the things you mention here should be addressed in the final plan. Once we have a more complete solution, we will create a feedback issue and ask for input from the community. We've done this before (see the multi-weights API feedback issue) and we will make the necessary changes based on proposals from the community. So keep an eye out for that; I'm also happy to ping you once it's out. Rolling out such changes will be done gradually and only after we ensure we have a good solution.
This issue should spark a discussion about how we want to handle `Feature`'s in the future. There are a lot of open questions, which I'll try to summarize, giving my opinion on each of them. You can find the current implementation under `torchvision.prototype.features`.

What are `Feature`'s?

`Feature`'s are subclasses of `torch.Tensor`, and their purpose is threefold. In particular, apart from the raw data, `Feature`'s such as `Image` carry meta information about the data. The prototype transformations (`torchvision.prototype.transforms`) use this information to automatically dispatch an input to the correct kernel.

There are currently three `Feature`'s implemented, `Image`, `BoundingBox`, and `Label`, but in the future we should add at least three more: `SemanticSegmentationMask`, `InstanceSegementationMask`, and `Video`.

What is the policy for adding new `Feature`'s?

We could allow subclassing of `Feature`'s. On the one hand, this would make it easier for datasets to conveniently bundle meta data. For example, the COCO dataset could return a `CocoLabel`, which in addition to the default `Label.category` could also have a `super_category` field. On the other hand, this would also mean that the transforms need to handle subclasses of features well; for example, a `CocoLabel` could be treated the same as a `Label`. I see two downsides with that: for instance, a transform could come to rely on `CocoLabel.super_category`; although on the surface it then supports plain `Label`'s, it will fail at runtime.

Thus, I'm leaning towards only having a few base classes.
From what data should a `Feature` be instantiable?

Some of the features, like `Image` or `Video`, have non-tensor objects that carry the data. Should these features know how to handle them? For example, should something like `Image(PIL.Image.open(...))` work?

My vote is for yes. IMO this is very convenient and also not an unexpected semantic compared to passing the data directly, e.g. `Image(torch.rand(3, 256, 256))`.
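A sketch of how a constructor could dispatch on the input type. This is pure Python with a duck-typed stand-in for a PIL image; the names (`to_raw_data`, `FakePilImage`) are illustrative, not the actual prototype implementation:

```python
class Image:
    def __init__(self, data):
        # Accept either raw data or an object that knows how to convert
        # itself (a stand-in for handling e.g. PIL.Image inputs).
        if hasattr(data, "to_raw_data"):
            data = data.to_raw_data()
        self.data = data

class FakePilImage:
    """Duck-typed stand-in for PIL.Image.Image."""
    def to_raw_data(self):
        return [0.0, 0.5, 1.0]

print(Image(FakePilImage()).data)  # [0.0, 0.5, 1.0]
print(Image([1.0, 2.0]).data)      # [1.0, 2.0]
```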
Should `Feature`'s have a fixed shape?

Consider the following table:

| `Feature` | `.shape` |
| --- | --- |
| `Image` | `(*, C, H, W)` |
| `Label` | `(*)` |
| `BoundingBox` | `(*, 4)` |
| `SemanticSegmentationMask` | `(*, H, W)` or `(*, C, H, W)` |
| `InstanceSegementationMask` | `(*, N, H, W)` |
| `Video` | `(*, T, C, H, W)` |

(For `SemanticSegmentationMask` I'm not sure about the shape yet. Having an extra channel dimension makes the tensor unnecessarily large, but it aligns well with segmentation image files, which are usually stored as RGB.)

Should we fix the shape of a single feature, i.e. remove the `*` from the table above, or should we only require the shape of the last dimensions to be correct?

My vote is for a flexible shape, since otherwise batching is not possible. For example, if we fix bounding boxes to shape `(4,)`, a transformation would need to transform `N` bounding boxes individually, while for shape `(N, 4)` it could make use of parallelism.

On the same note, if we go for the flexible shape, do we keep the singular name of the feature? For example, do we still regard a batch of images with shape `(B, C, H, W)` as an `Image`, or should we go for the plural `Images` in general? My vote is for always keeping the singular, since I've often seen something like:

Should `Feature`'s have a fixed dtype?

This makes sense for `InstanceSegementationMask`, which should always be `torch.bool`. For all the other features I'm unsure. My gut says to use a default dtype, but also allow other dtypes.

What meta data should `Feature`'s carry?

IMO, this really depends on the decision above about fixed / flexible shapes. If we go for fixed shapes, a feature can basically carry any information. If we go for flexible shapes instead, we should only have meta data that is the same across batched features. For example, `BoundingBox.image_size` is fine, but `Label.category` is not.

What methods should `Feature`'s provide?

For now I've only included typical conversion methods, but of course this is not exhaustive.

| `Feature` | methods |
| --- | --- |
| `Image` | `.to_dtype()`, `.to_colorspace()` |
| `Label` | `.to_str()` |
| `BoundingBox` | `.to_format()` |
| `InstanceSegementationMask` | `.to_semantic()` |

cc @bjuncek