[NOMRG] TransformsV2 questions / comments #7092
Conversation
As discussed offline with @pmeier, we should bring back the methods in vision/references/segmentation/transforms.py, lines 54 to 64 (at 8985b59).
In V2 this is all seamlessly supported, but in order to minimize adoption friction for the new transforms, it might be a good idea to keep those methods around (as deprecated). Another use-case was transforming different batches (available at different times) with the same RNG, as in #3001 (comment). This is something that should be supported in a much better way in the future, once there is more fine-grained RNG support (#7027). For now, this can still be supported if we bring back the ...
I'm going to send a draft PR as a basis for discussion of how I envision that we could reinstate them.
format: BoundingBoxFormat
format: BoundingBoxFormat  # TODO: do not use a builtin?
# TODO: This is the size of the image, not the box. Maybe make this explicit in the name?
# Note: if this isn't user-facing, the TODO is not critical at all
It is user facing. In general, the metadata of the datapoints is considered public.
spatial_size was renamed from image_size once we added support for videos, if I recall correctly (#6736). But I agree it can be unclear whether spatial_size refers to the bbox or something else...
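For illustration, a minimal sketch of the metadata in question, assuming the prototype constructor as it stands around this PR (the import path and exact signature are assumptions):

```python
import torch
from torchvision.prototype import datapoints  # namespace assumed for this sketch

# spatial_size is the (H, W) of the image the boxes live in, not of the boxes
# themselves, which is exactly the naming concern raised above.
boxes = datapoints.BoundingBox(
    torch.tensor([[10.0, 20.0, 50.0, 80.0]]),
    format=datapoints.BoundingBoxFormat.XYXY,
    spatial_size=(480, 640),  # height/width of the underlying image
)
```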
@@ -268,4 +302,4 @@ def gaussian_blur(self, kernel_size: List[int], sigma: Optional[List[float]] = N
InputType = Union[torch.Tensor, PIL.Image.Image, Datapoint]
InputTypeJIT = torch.Tensor
InputTypeJIT = torch.Tensor  # why alias it?
To have an easier time looking up what the actual type should be. Meaning, if you see *JIT as an annotation, you can simply look up * to see what the actual type should be. If we used InputType and torch.Tensor, this relation is gone.
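A minimal sketch of the naming convention being described (the Datapoint import path is an assumption; the aliases themselves are the ones from the diff above):

```python
from typing import Union

import PIL.Image
import torch

from torchvision.prototype.datapoints import Datapoint  # import path assumed for this sketch

# Eager annotation: tensors, PIL images, and datapoint subclasses are all accepted.
InputType = Union[torch.Tensor, PIL.Image.Image, Datapoint]

# Scripted annotation: TorchScript only ever sees plain tensors, but keeping the
# *JIT suffix makes the pairing with InputType visible at a glance.
InputTypeJIT = torch.Tensor
```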
# Do we want to keep these public?
# This is probably related to internal customer needs. TODO for N: figure that out
# This is also related to allowing user-defined subclasses and transforms (in anticipation of)
In addition, resize is the most problematic one, since it actually overrides the (deprecated) tensor method with completely different behavior.
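To make the clash concrete, a small hedged illustration: the deprecated Tensor.resize (and its in-place sibling resize_) just changes the shape of the underlying storage with no interpolation, while the datapoint method of the same name does a spatial, interpolated resize (the datapoint signature below is assumed from the prototype API and left commented out):

```python
import torch

t = torch.rand(3, 8, 8)

# The tensor method: reinterprets storage with a new shape, no interpolation.
t.resize_(3, 4, 16)

# The datapoint method it shadows: an actual image resize with interpolation
# (sketched from the prototype API discussed in this PR, not imported here).
# datapoints.Image(torch.rand(3, 8, 8)).resize([16, 16])
```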
@@ -11,6 +11,12 @@
L = TypeVar("L", bound="_LabelBase")

# Do we have transforms that change the categories?
No
@@ -11,6 +11,12 @@
L = TypeVar("L", bound="_LabelBase")

# Do we have transforms that change the categories?
# Why do we need the labels to be datapoints?
Because some transformations need this information. Examples are:
class RandomMixup(_BaseMixupCutmix):
class RandomCutmix(_BaseMixupCutmix):
class SimpleCopyPaste(Transform):
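For context, a rough sketch of why a transform like mixup has to be able to tell the label apart from the rest of the sample (batch-level mixup, simplified; not the actual torchvision implementation):

```python
import torch

def mixup(images: torch.Tensor, one_hot_labels: torch.Tensor, lam: float = 0.7):
    # Mixup blends each sample with its neighbour in the batch, and the label
    # has to be blended with exactly the same coefficient. That is why the
    # transform needs to know which element of the sample is the label.
    mixed_images = lam * images + (1.0 - lam) * images.roll(1, 0)
    mixed_labels = lam * one_hot_labels + (1.0 - lam) * one_hot_labels.roll(1, 0)
    return mixed_images, mixed_labels
```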
# Saw this from the [Feedback] thread (https://github.com/pytorch/vision/issues/6753#issuecomment-1308295943)
# Not sure I understand the need to return a Tensor instead of an Image
# Is the inconsistency "worth" it? How can we make sure this isn't too unexpected?
The problem here is that most of our image kernels assume that an image is in the range [0, 1.0 if image.is_floating_point() else torch.iinfo(image.dtype).max], and this assumption is violated after normalization. The image is no longer an RGB image and we felt we needed to make this explicit. Of course any ops not related to color like crop will still work as before.
As for the impact, I feel this is fine. Normalization is usually one of the last steps, if not the last step, before something is passed into the model. Plus, since we have the tensor-to-image fallback, users can still use ops like crop afterwards without issues.
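Spelled out as a tiny hypothetical helper, the range assumption being described is:

```python
import torch

def assumed_value_range(image: torch.Tensor):
    # Float images are assumed to live in [0, 1]; integer images in
    # [0, dtype max]. Normalize breaks this assumption for float images.
    if image.is_floating_point():
        return 0.0, 1.0
    return 0, torch.iinfo(image.dtype).max
```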
Thanks for the details!
For me to better understand the trade-off, I'll try to find reasons to keep the output as an Image instead of a Tensor. LMK what you think.

we have the tensor to image fallback

I guess that also means one can do e.g. Compose(Normalize(), SomeColorTransform()), and while Normalize will output a float tensor, SomeColorTransform() will treat it as a 0-1 image? Basically, the unwrapping may not be preventing any bug?
I'm also wondering whether this might become obsolete / overkill if we remove the ColorSpace metadata, which as discussed previously is currently redundant with num_channels?

Of course any ops not related to color like crop will still work as before.

Do we have transforms that rely on the ColorSpace metadata right now? I could only find RandomPhotometricDistort (and it can probably do without)
I guess that also means one can do e.g. Compose(Normalize(), SomeColorTransform()), and while Normalize will output a float tensor, SomeColorTransform() will treat it as a 0-1 image? Basically, the unwrapping may not be preventing any bug?

Yes and yes. This is more of an idealistic choice. Value range checks are too expensive at runtime and thus we can only assume the range. Meaning, this unwrapping (or rather the absent re-wrapping) will basically have no effect at runtime.
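For concreteness, a hedged usage sketch of that point (the prototype namespace and transform names are assumptions; the second transform simply runs on the plain tensor coming out of Normalize via the tensor fallback):

```python
from torchvision.prototype import transforms  # namespace assumed for this sketch

pipeline = transforms.Compose([
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    # Runs on the plain tensor returned by Normalize, silently assuming the
    # usual [0, 1] range even though it no longer holds after normalization.
    transforms.ColorJitter(brightness=0.2),
])
```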
I'm also wondering whether this might become obsolete / overkill if we remove the ColorSpace metadata, which as discussed previously is currently redundant with num_channels?
I don't think so. datapoints.Image (and datapoints.Video) carry more implicit assumptions than just the public metadata. An Image instance implicitly communicates that its values are in the correct range, whereas a torch.Tensor is completely free. Of course, if one uses it as an image the same implicit assumptions apply, but this comes from the context rather than the type.
Do we have transforms that rely on the ColorSpace metadata right now? I could only find RandomPhotometricDistort (and it can probably do without)
For reference, vision/torchvision/prototype/transforms/_color.py, lines 127 to 128 at 93df9a5:

if isinstance(inpt, (datapoints.Image, datapoints.Video)):
    output = inpt.wrap_like(inpt, output, color_space=datapoints.ColorSpace.OTHER)  # type: ignore[arg-type]
First of all, it is an oversight that we are using two different idioms. Both Normalize and RandomPhotometricDistort should either return a torch.Tensor or an Image with color_space=ColorSpace.OTHER.
To answer the question: yes, it can do without. Apart from that, ConvertColorSpace also uses it, but we could also guess it from the channels. Although that would drop the guard we currently have against transforming arbitrary "images", i.e. the normalized ones, to a different color space. Whether this is a good thing is probably open for discussion.
Both Normalize and RandomPhotometricDistort should either return a torch.Tensor or an Image with color_space=ColorSpace.OTHER.
I think the reason for RandomPhotometricDistort to return an Image is that adjusting hue, saturation, etc. should not go outside the original color space (RGB -> RGB). However, Normalize returns a tensor due to the fact that a float Image is supposed to be between [0, 1], but the normalized output tensor can have any kind of range... So I think it is fine that they do not return the same type of object.
should not go outside the original color space (RGB -> RGB),
But then why set it to ColorSpace.OTHER instead of keeping it as ColorSpace.RGB?
It seems like there are 2 different concepts:
- The color-space of an image: RGB vs RGBA vs greyscale (... vs. YUV vs CMYK, which aren't supported in torchvision)
- The range of values, and the current assumption that it is [0, 1.0 if image.is_floating_point() else torch.iinfo(image.dtype).max]
Conceptually these are 2 different things to me. E.g. I would argue that an RGB image in [0-1] is still an RGB image even after it is normalized: the range assumption may be broken, but it's still an RGB image - just in a non-standard range. Not sure I'm convincing but (see the sketch after this list):
- normalize (out = (X - mean) / std) is "just" a generalization of convertImageDtype (out = X / 255), and convertImageDtype preserves the ColorSpace attribute.
- passing an RGB image to normalize changes its range, but it doesn't convert it to greyscale or RGBA or YUV / CMYK.
- Right now there's a 1:1 mapping between the ColorSpace and num_channels, so unless a transform changes num_channels there's no reason it should also change the color space.
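A small sketch of the first bullet using the stable functional ops (the equivalence is conceptual, up to dtype handling):

```python
import torch
from torchvision.transforms.functional import convert_image_dtype, normalize

img_u8 = torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8)

# convert_image_dtype is effectively out = X / 255 for uint8 -> float32 ...
scaled = convert_image_dtype(img_u8, torch.float32)

# ... which is the special case mean=0, std=255 of normalize's out = (X - mean) / std.
also_scaled = normalize(img_u8.float(), mean=[0.0, 0.0, 0.0], std=[255.0, 255.0, 255.0])

torch.testing.assert_close(scaled, also_scaled)
```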
So I think there's an argument to be made for returning Images instead of Tensors after normalize: I agree that the assumption about the range is broken, but a) this assumption isn't encoded in the ColorSpace anyway (it's encoded in the dtype), and b) returning a Tensor doesn't prevent any more issues than returning an Image.
Not sure if I'm being clear, but happy to chat more about this!
I agree that colorspace and data range are two different things, but there are relations between them, especially when we have to convert from one colorspace to another. For example, RGB to HSV is well-defined for [0, 255] and [0, 1] ranges, and the produced HSV image has predefined ranges for 8-bit, 16-bit or float32 dtypes (ref: the famous OpenCV link). So I think once an RGB image is normalized with mean, std, we can't convert it correctly to HSV space or to grayscale...

But then why set it to ColorSpace.OTHER instead of keeping it as ColorSpace.RGB?

Maybe due to the fact that we do not have BGR or GRB and other permutations of RGB...
It's definitely true that these 2 concepts start to overlap when we're talking about more color-spaces like HSV / CMYK / YUV etc. But these aren't supported in torchvision, and there is no immediate plan to support them either (I'm not saying never, but still). So at this point in time, the distinction may be somewhat premature and may lock us into a design that we might never actually need?
It's true that we do not have HSV and other colorspaces right now, and I agree that we probably won't support them anytime soon. However, we are already doing RGB to grayscale and adjusting hue/saturation by internally doing RGB to HSV conversion. I'm happy to return an Image from Normalize if it helps with API adoption etc., but we have to pay attention to output correctness and the potential issues this step could bring...
Yes, this is definitely valid; we should also make sure that returning an Image doesn't lock us in either. Do we foresee any hard issues with returning an Image?
Do we foresee any hard issues with returning an Image?
I don't. Like I stated above, this was an idealistic choice in the first place. Due to our fallback, it made no difference at runtime whatsoever.
# Also this seems like a subset of query_chw():
# - can we just have query_chw()?
# - if not, can we share code between the 2?
TBH I lost track of what exactly the status is here. We have quite a few helpers for this and I'm not sure of their state. Worth looking into.
# (This comment applies to the other query_() utils):
# Does this happen?
# What can go terribly wrong if we don't raise an error?
# Should we just document that this returns the size of the first inpt to avoid an error?
Does this happen?
Hopefully not. If it does, the input data is set up incorrectly. For example, two images of different shapes in the same sample, or a bounding_box.spatial_size attribute out of sync with the corresponding image.
What can go terribly wrong if we don't raise an error?
The first problem is that we would need to decide which value to return. sizes.pop() is not deterministic, so that would be quite bad.
Even if we can decide on a strategy to return a value, best case scenario is that the transform that requested the size fails loudly, albeit with a less expressive error message, when trying to transform an item that doesn't match the returned value. Worst case, the transform silently passes on bad input data.
The first problem is that we would need to decide which value to return. sizes.pop() is not deterministic, so that would be quite bad.
If we decide not to check for duplicates, we don't need to create a set, so we don't need to pop(), and we can return the size of the first input, which is deterministic.
My first instinct would be to be defensive and check for duplicates as done in the current code. I wonder, however, if there's value in applying the "define errors out of existence" principle here. As you mentioned, the main risk is that the transform silently returns wrong results - although it's not clear to me yet in which specific scenario that would happen.
(I'm not advocating for either option at this point, I'm merely trying to figure out the tradeoffs)
Do we have a sense of the performance hit incurred by the duplicate checks? If this gets called on large-ish batches (256 samples) with multiple inputs (images, masks, bbox, etc.), does it become noticeable?
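To make the two options concrete, a simplified sketch (the helper name and body are assumptions; the real utility also understands bounding boxes, masks, PIL images, etc., and sequence_to_str is a torchvision-internal formatting helper):

```python
from typing import Any, List, Tuple

import torch


def query_spatial_size(flat_inputs: List[Any]) -> Tuple[int, int]:
    # Simplified: only looks at plain tensors.
    sizes = {tuple(inpt.shape[-2:]) for inpt in flat_inputs if isinstance(inpt, torch.Tensor)}
    if not sizes:
        raise TypeError("No image or video was found in the sample")
    if len(sizes) > 1:
        # Defensive variant discussed above: fail loudly on inconsistent sizes ...
        raise ValueError(f"Found multiple HxW dimensions in the sample: {sorted(sizes)}")
    # ... the alternative would be to skip the check entirely and
    # deterministically return the size of the first input instead.
    h, w = sizes.pop()
    return h, w
```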
Do we have a sense of the performance hit incurred by the duplicate checks?
Nope. We only benchmarked on our references, and there we didn't notice anything.
Let's keep it this way then, thanks
raise ValueError(f"Found multiple HxW dimensions in the sample: {sequence_to_str(sorted(sizes))}") | ||
h, w = sizes.pop() | ||
return h, w | ||
|
||
|
||
# when can types_or_checks be a callable? |
is_simple_tensor was the motivating example here.
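A hedged sketch of what that looks like in practice (is_simple_tensor is simplified here; the real helper also excludes the datapoint subclasses):

```python
import PIL.Image
import torch


def is_simple_tensor(obj) -> bool:
    # Simplified stand-in: a plain torch.Tensor that is not a datapoint subclass.
    return type(obj) is torch.Tensor


# An entry in types_or_checks can therefore be a type (matched via isinstance)
# or a callable like is_simple_tensor (simply invoked on the object), which is
# what the branching in the quoted code below handles.
types_or_checks = (PIL.Image.Image, is_simple_tensor)
```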
        if isinstance(obj, type_or_check) if isinstance(type_or_check, type) else type_or_check(obj):
            return True
    return False
# Are these public because they are needed for users to implement custom transforms? |
They are not strictly needed, but they make the input checking a lot easier. If users come and look at our source to see how we do it, we should give them the same tools.
@@ -94,7 +94,7 @@ def _get_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
def _transform(
    self, inpt: Union[datapoints.ImageType, datapoints.VideoType], params: Dict[str, Any]
) -> Union[datapoints.ImageType, datapoints.VideoType]:
    if params["v"] is not None:
    if params["v"] is not None:  # What is this?
It's a value tensor or None used to erase the image
Basically it is the replacement value that is put in the "erased" area. In v1, in case we didn't find an area to erase, we return the bounding box of the whole image as well as the image itself:

vision/torchvision/transforms/transforms.py, line 1702 at 01d138d:
return 0, 0, img_h, img_w, img
With that, we call F.erase unconditionally, which ultimately leads to replacing every value in the original image with itself:

vision/torchvision/transforms/functional_tensor.py, lines 928 to 932 at 01d138d:
if not inplace:
    img = img.clone()
img[..., i : i + h, j : j + w] = v
return img
Since that is quite nonsensical, we opted to also allow None as a return value and use it as a sentinel to do nothing. I think the previous implementation came from a time when JIT didn't support Union (or Optional, for that matter) and thus we couldn't return Optional[torch.Tensor].
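Put differently, the v2 pattern being described is roughly the following (a sketch, written as a hypothetical free function; the stable functional erase is used here, and the v2 dispatcher is assumed to take the same arguments):

```python
from typing import Any, Dict

import torch
import torchvision.transforms.functional as F  # stable namespace


def maybe_erase(inpt: torch.Tensor, params: Dict[str, Any], inplace: bool = False) -> torch.Tensor:
    # params["v"] is the replacement tensor for the region (i, j, h, w), or
    # None as a sentinel meaning "no suitable region was found, do nothing".
    if params["v"] is not None:
        return F.erase(inpt, **params, inplace=inplace)
    return inpt
```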
# is "feature" now "datapoint" - or is it something else? | ||
# A: yes, TODO: update name |
See #7117.
@@ -41,13 +43,17 @@ def __new__(
tensor = cls._to_tensor(data, dtype=dtype, device=device, requires_grad=requires_grad)
return tensor.as_subclass(Datapoint)

# Is this still needed, considering we won't be releasing the prototype datasets anytime soon?
See #7154.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
Closing - we'll keep track of the progress in other places like #7217 (possibly coming back to this PR)
Just writing down my thoughts / questions below. Some of the points are already addressed / tracked in #7082.