
Adds normalize parameter to ToTensor operation #2060

Closed
wants to merge 3 commits into from

Conversation

@xksteven (Contributor) commented Apr 3, 2020

The division by 255 should be an optional step when converting to a PyTorch tensor. The parameter should default to True, however, to preserve backwards compatibility.
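For illustration, usage under this PR's proposed API would look like this (a minimal sketch; normalize is the parameter added by this PR and does not exist in released torchvision):

import numpy as np
from PIL import Image
from torchvision import transforms

img = Image.fromarray(np.arange(0, 256, dtype=np.uint8).reshape(16, 16))

# Proposed: keep the raw pixel values instead of scaling to [0.0, 1.0]
raw = transforms.ToTensor(normalize=False)(img)

# Default (backwards compatible): scale to [0.0, 1.0] as before
scaled = transforms.ToTensor()(img)  # normalize=True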

xksteven added 3 commits April 3, 2020 09:08
Adds normalize parameter to totensor defaults to true for backwards compatibility.
Adds normalize parameter to ToTensor defaults to true for backwards compatibility.
@pmeier (Collaborator) commented Apr 3, 2020

Hi @xksteven, I disagree with your premise:

The division by 255 should be an optional step

The native image type of torchvision is a torch.Tensor with dtype==torch.float. As is normal for floating-point pixels, the values should lie in the interval [0.0, 1.0]. Thus, the division should not be optional.

One could then ask why we use values from this interval and not [0.0, 255.0]. Since all real closed intervals (ignoring some edge cases that are not important here) can be mapped onto each other, it makes no difference which interval is used. Thus, I think it is reasonable to use the simplest of them: [0.0, 1.0].
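(Concretely, any closed interval [a, b] maps onto [c, d] via the affine map x ↦ c + (d − c) · (x − a) / (b − a).)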

Since 255.0 as an upper limit is just a remnant of the maximum value of images stored with a color depth of 1 byte, I think it's fair to ask why you want to use this interval.


If you want to use [0.0, 255.0] as default value range you could simply create a custom transform:

class FloatToUint8ValueRange:
    """Scales a float tensor from [0.0, 1.0] back to [0.0, 255.0]."""
    def __call__(self, img):
        return img * 255.0

and bundle it after ToTensor:

from torchvision import transforms

transform = transforms.Compose([transforms.ToTensor(), FloatToUint8ValueRange()])

If we want to include this, I suggest that we at least change the name of the parameter, since normalize here could easily be confused with z-score normalization.

@xksteven (Contributor, Author) commented Apr 3, 2020

Hello @pmeier,

One particular use case I was dealing with is loading semantic segmentation maps, which are conveniently stored as images and can be loaded nicely as labels with this transform, except for the division by 255.

I thought that making the division optional would allow loading these and other types of images whose maximum value is not necessarily 255. The parameter would be more convenient than forcing users to add another transform.

I agree that the use of the word normalize could be confusing. I thought the word "standardize" would be worse. Do you have an idea of what it should be called?

Thanks for your comments!

@pmeier (Collaborator) commented Apr 3, 2020

One particular use case I was dealing with is loading semantic segmentation maps, which are conveniently stored as images and can be loaded nicely as labels with this transform, except for the division by 255.

I'm not sure I understood you correctly here, but I fail to see why the segmentation maps have to be in the interval [0.0, 255.0]. What are you doing with them that is not possible, or at least significantly harder, with maps in the interval [0.0, 1.0]? Could you elaborate?

I thought that making the division optional would allow loading these and other types of images whose maximum value is not necessarily 255. The parameter would be more convenient than forcing users to add another transform.

From this PR I gather that you load the images with PIL and pass them through the ToTensor() transform to get a torch.Tensor. Couldn't you simply use the composed transform from my comment above instead? I acknowledge that, in theory, it is not a good idea to first divide by a value and then simply multiply by that value again in the next step. Still, I can't imagine a scenario where this could become a bottleneck.

I agree that the use of the word normalize could be confusing. I thought the word "standardize" would be worse. Do you have an idea of what it should be called?

If I had to, and I'm not in favor of that, I would go with something like upper_value_limit=1.0 and simply multiply by that factor once the image is in the interval [0.0, 1.0]. For your case you could then set upper_value_limit=255.0.
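A minimal sketch of that idea (upper_value_limit here is hypothetical, not an existing torchvision parameter):

from torchvision.transforms import functional as F

def to_tensor_with_limit(img, upper_value_limit=1.0):
    # First bring the image into [0.0, 1.0] as ToTensor does,
    # then rescale to [0.0, upper_value_limit].
    return F.to_tensor(img) * upper_value_limit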

@xksteven (Contributor, Author) commented Apr 4, 2020

I'm not sure I understood you correctly here, but I fail to see why the segmentation maps have to be in the interval [0.0, 255.0]. What are you doing with them that is not possible, or at least significantly harder, with maps in the interval [0.0, 1.0]? Could you elaborate?

The semantic segmentation maps are themselves the per-pixel labels: each pixel value represents a class encoded by an integer value 0, 1, 2, ..., 255. They can be loaded and then passed directly into the loss function (in some cases the values might need to be converted to type long first).
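For illustration, a minimal sketch of that workflow (shapes and values are made up):

import numpy as np
import torch
from PIL import Image

# Stand-in for a segmentation mask loaded from disk: one class id (0..255) per pixel
mask = Image.fromarray(np.random.randint(0, 256, (4, 4), dtype=np.uint8))

# Keep the raw class ids -- no division by 255 -- and convert to long for the loss
target = torch.from_numpy(np.array(mask)).long()  # shape (H, W)

logits = torch.randn(1, 256, *target.shape)  # dummy per-class scores (N, C, H, W)
loss = torch.nn.functional.cross_entropy(logits, target.unsqueeze(0))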

From this PR I gather that you load the images with PIL and pass them through the ToTensor() transform to get a torch.Tensor. Couldn't you simply use the composed transform from my comment above instead? I acknowledge that, in theory, it is not a good idea to first divide by a value and then simply multiply by that value again in the next step. Still, I can't imagine a scenario where this could become a bottleneck.

Honestly, I agree. This is a very small pull request that introduces minimal changes which I personally would like, but which are perhaps suboptimal. I can't imagine it being a bottleneck; I just thought it would be better to make what the ToTensor operation does slightly more transparent. It felt weird to me that this one function was doing several things at once: not only was it converting the images into a PyTorch Tensor, it was also normalizing them to the [0.0, 1.0] range. The main reason it felt weird is that the name ToTensor doesn't imply the normalization; you have to look at the documentation for that part to become apparent. I thought it might be useful to keep the first functionality (converting images to Tensors) and make the second (the normalization) optional.

If the changes are too small and inconvenient, I'm open to just closing the pull request. (I accidentally clicked close earlier.)

@xksteven closed this Apr 4, 2020
@xksteven reopened this Apr 4, 2020
@codecov-io

Codecov Report

Merging #2060 into master will decrease coverage by <.01%.
The diff coverage is 0%.


@@            Coverage Diff            @@
##           master   #2060      +/-   ##
=========================================
- Coverage    0.48%   0.48%   -0.01%     
=========================================
  Files          92      92              
  Lines        7411    7415       +4     
  Branches     1128    1130       +2     
=========================================
  Hits           36      36              
- Misses       7362    7366       +4     
  Partials       13      13
Impacted Files                         Coverage Δ
torchvision/transforms/transforms.py   0% <0%> (ø) ⬆️
torchvision/transforms/functional.py   0% <0%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ed2fa3...cd6e66f.

@pmeier (Collaborator) commented Apr 4, 2020

The semantic segmentation maps are themselves the per-pixel labels: each pixel value represents a class encoded by an integer value 0, 1, 2, ..., 255. They can be loaded and then passed directly into the loss function (in some cases the values might need to be converted to type long first).

If you need to convert them to torch.long anyway, can't you just do the conversion to [0.0, 255.0] there? If so, you do not need a special transformation for the segmentation masks during loading.
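In code, assuming the mask already went through ToTensor() and was scaled to [0.0, 1.0], recovering the class ids at conversion time is a one-liner (sketch):

import torch

tensor = torch.rand(4, 4)               # stand-in for a mask scaled by ToTensor()
target = (tensor * 255).round().long()  # back to integer class ids 0..255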

[...] I just thought it would be better to make what the ToTensor operation does slightly more transparent. It felt weird to me that this one function was doing several things at once: not only was it converting the images into a PyTorch Tensor, it was also normalizing them to the [0.0, 1.0] range. The main reason it felt weird is that the name ToTensor doesn't imply the normalization; you have to look at the documentation for that part to become apparent.

I think right now we aren't communicating the native image type of torchvision very well, since questions about it (or a PR, for that matter) keep popping up from time to time. Do you have an idea how to improve that?

If the changes are too small and inconvenient, I'm open to just closing the pull request.

Don't give up yet: although I'm in favor of closing this PR, I have no say here. We'll have to wait for a member of PyTorch to drop in and settle this.

@xksteven (Contributor, Author) commented Apr 5, 2020

I think right now we aren't communicating the native image type of torchvision very well since questions about it (or a PR for that matter) keep popping up from time to time. Do you have an idea how to improve that?

Maybe there could be an additional parameter controlling the resulting output tensor? Alternatively, the ToTensor operation could stop automatically changing the default input types (int, byte, double) into float. I'll see if I can think of other approaches. If one does

transforms.Compose([ToTensor(), ToPILImage()])

the information about the mode is lost in the process, and it does not return the same PIL Image one started with. I believe it may even result in an error in some cases (although I'd have to run more tests to confirm that).
It would be nice if those two operations composed to the identity, making it easy to convert from images to tensors and vice versa. What do you think?
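A minimal sketch of the mode loss (behavior as of torchvision around 0.5; a float mode-"F" image comes back as a rescaled uint8 mode-"L" image):

import numpy as np
from PIL import Image
from torchvision import transforms

img = Image.fromarray(np.random.rand(4, 4).astype(np.float32), mode="F")
roundtrip = transforms.ToPILImage()(transforms.ToTensor()(img))
print(img.mode, roundtrip.mode)  # "F" vs. "L": the original mode is lost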

@fmassa (Member) commented Apr 7, 2020

Hi,

Thanks for the PR!

I agree with @pmeier's comments, and this is something that has come up a few times already. The underlying issue is that ToTensor has too many implicit assumptions and can lead to a number of surprising behaviors; see #546 and #856 for a few examples.

I think we should re-design this abstraction altogether, so that the contract is clear from the beginning.

I propose to split ToTensor in two:

  • the as_tensor part, which just converts a PIL Image into a torch.Tensor of the original dtype
  • the casting part, which could follow TensorFlow's convert_image_dtype.

This way, the semantics of the functions are clear, and there are no surprises for the users.
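A rough sketch of what the split might look like (hypothetical signatures; neither function existed in torchvision at the time, and only the integer-to-float path of the casting part is shown):

import numpy as np
import torch

def as_tensor(pic):
    # Keep the original dtype: uint8 for mode "L"/"RGB", float32 for mode "F", etc.
    return torch.from_numpy(np.array(pic))

def convert_image_dtype(image, dtype=torch.float32):
    # Follow TensorFlow's convert_image_dtype: integer inputs are scaled
    # from [0, MAX] of their dtype to [0.0, 1.0] in the float output.
    if image.is_floating_point():
        return image.to(dtype)
    return image.to(dtype) / torch.iinfo(image.dtype).max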

Also, when the image reading PRs are merged into torchvision, the as_tensor part will become optional, and only convert_image_dtype will be necessary.

Thoughts?

@pmeier (Collaborator) commented Apr 7, 2020

  1. So as_tensor would keep the original dtype, correct? For example, a "normal" PIL image would result in dtype==torch.uint8, while an image with mode F would be dtype==torch.float32 right away.
  2. convert_image_dtype would take in a tensor of arbitrary dtype and return dtype==torch.float32 with values from [0.0, 1.0], correct?

@xksteven (Contributor, Author) commented Apr 7, 2020

  1. So as_tensor would keep the original dtype, correct? For example, a "normal" PIL image would result in dtype==torch.uint8, while an image with mode F would be dtype==torch.float32 right away.

I like this approach.

  2. convert_image_dtype would take in a tensor of arbitrary dtype and return dtype==torch.float32 with values from [0.0, 1.0], correct?

I think the convert_image_dtype should not also do the normalization. It should only do the potential scaling from the max of one dtype to the max of another and, as TensorFlow does, add a parameter called saturate for users who wish to avoid overflow or underflow problems (which occur, for example, when converting from max float to max int).
I really don't like the implicit scaling of values without any warning to the user that the normalization is occurring.
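For reference, a minimal sketch of a saturating cast (the helper name is made up):

import torch

def saturate_cast(image, dtype):
    # Clamp to the target dtype's representable range before casting,
    # so out-of-range values saturate instead of wrapping around.
    info = torch.finfo(dtype) if dtype.is_floating_point else torch.iinfo(dtype)
    return image.clamp(info.min, info.max).to(dtype)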

I can work on adding the as_tensor code in a separate pull request and can close this pull request whenever. I don't think this PR fully addresses the core issues anyway. I think convert_image_dtype will be a bit more involved coding-wise, but I'd be happy to take a look after I or someone else finishes the first one. I think we'd need a few helper functions to deal with the saturating cast, and I'm not sure where best to put them.

@pmeier (Collaborator) commented Apr 7, 2020

I think the convert_image_dtype should not also do the normalization.

If as_tensor and convert_image_dtype are to replace to_tensor, one of them has to implement the scaling. So if you want a conversion from dtype==torch.uint8 to dtype==torch.float32, you will get values from [0.0, 1.0] even if we split the functionality of to_tensor. If I read it correctly, TensorFlow does the same.

If you just want to change the dtype without any scaling, just call .to(dtype) on your image.
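That is, the two conversions differ only in the scaling (a minimal illustration):

import torch

img = torch.tensor([0, 128, 255], dtype=torch.uint8)

plain = img.to(torch.float32)           # values stay 0.0, 128.0, 255.0
scaled = img.to(torch.float32) / 255.0  # what convert_image_dtype would yield: [0.0, 1.0]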

@xksteven (Contributor, Author) commented Apr 7, 2020

If as_tensor and convert_image_dtype are to replace to_tensor, one of them has to implement the scaling. So if you want a conversion from dtype==torch.uint8 to dtype==torch.float32, you will get values from [0.0, 1.0] even if we split the functionality of to_tensor. If I read it correctly, TensorFlow does the same.

If you just want to change the dtype without any scaling, just call .to(dtype) on your image.

You're right, they do the scaling. From their documentation it was not apparent to me that they were doing it, because MAX was described as the largest positive representable number:

Image data stored in integer data types are expected to have values in the range [0, MAX], where MAX is the largest positive representable number for the data type.

I'll withdraw my criticism, as it appears everyone else does the conversion and normalization, and the as_tensor function would handle my own use cases as well.

@fmassa (Member) commented Apr 8, 2020

@pmeier Correct about your first point. One thing to consider is whether we want to do the transposition from HWC to CHW in as_tensor. I think we should, but with the memory_layout information added so that there is no copy.
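For reference, the HWC-to-CHW transposition itself can already be expressed without a copy, since permute returns a view (a minimal illustration; whether as_tensor should do this implicitly is the open question):

import torch

hwc = torch.zeros(224, 224, 3, dtype=torch.uint8)  # H x W x C, as PIL/numpy store it
chw = hwc.permute(2, 0, 1)                         # C x H x W view; no data is copied
assert chw.data_ptr() == hwc.data_ptr()            # same underlying storage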

About your second point:

convert_image_dtype would take in a tensor of arbitrary dtype and return dtype==torch.float32 with values from [0.0, 1.0], correct?

Partially: the user can specify whether to convert to a float type (the default) or to a different type.

I would be happy to review PRs addressing those two points. As for ToTensor, I would leave it as is and mark it as deprecated. Maybe we should also revisit ToPILImage along the way.

@xksteven (Contributor, Author) commented

I'll close the pull request and begin working on the new function :)

@xksteven closed this Apr 10, 2020
@fmassa (Member) commented Apr 10, 2020

Thanks @xksteven!

Just a heads-up that @pmeier is already working on convert_image_dtype in #2078.
