-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Type cast before normalize #18694
Type cast before normalize #18694
Conversation
This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when flags like are False. If some transformations aren't applied, then the output type can't be unexpected e.g. a list of PIL images instead of numpy arrays.
…th different configs
…th different configs
…st-before-normalize-videomae
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I just have a few comments about image_utils.py
as it undoes your recent PR
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std): | |||
|
|||
return (video - mean[None, :, None, None]) / std[None, :, None, None] | |||
|
|||
def to_numpy_array_video(self, video, rescale=None, channel_first=True): | |||
return [self.to_numpy_array(frame, rescale, channel_first) for frame in video] | |||
|
|||
def __call__( | |||
self, videos: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry a bit unrelated to the PR, but why do we have **kwargs
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I'm honestly not sure. cc @NielsRogge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, seems like **kwargs
is just leftover code.
@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std): | |||
|
|||
return (video - mean[None, :, None, None]) / std[None, :, None, None] | |||
|
|||
def to_numpy_array_video(self, video, rescale=None, channel_first=True): | |||
return [self.to_numpy_array(frame, rescale, channel_first) for frame in video] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a bit unrelated: slightly confusing to me that to_numpy_array
does things like rescaling and transposing, e.g. I wouldn't have expected:
image = image.transpose(2, 0, 1)
or
image = image.astype(np.float32) / 255.0
to be in a function that is called to_numpy_array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly isn't make_channel_first
always False
given that it has already been done before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a bit unrelated: slightly confusing to me that to_numpy_array does things like rescaling and transposing e.g. I wouldn't have expected:
I completely agree! As part of the work replacing feature extractors with image processors methods like to_numpy_array
and normalize
are being reworked so logic like rescaling is taken out.
# video can be a list of PIL images, list of NumPy arrays or list of PyTorch tensors | ||
# first: convert to list of NumPy arrays | ||
video = [self.to_numpy_array(frame) for frame in video] | ||
video = [self.to_numpy_array(frame, rescale=rescale, channel_first=channel_first) for frame in video] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When normalizing shouldn't rescale
always be True or does it make a difference? The function normalize_video
would IMO be easier to understand / read when rescale
is not an argument but instead we copy-paste this part of the code:
image = image.astype(np.float32) / 255.0
directly into the function.
Also isn't video
here already a numpy array so do we have to call self.to_numpy_array
at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When normalizing shouldn't rescale always be True or does it make a difference?
Good question.
It's made me realise a case when it might not be in __call__
: an input numpy/torch.tensor image with values rescaled between 0-1 and do_resize=False
, do_crop=False
.
In general it's not always True. The previous behaviour was that the image is rescaled if the input is a PIL image, but not rescaled if it's a numpy array or torch tensor.
As we convert to np.ndarray
before normalize
we need to make sure that:
- rescaling still happens if
do_normalize=True
and the image was converted from PIL -> numpy - rescaling still happens if
do_normalize=True
and the image has values between 0-255 - rescaling doesn't happen if
do_normalize
is False normalize
keeps its previous default behaviour
It's missing an explicit check for 1. & 2. I'll add it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function normalize_video would IMO be easier to understand / read when rescale is not an argument but instead we copy-paste this part of the code: image = image.astype(np.float32) / 255.0
Yes, I agree, the flag isn't super clear. For backwards compatibility, it's still necessary to not have it rescale by default on numpy arrays. What do you think is the best way to handle? I could move the rescaling to be within the feature extractor __call__
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also isn't video here already a numpy array so do we have to call self.to_numpy_array at all?
This was for consistency with other vision feature extractors. They can take PIL images as input and will convert them to numpy arrays (as well as rescale and transpose their axes). For them it's needed for backwards compatibility as the method is public. For this model's feature extractor, we could remove as it's not been released yet. It would remove redundancy but would depart from other normalize methods. What do you think is best?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I think consistency is an important point as well!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok if it's still necessary to not have it rescale by default then I think it makes sense to leave as is!
Just more generally I guess it'd be nice to move scaling and channel re-ordering out of to_numpy_array
- but that's probs better in another PR :-)
@@ -159,8 +162,20 @@ def __call__( | |||
videos = [self.resize_video(video, size=self.size, resample=self.resample) for video in videos] | |||
if self.do_center_crop and self.size is not None: | |||
videos = [self.crop_video(video, size=self.size) for video in videos] | |||
|
|||
# if do_normalize=False, the casting to a numpy array won't happen, so we need to do it here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would maybe just write in the comment here:
# cast to numpy array
and then not call self.to_numpy_array
in normalize_video
(see comments above) think this could make this code easier to read/understand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super nice PR description!
I left a couple of comments as I went through the code, some of which I think are a bit outside of this PR.
From taking a quick look it looks good to me, but maybe we could make the code a bit more readable / understandable, but passing less arguments to self.normalize_video(...)
and not calling to_numpy_array
twice when normalizing? But my comments can very well miss some logic as I don't know the code well, so feel free to ignore ;-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
PR wasn't merged as it was superseded by the image processors and no longer needed. |
What does this PR do?
This shows the changes to VideoMAE - casting the images to numpy arrays before normalizing. This resolves issues when the return type isn't as expected if flags like
do_normalize
are false. The changes here are representative of the changes for all vision model feature extractors.Once this has been approved. All model changes will be merged into this one for a final review before merging. A new PR was opened as I couldn't switch to the base repo (huggingface/transformers.git).
First PR introducing changes to the transforms to enable this: #18499 (comment)
Details below copied from this PR (for easy reference):
Other model PRs to be merged in:
Details
At the moment, if do_normalize=False, do_resize=True and return_tensors=None then the output tensors will be a list of PIL.Image.Image objects if even if the inputs are numpy arrays. If do_normalize=False and return_tensors is specified ("pt", "np", "tf", "jax") an exception is raised.
The main reasons for this are:
BatchFeature can't convert PIL.Image.Image to the requested tensors.
The necessary conversion of PIL.Image.Image -> np.ndarray happens within the normalize method and the output of resize is PIL.Image.Image.
In order to have the type of the returned pixel_values reflect return_tensors we need to:
Convert PIL.Image.Image objects to numpy arrays before passing to BatchFeature
Be able to optionally rescale the inputs in the normalize method. If the input to normalize is a PIL.Image.Image it is converted to a numpy array using to_numpy_array which rescales to between [0, 1]. If do_resize=False then this rescaling won't happen if the inputs are numpy arrays.
The optional flags enable us to preserve the same default behaviour for the resize and normalize methods whilst modifying the internal logic of the feature extractor call.
Checks
The model PRs are all cherry picked (file diffs) of type-cast-before-normalize
The following was run to check the outputs:
Output:
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.