Type cast before normalize #18694

amyeroberts
Collaborator

What does this PR do?

This shows the changes to VideoMAE - casting the images to numpy arrays before normalizing. This resolves issues when the return type isn't as expected if flags like do_normalize are false. The changes here are representative of the changes for all vision model feature extractors.

Once this has been approved, all model changes will be merged into this one for a final review before merging. A new PR was opened as I couldn't switch to the base repo (huggingface/transformers.git).

First PR introducing changes to the transforms to enable this: #18499 (comment)
Details below copied from this PR (for easy reference):

Other model PRs to be merged in:

Details

At the moment, if do_normalize=False, do_resize=True and return_tensors=None then the output tensors will be a list of PIL.Image.Image objects even if the inputs are numpy arrays. If do_normalize=False and return_tensors is specified ("pt", "np", "tf", "jax") an exception is raised.

The main reasons for this are:

  • BatchFeature can't convert PIL.Image.Image to the requested tensors.
  • The necessary conversion of PIL.Image.Image -> np.ndarray happens within the normalize method and the output of resize is PIL.Image.Image.

In order to have the type of the returned pixel_values reflect return_tensors we need to:

  • Convert PIL.Image.Image objects to numpy arrays before passing to BatchFeature
  • Be able to optionally rescale the inputs in the normalize method. If the input to normalize is a PIL.Image.Image it is converted to a numpy array using to_numpy_array which rescales to between [0, 1]. If do_resize=False then this rescaling won't happen if the inputs are numpy arrays.

The optional flags enable us to preserve the same default behaviour for the resize and normalize methods whilst modifying the internal logic of the feature extractor call.
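
The casting step described above could be sketched roughly as follows. This is only an illustration and not the code from this PR: `cast_to_numpy` and its `rescale` flag are hypothetical names, and the sketch relies solely on `np.asarray`, which accepts `PIL.Image.Image` objects as well as existing arrays.

```python
import numpy as np


def cast_to_numpy(image, rescale=False):
    """Hypothetical helper: convert an image-like object to a NumPy array.

    np.asarray accepts PIL images as well as existing arrays, so the output
    type no longer depends on which later steps (e.g. normalize) are run.
    """
    array = np.asarray(image)
    if rescale:
        # Optional rescaling from [0, 255] to [0, 1]; off by default so that
        # inputs which are already scaled are left untouched.
        array = array.astype(np.float32) / 255.0
    return array


# Stand-in for a resized image: height 2, width 3, 3 channels, uint8 values.
resized = np.full((2, 3, 3), 255, dtype=np.uint8)

unscaled = cast_to_numpy(resized)
scaled = cast_to_numpy(resized, rescale=True)
print(type(unscaled).__name__, unscaled.dtype, scaled.dtype)
```

Keeping `rescale` off by default mirrors the intent stated above: the default behaviour stays the same, and the caller opts in to rescaling only when it is needed.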

Checks
The model PRs are all cherry picked (file diffs) of type-cast-before-normalize

The following was run to check the outputs:

from dataclasses import dataclass

import requests
import numpy as np
from PIL import Image
import pygit2
from transformers import AutoFeatureExtractor

@dataclass
class FeatureExtractorConfig:
    model_name: str
    checkpoint: str
    return_type: str = "np"
    feat_name: str = "pixel_values"

IMAGE_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="clip", checkpoint="openai/clip-vit-base-patch32"),
    FeatureExtractorConfig(model_name="convnext", checkpoint="facebook/convnext-tiny-224"),
    FeatureExtractorConfig(model_name="deit", checkpoint="facebook/deit-base-distilled-patch16-224"),
    FeatureExtractorConfig(model_name="detr", checkpoint="facebook/detr-resnet-50"),
    FeatureExtractorConfig(model_name="dpt", checkpoint="Intel/dpt-large"),
    FeatureExtractorConfig(model_name="flava", checkpoint="facebook/flava-full"),
    FeatureExtractorConfig(model_name="glpn", checkpoint="vinvino02/glpn-kitti"),
    FeatureExtractorConfig(model_name="imagegpt", checkpoint="openai/imagegpt-small", feat_name='input_ids'),
    FeatureExtractorConfig(model_name="layoutlmv2", checkpoint="microsoft/layoutlmv2-base-uncased"),
    FeatureExtractorConfig(model_name="layoutlmv3", checkpoint="microsoft/layoutlmv3-base"),
    FeatureExtractorConfig(model_name="levit", checkpoint="facebook/levit-128S"),
    FeatureExtractorConfig(model_name="maskformer", checkpoint="facebook/maskformer-swin-base-ade", return_type="pt"),
    FeatureExtractorConfig(model_name="mobilevit", checkpoint="apple/mobilevit-small"),
    FeatureExtractorConfig(model_name="owlvit", checkpoint="google/owlvit-base-patch32"),
    FeatureExtractorConfig(model_name="perceiver", checkpoint="deepmind/vision-perceiver-fourier"),
    FeatureExtractorConfig(model_name="poolformer", checkpoint="sail/poolformer_s12"),
    FeatureExtractorConfig(model_name="segformer", checkpoint="nvidia/mit-b0"),
    FeatureExtractorConfig(model_name="vilt", checkpoint="dandelin/vilt-b32-mlm"),
    FeatureExtractorConfig(model_name="vit", checkpoint="google/vit-base-patch16-224-in21k"),
    FeatureExtractorConfig(model_name="yolos", checkpoint="hustvl/yolos-small"),
]

VIDEO_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="videomae", checkpoint="MCG-NJU/videomae-base"),
]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

def produce_pixel_value_outputs():
    # Name of the currently checked-out branch, used to label the saved files
    BRANCH = pygit2.Repository('.').head.shorthand

    # return_type is passed explicitly rather than read from the enclosing loop variable
    def get_processed_outputs(inputs, model_checkpoint, feat_name, return_type):
        feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
        outputs = feature_extractor(inputs, return_tensors=return_type)[feat_name]
        return outputs

    for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs(image, fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

    # A video is passed as a list of frames, and the batch as a list of videos
    for fe_config in VIDEO_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs([[image, image]], fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

branch_main = "main"
branch_feature = "type-cast-before-normalize"

repo = pygit2.Repository('.git')

print("\nChecking out main")
branch = repo.lookup_branch('main')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

print("\nChecking out type-cast-before-normalize")
branch = repo.lookup_branch('type-cast-before-normalize')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS + VIDEO_FEATURE_EXTRACTOR_CONFIGS:
    model_name = fe_config.model_name

    try:
        output_1 = np.load(f"{model_name}_{branch_main}_pixel_values.npy")
        output_2 = np.load(f"{model_name}_{branch_feature.replace('-', '_')}_pixel_values.npy")

        max_diff = np.amax(np.abs(output_1 - output_2))
        print(f"{model_name}: {max_diff:.5f}")
    except Exception as e:
        print(f"{model_name} failed check with {e}")

Output:

clip: 0.00000
convnext: 0.00000
deit: 0.00000
detr: 0.00000
dpt: 0.00000
flava: 0.00000
glpn: 0.00000
imagegpt: 0.00000
layoutlmv2: 0.00000
layoutlmv3: 0.00000
levit: 0.00000
maskformer: 0.00000
mobilevit: 0.00000
owlvit: 0.00000
perceiver: 0.00000
poolformer: 0.00000
segformer: 0.00000
vilt: 0.00000
vit: 0.00000
yolos: 0.00000
videomae: 0.00000
Fixes
https://github.com/huggingface/transformers/issues/17714
https://github.com/huggingface/transformers/issues/15055

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when flags like  are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.
Contributor

@alaradirik alaradirik left a comment

Looks good to me! I just have a few comments about image_utils.py as it undoes your recent PR

amyeroberts and others added 4 commits August 19, 2022 15:24
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std):

        return (video - mean[None, :, None, None]) / std[None, :, None, None]

    def to_numpy_array_video(self, video, rescale=None, channel_first=True):
        return [self.to_numpy_array(frame, rescale, channel_first) for frame in video]

    def __call__(
        self, videos: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
Contributor

Sorry a bit unrelated to the PR, but why do we have **kwargs here?

Collaborator Author

Good question! I'm honestly not sure. cc @NielsRogge

Contributor

+1, seems like **kwargs is just leftover code.

@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std):

        return (video - mean[None, :, None, None]) / std[None, :, None, None]

    def to_numpy_array_video(self, video, rescale=None, channel_first=True):
        return [self.to_numpy_array(frame, rescale, channel_first) for frame in video]
Contributor

Also a bit unrelated: slightly confusing to me that to_numpy_array does things like rescaling and transposing, e.g. I wouldn't have expected:

image = image.transpose(2, 0, 1)   

or

image = image.astype(np.float32) / 255.0

to be in a function that is called to_numpy_array
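
To make the concern concrete, here is a rough sketch of what the three steps look like when they are kept apart. The function names are hypothetical and are not part of the library's API; the two one-liners quoted above are simply moved into their own functions.

```python
import numpy as np


def to_array(image):
    # Type conversion only: values and axis order are left unchanged.
    return np.asarray(image)


def rescale(image):
    # Value rescaling from [0, 255] to [0, 1].
    return image.astype(np.float32) / 255.0


def to_channel_first(image):
    # Axis reordering from (height, width, channels) to (channels, height, width).
    return image.transpose(2, 0, 1)


image = np.zeros((4, 5, 3), dtype=np.uint8)  # (height, width, channels)
converted = to_channel_first(rescale(to_array(image)))
print(converted.shape, converted.dtype)
```

With this split, a caller that only wants the type conversion can call `to_array` without also triggering rescaling or transposition.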

Contributor

Similarly isn't make_channel_first always False given that it has already been done before?

Collaborator Author

Also a bit unrelated: slightly confusing to me that to_numpy_array does things like rescaling and transposing e.g. I wouldn't have expected:

I completely agree! As part of the work replacing feature extractors with image processors, methods like to_numpy_array and normalize are being reworked so that logic like rescaling is taken out.

  # video can be a list of PIL images, list of NumPy arrays or list of PyTorch tensors
  # first: convert to list of NumPy arrays
- video = [self.to_numpy_array(frame) for frame in video]
+ video = [self.to_numpy_array(frame, rescale=rescale, channel_first=channel_first) for frame in video]
Contributor

When normalizing shouldn't rescale always be True or does it make a difference? The function normalize_video would IMO be easier to understand / read when rescale is not an argument but instead we copy-paste this part of the code:

image = image.astype(np.float32) / 255.0

directly into the function.
Also isn't video here already a numpy array so do we have to call self.to_numpy_array at all?
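
For illustration, the suggestion above (rescaling inline instead of via a `rescale` argument) might look roughly like the sketch below. It is a standalone function rather than the method from the diff, and it assumes the frames are channel-first uint8 arrays with values in [0, 255]; neither assumption is guaranteed by the PR.

```python
import numpy as np


def normalize_video(frames, mean, std):
    """Sketch: normalize a video given as a list of (C, H, W) uint8 frames."""
    video = np.stack([np.asarray(frame) for frame in frames])  # (T, C, H, W)
    # Rescale inline from [0, 255] to [0, 1] instead of taking a rescale flag.
    video = video.astype(np.float32) / 255.0
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    # Broadcast the per-channel statistics over frames, height and width.
    return (video - mean[None, :, None, None]) / std[None, :, None, None]


# Four frames with 3 channels and 2x2 pixels, all set to the maximum value.
frames = [np.full((3, 2, 2), 255, dtype=np.uint8) for _ in range(4)]
normalized = normalize_video(frames, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
print(normalized.shape, normalized.dtype)
```

Note that always rescaling like this would change the result for inputs that are already in [0, 1], which is the backwards-compatibility question raised in this thread.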

Collaborator Author

When normalizing shouldn't rescale always be True or does it make a difference?

Good question.

It's made me realise a case when it might not be in __call__: an input numpy/torch.tensor image with values rescaled between 0-1 and do_resize=False, do_crop=False.

In general it's not always True. The previous behaviour was that the image is rescaled if the input is a PIL image, but not rescaled if it's a numpy array or torch tensor.

As we convert to np.ndarray before normalize we need to make sure that:

  1. rescaling still happens if do_normalize=True and the image was converted from PIL -> numpy
  2. rescaling still happens if do_normalize=True and the image has values between 0-255
  3. rescaling doesn't happen if do_normalize is False
  4. normalize keeps its previous default behaviour

It's missing an explicit check for 1. & 2. I'll add it now.
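
As a rough sketch of cases 1 to 3 in the list above: the helper name `should_rescale`, the `converted_from_pil` flag and the "maximum value above 1" heuristic are all assumptions made for illustration, not the check that was added to the PR. Case 4 concerns the default arguments of normalize and is not covered here.

```python
import numpy as np


def should_rescale(image, do_normalize, converted_from_pil):
    """Hypothetical check for whether to rescale to [0, 1] before normalizing."""
    if not do_normalize:
        # Case 3: without normalization the values are left as they are.
        return False
    if converted_from_pil:
        # Case 1: arrays converted from PIL images hold values in [0, 255].
        return True
    # Case 2: heuristic guess that an array with values above 1 is in [0, 255].
    return bool(np.max(image) > 1)


pixels_255 = np.array([[0, 128, 255]], dtype=np.uint8)
pixels_unit = np.array([[0.0, 0.5, 1.0]], dtype=np.float32)

print(should_rescale(pixels_255, do_normalize=True, converted_from_pil=False))
print(should_rescale(pixels_unit, do_normalize=True, converted_from_pil=False))
```

The value-based heuristic cannot distinguish a genuinely dark [0, 255] image whose maximum is 0 or 1 from an image already in [0, 1], so an explicit flag is the more reliable signal where one is available.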

Collaborator Author

The function normalize_video would IMO be easier to understand / read when rescale is not an argument but instead we copy-paste this part of the code: image = image.astype(np.float32) / 255.0

Yes, I agree, the flag isn't super clear. For backwards compatibility, it's still necessary to not have it rescale by default on numpy arrays. What do you think is the best way to handle this? I could move the rescaling to be within the feature extractor __call__ ?

Collaborator Author

Also isn't video here already a numpy array so do we have to call self.to_numpy_array at all?

This was for consistency with other vision feature extractors. They can take PIL images as input and will convert them to numpy arrays (as well as rescale and transpose their axes). For them it's needed for backwards compatibility as the method is public. For this model's feature extractor, we could remove it as it's not been released yet. That would remove redundancy but would depart from the other normalize methods. What do you think is best?

Contributor

No I think consistency is an important point as well!

Contributor

Ah ok if it's still necessary to not have it rescale by default then I think it makes sense to leave as is!

Just more generally I guess it'd be nice to move scaling and channel re-ordering out of to_numpy_array - but that's probs better in another PR :-)

@@ -159,8 +162,20 @@ def __call__(
            videos = [self.resize_video(video, size=self.size, resample=self.resample) for video in videos]
        if self.do_center_crop and self.size is not None:
            videos = [self.crop_video(video, size=self.size) for video in videos]

        # if do_normalize=False, the casting to a numpy array won't happen, so we need to do it here
Contributor

@patrickvonplaten patrickvonplaten Aug 19, 2022

I would maybe just write in the comment here:
# cast to numpy array

and then not call self.to_numpy_array in normalize_video (see comments above). I think this could make the code easier to read/understand.

Contributor

@patrickvonplaten patrickvonplaten left a comment

Super nice PR description!

I left a couple of comments as I went through the code, some of which I think are a bit outside of this PR.

From taking a quick look it looks good to me, but maybe we could make the code a bit more readable / understandable by passing fewer arguments to self.normalize_video(...) and not calling to_numpy_array twice when normalizing? But my comments can very well miss some logic as I don't know the code well, so feel free to ignore ;-)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Oct 6, 2022
@amyeroberts
Collaborator Author

PR wasn't merged as it was superseded by the image processors and no longer needed.

@amyeroberts amyeroberts deleted the type-cast-before-normalize-videomae branch November 30, 2022 16:37