Type cast before normalize #18694

amyeroberts · 2022-08-19T11:04:07Z

What does this PR do?

This shows the changes to VideoMAE - casting the images to numpy arrays before normalizing. This resolves issues when the return type isn't as expected if flags like do_normalize are false. The changes here are representative of the changes for all vision model feature extractors.

Once this has been approved, all model changes will be merged into this one for a final review before merging. A new PR was opened as I couldn't switch to the base repo (huggingface/transformers.git).

First PR introducing changes to the transforms to enable this: #18499 (comment)
Details below copied from this PR (for easy reference):

Other model PRs to be merged in:

Details

At the moment, if do_normalize=False, do_resize=True and return_tensors=None then the output will be a list of PIL.Image.Image objects even if the inputs are numpy arrays. If do_normalize=False and return_tensors is specified ("pt", "np", "tf", "jax") an exception is raised.

The main reasons for this are:

- BatchFeature can't convert PIL.Image.Image to the requested tensors.
- The necessary conversion of PIL.Image.Image -> np.ndarray happens within the normalize method and the output of resize is PIL.Image.Image.

In order to have the type of the returned pixel_values reflect return_tensors we need to:

- Convert PIL.Image.Image objects to numpy arrays before passing to BatchFeature
- Be able to optionally rescale the inputs in the normalize method. If the input to normalize is a PIL.Image.Image it is converted to a numpy array using to_numpy_array which rescales to between [0, 1]. If do_resize=False then this rescaling won't happen if the inputs are numpy arrays.

The optional flags enable us to preserve the same default behaviour for the resize and normalize methods whilst modifying the internal logic of the feature extractor call.
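The control flow described above can be sketched roughly as follows. This is only an illustration using NumPy: `FakePILImage`, `to_numpy_array` and `preprocess` are hypothetical stand-ins written for this example, not the transformers implementation, and the choice to rescale only when normalization follows is an assumption.

```python
import numpy as np


class FakePILImage:
    """Hypothetical stand-in for PIL.Image.Image holding uint8 HWC pixels."""

    def __init__(self, pixels):
        self.pixels = np.asarray(pixels, dtype=np.uint8)


def to_numpy_array(image, rescale):
    # Convert to a float32 array; optionally rescale 0-255 values to [0, 1].
    array = image.pixels if isinstance(image, FakePILImage) else np.asarray(image)
    array = array.astype(np.float32)
    return array / 255.0 if rescale else array


def preprocess(images, do_normalize, mean=0.5, std=0.5):
    # Cast first, so the output type no longer depends on do_normalize.
    # Assumption: rescale only when normalization will follow.
    arrays = [to_numpy_array(img, rescale=do_normalize) for img in images]
    if do_normalize:
        arrays = [(a - mean) / std for a in arrays]
    return arrays


images = [FakePILImage(np.full((2, 2, 3), 255))]
raw = preprocess(images, do_normalize=False)
normalized = preprocess(images, do_normalize=True)
```

In both cases the result is a list of `np.ndarray`, so a container that converts arrays to tensors never sees an image object.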

Checks
The model PRs are all cherry picked (file diffs) of type-cast-before-normalize

The following was run to check the outputs:

from dataclasses import dataclass

import requests
import numpy as np
from PIL import Image
import pygit2
from transformers import AutoFeatureExtractor

@dataclass
class FeatureExtractorConfig:
    model_name: str
    checkpoint: str
    return_type: str = "np"
    feat_name: str = "pixel_values"

IMAGE_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="clip", checkpoint="openai/clip-vit-base-patch32"),
    FeatureExtractorConfig(model_name="convnext", checkpoint="facebook/convnext-tiny-224"),
    FeatureExtractorConfig(model_name="deit", checkpoint="facebook/deit-base-distilled-patch16-224"),
    FeatureExtractorConfig(model_name="detr", checkpoint="facebook/detr-resnet-50"),
    FeatureExtractorConfig(model_name="dpt", checkpoint="Intel/dpt-large"),
    FeatureExtractorConfig(model_name="flava", checkpoint="facebook/flava-full"),
    FeatureExtractorConfig(model_name="glpn", checkpoint="vinvino02/glpn-kitti"),
    FeatureExtractorConfig(model_name="imagegpt", checkpoint="openai/imagegpt-small", feat_name='input_ids'),
    FeatureExtractorConfig(model_name="layoutlmv2", checkpoint="microsoft/layoutlmv2-base-uncased"),
    FeatureExtractorConfig(model_name="layoutlmv3", checkpoint="microsoft/layoutlmv3-base"),
    FeatureExtractorConfig(model_name="levit", checkpoint="facebook/levit-128S"),
    FeatureExtractorConfig(model_name="maskformer", checkpoint="facebook/maskformer-swin-base-ade", return_type="pt"),
    FeatureExtractorConfig(model_name="mobilevit", checkpoint="apple/mobilevit-small"),
    FeatureExtractorConfig(model_name="owlvit", checkpoint="google/owlvit-base-patch32"),
    FeatureExtractorConfig(model_name="perceiver", checkpoint="deepmind/vision-perceiver-fourier"),
    FeatureExtractorConfig(model_name="poolformer", checkpoint="sail/poolformer_s12"),
    FeatureExtractorConfig(model_name="segformer", checkpoint="nvidia/mit-b0"),
    FeatureExtractorConfig(model_name="vilt", checkpoint="dandelin/vilt-b32-mlm"),
    FeatureExtractorConfig(model_name="vit", checkpoint="google/vit-base-patch16-224-in21k"),
    FeatureExtractorConfig(model_name="yolos", checkpoint="hustvl/yolos-small"),
]

VIDEO_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="videomae", checkpoint="MCG-NJU/videomae-base"),
]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

def produce_pixel_value_outputs():
    BRANCH = pygit2.Repository('.').head.shorthand

    def get_processed_outputs(inputs, model_checkpoint, feat_name, return_type):
        # return_type is passed explicitly rather than read from the enclosing loop variable
        feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
        outputs = feature_extractor(inputs, return_tensors=return_type)[feat_name]
        return outputs

    for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs(image, fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

    for fe_config in VIDEO_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        # A video is a list of frames; the batch here holds one two-frame video
        outputs = get_processed_outputs([[image, image]], fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

branch_main = "main"
branch_feature = "type-cast-before-normalize"

repo = pygit2.Repository('.git')

print("\nChecking out main")
branch = repo.lookup_branch('main')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

print("\nChecking out type-cast-before-normalize")
branch = repo.lookup_branch('type-cast-before-normalize')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS + VIDEO_FEATURE_EXTRACTOR_CONFIGS:
    model_name = fe_config.model_name

    try:
        output_1 = np.load(f"{model_name}_{branch_main}_pixel_values.npy")
        output_2 = np.load(f"{model_name}_{branch_feature.replace('-', '_')}_pixel_values.npy")

        max_diff = np.amax(np.abs(output_1 - output_2))
        print(f"{model_name}: {max_diff:.5f}")
    except Exception as e:
        print(f"{model_name} failed check with {e}")

Output:

clip: 0.00000
convnext: 0.00000
deit: 0.00000
detr: 0.00000
dpt: 0.00000
flava: 0.00000
glpn: 0.00000
imagegpt: 0.00000
layoutlmv2: 0.00000
layoutlmv3: 0.00000
levit: 0.00000
maskformer: 0.00000
mobilevit: 0.00000
owlvit: 0.00000
perceiver: 0.00000
poolformer: 0.00000
segformer: 0.00000
vilt: 0.00000
vit: 0.00000
yolos: 0.00000
videomae: 0.00000
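
The comparison at the end of the script boils down to a maximum absolute difference between two arrays. As a small self-contained sketch (the arrays are synthetic placeholders rather than real feature extractor outputs, and `max_abs_diff` is a hypothetical helper):

```python
import numpy as np


def max_abs_diff(a, b):
    # Largest element-wise absolute difference between two same-shaped arrays.
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    return float(np.amax(np.abs(a - b)))


reference = np.linspace(0.0, 1.0, 12, dtype=np.float32).reshape(1, 3, 2, 2)
candidate = reference.copy()
diff_same = max_abs_diff(reference, candidate)

# Perturb a single value to show that the check picks it up.
candidate[0, 0, 0, 0] += 0.25
diff_changed = max_abs_diff(reference, candidate)
```

The explicit shape check is a design choice for this sketch: without it, `a - b` could silently broadcast two differently shaped outputs and still return a number.
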
Fixes
https://github.com/huggingface/transformers/issues/17714
https://github.com/huggingface/transformers/issues/15055

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when flags like do_normalize are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.

alaradirik

Looks good to me! I just have a few comments about image_utils.py as it undoes your recent PR

src/transformers/image_utils.py

tests/models/videomae/test_feature_extraction_videomae.py

HuggingFaceDocBuilderDev · 2022-08-19T14:46:17Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

patrickvonplaten · 2022-08-19T19:14:08Z

src/transformers/models/videomae/feature_extraction_videomae.py

@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std):

        return (video - mean[None, :, None, None]) / std[None, :, None, None]

+    def to_numpy_array_video(self, video, rescale=None, channel_first=True):
+        return [self.to_numpy_array(frame, rescale, channel_first) for frame in video]
+
    def __call__(
        self, videos: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs


Sorry a bit unrelated to the PR, but why do we have **kwargs here?

Good question! I'm honestly not sure. cc @NielsRogge

+1, seems like **kwargs is just leftover code.

patrickvonplaten · 2022-08-19T19:21:35Z

src/transformers/models/videomae/feature_extraction_videomae.py

@@ -97,6 +97,9 @@ def normalize_video(self, video, mean, std):

        return (video - mean[None, :, None, None]) / std[None, :, None, None]

+    def to_numpy_array_video(self, video, rescale=None, channel_first=True):
+        return [self.to_numpy_array(frame, rescale, channel_first) for frame in video]


Also a bit unrelated: slightly confusing to me that to_numpy_array does things like rescaling and transposing, e.g. I wouldn't have expected:

image = image.transpose(2, 0, 1)

or

image = image.astype(np.float32) / 255.0

to be in a function that is called to_numpy_array

Similarly isn't make_channel_first always False given that it has already been done before?

Also a bit unrelated: slightly confusing to me that to_numpy_array does things like rescaling and transposing e.g. I wouldn't have expected:

I completely agree! As part of the work replacing feature extractors with image processors, methods like to_numpy_array and normalize are being reworked so that logic like rescaling is taken out.
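
To illustrate the separation being discussed, the two steps can live in their own small functions instead of inside a conversion method. This is a sketch only; `rescale` and `to_channel_first` are hypothetical names chosen for the example, not functions from the library.

```python
import numpy as np


def rescale(image):
    # Map 0-255 pixel values to [0, 1] as float32.
    return image.astype(np.float32) / 255.0


def to_channel_first(image):
    # (height, width, channels) -> (channels, height, width)
    return image.transpose(2, 0, 1)


frame = np.zeros((4, 6, 3), dtype=np.uint8)
frame[..., 0] = 255  # fully saturated first channel

out = to_channel_first(rescale(frame))
```

With the steps split out, the caller decides explicitly whether each one runs, and a plain type conversion no longer changes values or axis order.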

patrickvonplaten · 2022-08-19T19:29:34Z

src/transformers/models/videomae/feature_extraction_videomae.py

        # video can be a list of PIL images, list of NumPy arrays or list of PyTorch tensors
        # first: convert to list of NumPy arrays
-        video = [self.to_numpy_array(frame) for frame in video]
+        video = [self.to_numpy_array(frame, rescale=rescale, channel_first=channel_first) for frame in video]


When normalizing shouldn't rescale always be True or does it make a difference? The function normalize_video would IMO be easier to understand / read when rescale is not an argument but instead we copy-paste this part of the code:

image = image.astype(np.float32) / 255.0

directly into the function.
Also isn't video here already a numpy array so do we have to call self.to_numpy_array at all?

When normalizing shouldn't rescale always be True or does it make a difference?

Good question.

It's made me realise a case when it might not be in __call__: an input numpy/torch.tensor image with values rescaled between 0-1 and do_resize=False, do_crop=False.

In general it's not always True. The previous behaviour was that the image is rescaled if the input is a PIL image, but not rescaled if it's a numpy array or torch tensor.

As we convert to np.ndarray before normalize we need to make sure that:

1. rescaling still happens if do_normalize=True and the image was converted from PIL -> numpy
2. rescaling still happens if do_normalize=True and the image has values between 0-255
3. rescaling doesn't happen if do_normalize is False
4. normalize keeps its previous default behaviour

It's missing an explicit check for 1. & 2. I'll add it now.

The function normalize_video would IMO be easier to understand / read when rescale is not an argument but instead we copy-paste this part of the code: image = image.astype(np.float32) / 255.0

Yes, I agree, the flag isn't super clear. For backwards compatibility, it's still necessary to not have it rescale by default on numpy arrays. What do you think is the best way to handle this? I could move the rescaling to be within the feature extractor __call__?

Also isn't video here already a numpy array so do we have to call self.to_numpy_array at all?

This was for consistency with other vision feature extractors. They can take PIL images as input and will convert them to numpy arrays (as well as rescale and transpose their axes). For them it's needed for backwards compatibility as the method is public. For this model's feature extractor, we could remove it as it's not been released yet. It would remove redundancy but would depart from other normalize methods. What do you think is best?

No I think consistency is an important point as well!

Ah ok if it's still necessary to not have it rescale by default then I think it makes sense to leave as is!

Just more generally I guess it'd be nice to move scaling and channel re-ordering out of to_numpy_array - but that's probs better in another PR :-)
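
One way the rescaling decision from this thread could be made explicit is sketched below. It is a heuristic for illustration, not what the PR implements: `maybe_rescale` is a hypothetical helper, and treating integer dtypes as 0-255 data while leaving floats untouched is an assumption.

```python
import numpy as np


def maybe_rescale(image, do_normalize):
    # Assumption: integer dtypes hold 0-255 pixel values, floats are already in [0, 1].
    image = np.asarray(image)
    if do_normalize and np.issubdtype(image.dtype, np.integer):
        return image.astype(np.float32) / 255.0
    return image


uint8_image = np.array([[0, 255]], dtype=np.uint8)
float_image = np.array([[0.0, 1.0]], dtype=np.float32)

rescaled = maybe_rescale(uint8_image, do_normalize=True)
untouched_float = maybe_rescale(float_image, do_normalize=True)
untouched_uint8 = maybe_rescale(uint8_image, do_normalize=False)
```

A dtype check cannot tell apart a float array holding 0-255 values from one already in [0, 1], which is part of why an explicit flag was kept in the discussion above.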

patrickvonplaten · 2022-08-19T19:31:41Z

src/transformers/models/videomae/feature_extraction_videomae.py

@@ -159,8 +162,20 @@ def __call__(
            videos = [self.resize_video(video, size=self.size, resample=self.resample) for video in videos]
        if self.do_center_crop and self.size is not None:
            videos = [self.crop_video(video, size=self.size) for video in videos]
+
+        # if do_normalize=False, the casting to a numpy array won't happen, so we need to do it here


I would maybe just write in the comment here:
# cast to numpy array

and then not call self.to_numpy_array in normalize_video (see comments above). I think this could make this code easier to read/understand.
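
A rough sketch of what this suggestion could look like, with the cast done by the caller and the normalization operating on arrays only. The standalone `normalize_video` here is a simplified assumption modelled on the broadcasting line in the diff, not the method from the PR.

```python
import numpy as np


def normalize_video(video, mean, std):
    # video: (num_frames, channels, height, width); mean/std: one value per channel.
    video = np.asarray(video, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    # Indexing with None adds singleton axes so the statistics broadcast
    # over the frame, height and width dimensions.
    return (video - mean[None, :, None, None]) / std[None, :, None, None]


frames = [np.full((3, 2, 2), 0.5, dtype=np.float32) for _ in range(4)]
video = np.stack(frames)  # cast to a single array in the caller, before normalizing
out = normalize_video(video, mean=[0.5, 0.5, 0.5], std=[0.25, 0.25, 0.25])
```

Here `mean[None, :, None, None]` has shape `(1, 3, 1, 1)`, so each channel's statistics apply to every frame and pixel.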

patrickvonplaten

Super nice PR description!

I left a couple of comments as I went through the code, some of which I think are a bit outside of this PR.

From taking a quick look it looks good to me, but maybe we could make the code a bit more readable / understandable by passing fewer arguments to self.normalize_video(...) and not calling to_numpy_array twice when normalizing? But my comments can very well miss some logic as I don't know the code well, so feel free to ignore ;-)

github-actions · 2022-09-27T15:03:18Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

amyeroberts · 2022-11-30T16:37:02Z

PR wasn't merged as it was superseded by the image processors and no longer needed.

amyeroberts added 7 commits August 5, 2022 14:14

Cast images to numpy arrays in call to enable consistent behaviour with different configs

134e7a7

Cast frames to numpy arrays in call to enable consistent behaviour with different configs

94515a2

Remove accidental clip changes

1ace93b

Remove accidental clip changes

6e21b7d

Merge branch 'type-cast-before-normalize-update-methods' into type-cast-before-normalize-videomae

51e4959

Make sure defaults are the same as before

2b967ed

amyeroberts requested review from NielsRogge, alaradirik and LysandreJik August 19, 2022 11:04

alaradirik reviewed Aug 19, 2022

amyeroberts and others added 4 commits August 19, 2022 15:24

Resolve merge conflicts with main

665c942

Update tests/models/videomae/test_feature_extraction_videomae.py

2530dbf

Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>

Update tests/models/videomae/test_feature_extraction_videomae.py

0bf8aaf

Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>

Update tests/models/videomae/test_feature_extraction_videomae.py

c71e46c

Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>

amyeroberts requested a review from patrickvonplaten August 19, 2022 15:19

patrickvonplaten reviewed Aug 19, 2022

patrickvonplaten approved these changes Aug 19, 2022

Don't always rescale

6b264fd

github-actions bot closed this Oct 6, 2022

amyeroberts deleted the type-cast-before-normalize-videomae branch November 30, 2022 16:37
