
Update feature extractor methods to enable type cast before normalize #18499

Conversation

amyeroberts
Collaborator

@amyeroberts amyeroberts commented Aug 5, 2022

What does this PR do?

At the moment, the return type of our feature extractors isn't always as expected or sometimes fails if a do_xxx config flag is set to False. This PR introduces the necessary changes to the ImageFeatureExtractionMixin methods such that we can modify the feature extractor calls to fix this. This is an alternative solution to setting return_tensors="np" as default.

Each vision model using ImageFeatureExtractionMixin has a separate PR adding their necessary modifications and tests.

Details

At the moment, if do_normalize=False, do_resize=True and return_tensors=None, then the output tensors will be a list of PIL.Image.Image objects, even if the inputs are numpy arrays. If do_normalize=False and return_tensors is specified ("pt", "np", "tf", "jax"), an exception is raised.

The main reasons for this are:

  • BatchFeature can't convert PIL.Image.Image to the requested tensors.
  • The necessary conversion of PIL.Image.Image -> np.ndarray happens within the normalize method and the output of resize is PIL.Image.Image.

In order to have the type of the returned pixel_values reflect return_tensors we need to:

  • Convert PIL.Image.Image objects to numpy arrays before passing to BatchFeature
  • Be able to optionally rescale the inputs in the normalize method. If the input to normalize is a PIL.Image.Image it is converted to a numpy array using to_numpy_array which rescales to between [0, 1]. If do_resize=False then this rescaling won't happen if the inputs are numpy arrays.

The optional flags enable us to preserve the same default behaviour for the resize and normalize methods whilst modifying the internal logic of the feature extractor call.
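As a rough sketch (an assumed simplification for illustration, not the actual ImageFeatureExtractionMixin code), the optional rescale behaviour described above can be thought of like this:

```python
import numpy as np

def normalize(image, mean, std, rescale=False):
    """Sketch of a normalize step with an optional rescale flag.

    Rescaling to [0, 1] happens automatically for uint8 inputs (mirroring
    what a PIL -> numpy conversion via to_numpy_array would do), and can be
    forced with rescale=True for float inputs that still hold raw 0-255
    pixel values, e.g. when do_resize=False skipped the PIL round-trip.
    """
    array = np.asarray(image)
    if rescale or array.dtype == np.uint8:
        array = array.astype(np.float32) * (1 / 255.0)
    return (array - np.asarray(mean)) / np.asarray(std)

# uint8 input: rescaled automatically, then normalized
out = normalize(np.full((2, 2, 3), 255, dtype=np.uint8), mean=[0.5] * 3, std=[0.5] * 3)

# float input holding raw 0-255 values: the caller opts in to rescaling
raw = np.full((2, 2, 3), 255.0, dtype=np.float32)
out_rescaled = normalize(raw, mean=[0.5] * 3, std=[0.5] * 3, rescale=True)
```

With default flags the behaviour is unchanged for existing callers; only the feature extractor's internal call opts into the new path.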

Checks

The model PRs are all cherry-picked (file diffs) from the type-cast-before-normalize branch.

The following was run to check the outputs:

from dataclasses import dataclass

import requests
import numpy as np
from PIL import Image
import pygit2
from transformers import AutoFeatureExtractor

@dataclass
class FeatureExtractorConfig:
    model_name: str
    checkpoint: str
    return_type: str = "np"
    feat_name: str = "pixel_values"

IMAGE_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="clip", checkpoint="openai/clip-vit-base-patch32"),
    FeatureExtractorConfig(model_name="convnext", checkpoint="facebook/convnext-tiny-224"),
    FeatureExtractorConfig(model_name="deit", checkpoint="facebook/deit-base-distilled-patch16-224"),
    FeatureExtractorConfig(model_name="detr", checkpoint="facebook/detr-resnet-50"),
    FeatureExtractorConfig(model_name="dpt", checkpoint="Intel/dpt-large"),
    FeatureExtractorConfig(model_name="flava", checkpoint="facebook/flava-full"),
    FeatureExtractorConfig(model_name="glpn", checkpoint="vinvino02/glpn-kitti"),
    FeatureExtractorConfig(model_name="imagegpt", checkpoint="openai/imagegpt-small", feat_name='input_ids'),
    FeatureExtractorConfig(model_name="layoutlmv2", checkpoint="microsoft/layoutlmv2-base-uncased"),
    FeatureExtractorConfig(model_name="layoutlmv3", checkpoint="microsoft/layoutlmv3-base"),
    FeatureExtractorConfig(model_name="levit", checkpoint="facebook/levit-128S"),
    FeatureExtractorConfig(model_name="maskformer", checkpoint="facebook/maskformer-swin-base-ade", return_type="pt"),
    FeatureExtractorConfig(model_name="mobilevit", checkpoint="apple/mobilevit-small"),
    FeatureExtractorConfig(model_name="owlvit", checkpoint="google/owlvit-base-patch32"),
    FeatureExtractorConfig(model_name="perceiver", checkpoint="deepmind/vision-perceiver-fourier"),
    FeatureExtractorConfig(model_name="poolformer", checkpoint="sail/poolformer_s12"),
    FeatureExtractorConfig(model_name="segformer", checkpoint="nvidia/mit-b0"),
    FeatureExtractorConfig(model_name="vilt", checkpoint="dandelin/vilt-b32-mlm"),
    FeatureExtractorConfig(model_name="vit", checkpoint="google/vit-base-patch16-224-in21k"),
    FeatureExtractorConfig(model_name="yolos", checkpoint="hustvl/yolos-small"),
]

VIDEO_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="videomae", checkpoint="MCG-NJU/videomae-base"),
]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

def produce_pixel_value_outputs():
    BRANCH = pygit2.Repository('.').head.shorthand

    def get_processed_outputs(inputs, model_checkpoint, feat_name, return_type):
        feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
        outputs = feature_extractor(inputs, return_tensors=return_type)[feat_name]
        return outputs

    for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs(image, fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

    for fe_config in VIDEO_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs([[image, image]], fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

branch_main = "main"
branch_feature = "type-cast-before-normalize"

repo = pygit2.Repository('.git')

print("\nChecking out main")
branch = repo.lookup_branch('main')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

print("\nChecking out type-cast-before-normalize")
branch = repo.lookup_branch('type-cast-before-normalize')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS + VIDEO_FEATURE_EXTRACTOR_CONFIGS:
    model_name = fe_config.model_name

    try:
        output_1 = np.load(f"{model_name}_{branch_main}_pixel_values.npy")
        output_2 = np.load(f"{model_name}_{branch_feature.replace('-', '_')}_pixel_values.npy")

        max_diff = np.amax(np.abs(output_1 - output_2))
        print(f"{model_name}: {max_diff:.5f}")
    except Exception as e:
        print(f"{model_name} failed check with {e}")

Output:

clip: 0.00000
convnext: 0.00000
deit: 0.00000
detr: 0.00000
dpt: 0.00000
flava: 0.00000
glpn: 0.00000
imagegpt: 0.00000
layoutlmv2: 0.00000
layoutlmv3: 0.00000
levit: 0.00000
maskformer: 0.00000
mobilevit: 0.00000
owlvit: 0.00000
perceiver: 0.00000
poolformer: 0.00000
segformer: 0.00000
vilt: 0.00000
vit: 0.00000
yolos: 0.00000
videomae: 0.00000

Fixes

#17714
#15055

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests? (in model PRs)

This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when do_xxx flags are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 5, 2022

The documentation is not available anymore as the PR was closed or merged.

We write a generic rescale function to handle rescaling of our arrays. In order for the API to be intuitive, we take some factor c and rescale the image values by it. This means the rescaling done in normalize and to_numpy_array is now done with array * (1/255) instead of array / 255. This leads to small differences in the resulting image; when testing, these were on the order of 1e-8, and so deemed OK.
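The size of that difference is easy to check in isolation; the snippet below (independent of the transformers code) compares the two float32 rescaling paths on random uint8-range pixel data:

```python
import numpy as np

# Compare dividing by 255 against multiplying by a precomputed 1/255.
# The reciprocal is rounded to float32 first, so the two paths can round
# differently by roughly one float32 ulp (~1e-8 for values near 1).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 224, 224), dtype=np.uint8).astype(np.float32)

divided = pixels / 255.0
multiplied = pixels * np.float32(1 / 255.0)

max_diff = np.abs(divided - multiplied).max()
```

Any difference stays far below visible precision, which is why it was deemed acceptable here, but it is large enough to trip exact-equality assertions in tests.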
@@ -58,13 +58,13 @@ def test_conversion_image_to_array(self):
     array3 = feature_extractor.to_numpy_array(image, rescale=False)
     self.assertTrue(array3.dtype, np.uint8)
     self.assertEqual(array3.shape, (3, 16, 32))
-    self.assertTrue(np.array_equal(array1, array3.astype(np.float32) / 255.0))
+    self.assertTrue(np.array_equal(array1, array3.astype(np.float32) * (1 / 255.0)))
Collaborator Author

This was changed to reflect the rescaling logic. The max difference between the arrays before this change was ~5e-8.

@amyeroberts amyeroberts changed the title Type cast before normalize update methods Update feature extractor methods to type cast before normalize Aug 8, 2022
Collaborator

@sgugger sgugger left a comment

Looks good to me! If the changes per model are small enough, it would probably be best to change them all in the same PR, rather than doing individual ones.

Contributor

@alaradirik alaradirik left a comment

Nice work! Looks good to me too

@amyeroberts
Collaborator Author

Looks good to me! If the changes per model are small enough, it would probably be best to change them all in the same PR, rather than doing individual ones.

@sgugger Yep, I completely agree. The changes all together aren't that small, but they're almost exactly the same across models. Once this is merged in, I'll open a PR for the VideoMAE refactor (https://github.com/amyeroberts/transformers/pull/9/files) as this covers all the changes. Once approved, I'll merge the other models into the branch, ask for re-review of the total PR, and then merge all together.

@amyeroberts amyeroberts changed the title Update feature extractor methods to type cast before normalize Update feature extractor methods to enable type cast before normalize Aug 17, 2022
@amyeroberts amyeroberts merged commit 49e44b2 into huggingface:main Aug 17, 2022
"""
Rescale a numpy image by scale amount
"""
self._ensure_format_supported(image)
Contributor
This checks whether the image is a PIL image, NumPy array or PyTorch tensor, but this method only expects NumPy arrays. So I'd update to raise a ValueError in case the image isn't an instance of np.ndarray.
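The suggested change could look roughly like this (a standalone sketch of the reviewer's suggestion, not the actual mixin method, which is defined on ImageFeatureExtractionMixin and takes self):

```python
import numpy as np

def rescale(image, scale):
    """
    Rescale a numpy image by scale amount.

    This method only operates on numpy arrays, so rather than the broader
    _ensure_format_supported check (which also accepts PIL images and
    PyTorch tensors), reject non-array inputs up front.
    """
    if not isinstance(image, np.ndarray):
        raise ValueError(f"Expected a numpy array, got {type(image)}")
    return image * scale
```

Failing fast here gives a clearer error than letting a PIL image reach the multiplication.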

amyeroberts added a commit to amyeroberts/transformers that referenced this pull request Aug 23, 2022
The image segmentation pipeline tests - tests/pipelines/test_pipelines_image_segmentation.py - were failing after the merging of huggingface#18499 (49e44b2). This was due to the difference in rescaling. Previously the images were rescaled by `image = image / 255`. In the new commit, a `rescale` method was added, and images rescaled using `image = image * scale`. This was known to cause small differences in the processed images (see
[PR comment](huggingface#18499 (comment))).

Testing locally, changing the `rescale` method to divide by a scale factor (255) resulted in the tests passing. It was therefore decided the test values could be updated, as there was no logic difference between the commits.
sgugger pushed a commit that referenced this pull request Sep 15, 2022
* Updated test values

The image segmentation pipeline tests - tests/pipelines/test_pipelines_image_segmentation.py - were failing after the merging of #18499 (49e44b2). This was due to the difference in rescaling. Previously the images were rescaled by `image = image / 255`. In the new commit, a `rescale` method was added, and images rescaled using `image = image * scale`. This was known to cause small differences in the processed images (see
[PR comment](#18499 (comment))).

Testing locally, changing the `rescale` method to divide by a scale factor (255) resulted in the tests passing. It was therefore decided the test values could be updated, as there was no logic difference between the commits.

* Use double quotes, like previous example

* Fix up
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
…huggingface#18499)

* Update methods to optionally rescale
This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when do_xxx flags are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.

* Cast images to numpy arrays in call to enable consistent behaviour with different configs

* Remove accidental clip changes

* Update tests to reflect the scaling logic
We write a generic rescale function to handle rescaling of our arrays. In order for the API to be intuitive, we take some factor c and rescale the image values by it. This means the rescaling done in normalize and to_numpy_array is now done with array * (1/255) instead of array / 255. This leads to small differences in the resulting image; when testing, these were on the order of 1e-8, and so deemed OK.
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022