
Update feature extractor methods to enable type cast before normalize #18499

Conversation

amyeroberts
Collaborator

@amyeroberts amyeroberts commented Aug 5, 2022

What does this PR do?

At the moment, the return type of our feature extractors isn't always as expected or sometimes fails if a do_xxx config flag is set to False. This PR introduces the necessary changes to the ImageFeatureExtractionMixin methods such that we can modify the feature extractor calls to fix this. This is an alternative solution to setting return_tensors="np" as default.

Each vision model using ImageFeatureExtractionMixin has a separate PR adding their necessary modifications and tests.

Details

At the moment, if do_normalize=False, do_resize=True and return_tensors=None, then the output tensors will be a list of PIL.Image.Image objects, even if the inputs are numpy arrays. If do_normalize=False and return_tensors is specified ("pt", "np", "tf", "jax"), an exception is raised.

The main reasons for this are:

  • BatchFeature can't convert PIL.Image.Image to the requested tensors.
  • The necessary conversion of PIL.Image.Image -> np.ndarray happens within the normalize method and the output of resize is PIL.Image.Image.

In order to have the type of the returned pixel_values reflect return_tensors we need to:

  • Convert PIL.Image.Image objects to numpy arrays before passing to BatchFeature
  • Be able to optionally rescale the inputs in the normalize method. If the input to normalize is a PIL.Image.Image it is converted to a numpy array using to_numpy_array which rescales to between [0, 1]. If do_resize=False then this rescaling won't happen if the inputs are numpy arrays.

The optional flags enable us to preserve the same default behaviour for the resize and normalize methods whilst modifying the internal logic of the feature extractor call.
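As a rough sketch (an assumed simplification for illustration, not the actual ImageFeatureExtractionMixin code), the optional rescale behaviour described above can be thought of like this:

```python
import numpy as np

def normalize(image, mean, std, rescale=False):
    """Sketch of a normalize step with an optional rescale flag.

    Rescaling to [0, 1] happens automatically for uint8 inputs (mirroring
    what a PIL -> numpy conversion via to_numpy_array would do), and can be
    forced with rescale=True for float inputs that still hold raw 0-255
    pixel values, e.g. when do_resize=False skipped the PIL round-trip.
    """
    array = np.asarray(image)
    if rescale or array.dtype == np.uint8:
        array = array.astype(np.float32) * (1 / 255.0)
    return (array - np.asarray(mean)) / np.asarray(std)

# uint8 input: rescaled automatically, then normalized
out = normalize(np.full((2, 2, 3), 255, dtype=np.uint8), mean=[0.5] * 3, std=[0.5] * 3)

# float input holding raw 0-255 values: the caller opts in to rescaling
raw = np.full((2, 2, 3), 255.0, dtype=np.float32)
out_rescaled = normalize(raw, mean=[0.5] * 3, std=[0.5] * 3, rescale=True)
```

With default flags the behaviour is unchanged for existing callers; only the feature extractor's internal call opts into the new path.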

Checks

The model PRs are all cherry-picked (file diffs) from the type-cast-before-normalize branch.

The following was run to check the outputs:

from dataclasses import dataclass

import requests
import numpy as np
from PIL import Image
import pygit2
from transformers import AutoFeatureExtractor

@dataclass
class FeatureExtractorConfig:
    model_name: str
    checkpoint: str
    return_type: str = "np"
    feat_name: str = "pixel_values"

IMAGE_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="clip", checkpoint="openai/clip-vit-base-patch32"),
    FeatureExtractorConfig(model_name="convnext", checkpoint="facebook/convnext-tiny-224"),
    FeatureExtractorConfig(model_name="deit", checkpoint="facebook/deit-base-distilled-patch16-224"),
    FeatureExtractorConfig(model_name="detr", checkpoint="facebook/detr-resnet-50"),
    FeatureExtractorConfig(model_name="dpt", checkpoint="Intel/dpt-large"),
    FeatureExtractorConfig(model_name="flava", checkpoint="facebook/flava-full"),
    FeatureExtractorConfig(model_name="glpn", checkpoint="vinvino02/glpn-kitti"),
    FeatureExtractorConfig(model_name="imagegpt", checkpoint="openai/imagegpt-small", feat_name='input_ids'),
    FeatureExtractorConfig(model_name="layoutlmv2", checkpoint="microsoft/layoutlmv2-base-uncased"),
    FeatureExtractorConfig(model_name="layoutlmv3", checkpoint="microsoft/layoutlmv3-base"),
    FeatureExtractorConfig(model_name="levit", checkpoint="facebook/levit-128S"),
    FeatureExtractorConfig(model_name="maskformer", checkpoint="facebook/maskformer-swin-base-ade", return_type="pt"),
    FeatureExtractorConfig(model_name="mobilevit", checkpoint="apple/mobilevit-small"),
    FeatureExtractorConfig(model_name="owlvit", checkpoint="google/owlvit-base-patch32"),
    FeatureExtractorConfig(model_name="perceiver", checkpoint="deepmind/vision-perceiver-fourier"),
    FeatureExtractorConfig(model_name="poolformer", checkpoint="sail/poolformer_s12"),
    FeatureExtractorConfig(model_name="segformer", checkpoint="nvidia/mit-b0"),
    FeatureExtractorConfig(model_name="vilt", checkpoint="dandelin/vilt-b32-mlm"),
    FeatureExtractorConfig(model_name="vit", checkpoint="google/vit-base-patch16-224-in21k"),
    FeatureExtractorConfig(model_name="yolos", checkpoint="hustvl/yolos-small"),
]

VIDEO_FEATURE_EXTRACTOR_CONFIGS = [
    FeatureExtractorConfig(model_name="videomae", checkpoint="MCG-NJU/videomae-base"),
]

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

def produce_pixel_value_outputs():
    BRANCH = pygit2.Repository('.').head.shorthand

    def get_processed_outputs(inputs, model_checkpoint, feat_name, return_type):
        feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
        outputs = feature_extractor(inputs, return_tensors=return_type)[feat_name]
        return outputs

    for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs(image, fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

    for fe_config in VIDEO_FEATURE_EXTRACTOR_CONFIGS:
        print(fe_config.model_name, fe_config.checkpoint)
        outputs = get_processed_outputs([[image, image]], fe_config.checkpoint, fe_config.feat_name, fe_config.return_type)
        np.save(f"{fe_config.model_name}_{BRANCH.replace('-', '_')}_pixel_values.npy", outputs)

branch_main = "main"
branch_feature = "type-cast-before-normalize"

repo = pygit2.Repository('.git')

print("\nChecking out main")
branch = repo.lookup_branch('main')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

print("\nChecking out type-cast-before-normalize")
branch = repo.lookup_branch('type-cast-before-normalize')
ref = repo.lookup_reference(branch.name)
repo.checkout(ref)

produce_pixel_value_outputs()

for fe_config in IMAGE_FEATURE_EXTRACTOR_CONFIGS + VIDEO_FEATURE_EXTRACTOR_CONFIGS:
    model_name = fe_config.model_name

    try:
        output_1 = np.load(f"{model_name}_{branch_main}_pixel_values.npy")
        output_2 = np.load(f"{model_name}_{branch_feature.replace('-', '_')}_pixel_values.npy")

        max_diff = np.amax(np.abs(output_1 - output_2))
        print(f"{model_name}: {max_diff:.5f}")
    except Exception as e:
        print(f"{model_name} failed check with {e}")

Output:

clip: 0.00000
convnext: 0.00000
deit: 0.00000
detr: 0.00000
dpt: 0.00000
flava: 0.00000
glpn: 0.00000
imagegpt: 0.00000
layoutlmv2: 0.00000
layoutlmv3: 0.00000
levit: 0.00000
maskformer: 0.00000
mobilevit: 0.00000
owlvit: 0.00000
perceiver: 0.00000
poolformer: 0.00000
segformer: 0.00000
vilt: 0.00000
vit: 0.00000
yolos: 0.00000
videomae: 0.00000

Fixes

#17714
#15055

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests? (in model PRs)

This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when do_xxx flags are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 5, 2022

The documentation is not available anymore as the PR was closed or merged.

We write a generic rescale function to handle rescaling of our arrays. In order for the API to be intuitive, we take some factor c and rescale the image values by it. This means the rescaling done in normalize and to_numpy_array is now done with array * (1/255) instead of array / 255. This leads to small differences in the resulting image; when testing, these were on the order of 1e-8, and so deemed OK.
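The size of that difference is easy to check in isolation; the snippet below (independent of the transformers code) compares the two float32 rescaling paths on random uint8-range pixel data:

```python
import numpy as np

# Compare dividing by 255 against multiplying by a precomputed 1/255.
# The reciprocal is rounded to float32 first, so the two paths can round
# differently by roughly one float32 ulp (~1e-8 for values near 1).
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(3, 224, 224), dtype=np.uint8).astype(np.float32)

divided = pixels / 255.0
multiplied = pixels * np.float32(1 / 255.0)

max_diff = np.abs(divided - multiplied).max()
```

Any difference stays far below visible precision, which is why it was deemed acceptable here, but it is large enough to trip exact-equality assertions in tests.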
@@ -58,13 +58,13 @@ def test_conversion_image_to_array(self):
     array3 = feature_extractor.to_numpy_array(image, rescale=False)
     self.assertTrue(array3.dtype, np.uint8)
     self.assertEqual(array3.shape, (3, 16, 32))
-    self.assertTrue(np.array_equal(array1, array3.astype(np.float32) / 255.0))
+    self.assertTrue(np.array_equal(array1, array3.astype(np.float32) * (1 / 255.0)))
Collaborator Author

This was changed to reflect the rescaling logic. The max difference between the arrays before this change was ~5e-8.

@amyeroberts amyeroberts changed the title Type cast before normalize update methods Update feature extractor methods to type cast before normalize Aug 8, 2022
Collaborator

@sgugger sgugger left a comment

Looks good to me! If the changes per model are small enough, it would probably be best to change them all in the same PR, rather than doing individual ones.

Contributor

@alaradirik alaradirik left a comment

Nice work! Looks good to me too

@amyeroberts
Collaborator Author

Looks good to me! If the changes per model are small enough, it would probably be best to change them all in the same PR, rather than doing individual ones.

@sgugger Yep, I completely agree. The changes all together aren't that small, but they're almost exactly the same across models. Once this is merged in, I'll open a PR for the VideoMAE refactor (https://github.com/amyeroberts/transformers/pull/9/files) as this covers all the changes. Once approved, I'll merge the other models into the branch, ask for re-review of the total PR, and then merge all together.

@amyeroberts amyeroberts changed the title Update feature extractor methods to type cast before normalize Update feature extractor methods to enable type cast before normalize Aug 17, 2022
@amyeroberts amyeroberts merged commit 49e44b2 into huggingface:main Aug 17, 2022
"""
Rescale a numpy image by scale amount
"""
self._ensure_format_supported(image)
Contributor
This checks whether the image is a PIL image, NumPy array or PyTorch tensor, but this method only expects NumPy arrays. So I'd update to raise a ValueError in case the image isn't an instance of np.ndarray.
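The suggested change could look roughly like this (a standalone sketch of the reviewer's suggestion, not the actual mixin method, which is defined on ImageFeatureExtractionMixin and takes self):

```python
import numpy as np

def rescale(image, scale):
    """
    Rescale a numpy image by scale amount.

    This method only operates on numpy arrays, so rather than the broader
    _ensure_format_supported check (which also accepts PIL images and
    PyTorch tensors), reject non-array inputs up front.
    """
    if not isinstance(image, np.ndarray):
        raise ValueError(f"Expected a numpy array, got {type(image)}")
    return image * scale
```

Failing fast here gives a clearer error than letting a PIL image reach the multiplication.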

amyeroberts added a commit to amyeroberts/transformers that referenced this pull request Aug 23, 2022
The image segmentation pipeline tests - tests/pipelines/test_pipelines_image_segmentation.py - were failing after the merging of huggingface#18499 (49e44b2). This was due to the difference in rescaling. Previously the images were rescaled by `image = image / 255`. In the new commit, a `rescale` method was added, and images rescaled using `image = image * scale`. This was known to cause small differences in the processed images (see
[PR comment](huggingface#18499 (comment))).

Testing locally, changing the `rescale` method to divide by a scale factor (255) resulted in the tests passing. It was therefore decided the test values could be updated, as there was no logic difference between the commits.
sgugger pushed a commit that referenced this pull request Sep 15, 2022
* Updated test values

The image segmentation pipeline tests - tests/pipelines/test_pipelines_image_segmentation.py - were failing after the merging of #18499 (49e44b2). This was due to the difference in rescaling. Previously the images were rescaled by `image = image / 255`. In the new commit, a `rescale` method was added, and images rescaled using `image = image * scale`. This was known to cause small differences in the processed images (see
[PR comment](#18499 (comment))).

Testing locally, changing the `rescale` method to divide by a scale factor (255) resulted in the tests passing. It was therefore decided the test values could be updated, as there was no logic difference between the commits.

* Use double quotes, like previous example

* Fix up
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
…huggingface#18499)

* Update methods to optionally rescale
This is necessary to allow for casting our images / videos to numpy arrays within the feature extractors' call. We want to do this to make sure the behaviour is as expected when do_xxx flags are False. If some transformations aren't applied, then the output type can be unexpected, e.g. a list of PIL images instead of numpy arrays.

* Cast images to numpy arrays in call to enable consistent behaviour with different configs

* Remove accidental clip changes

* Update tests to reflect the scaling logic
We write a generic rescale function to handle rescaling of our arrays. In order for the API to be intuitive, we take some factor c and rescale the image values by it. This means the rescaling done in normalize and to_numpy_array is now done with array * (1/255) instead of array / 255. This leads to small differences in the resulting image; when testing, these were on the order of 1e-8, and so deemed OK.
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022