
🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models #33226

Merged: 12 commits into main from fix-interpolate-pos-encoding, Sep 5, 2024

Conversation

@xenova (Contributor) commented Aug 30, 2024

What does this PR do?

A much-needed overhaul of interpolate_pos_encoding to remove Python type casts and use the size variant of torch.nn.functional.interpolate instead of scale_factor, which was error-prone. This is done by abstracting the function into a separate vision utils file that accounts for class embeddings, upcasting before interpolation, etc. The option to copy this function across files was there, but this is a lot cleaner (imo) and will prevent such issues in the future. (A sketch of the new pattern follows the list of benefits below.)

This PR has the following benefits:

  • Fix TorchScript/torch.jit.trace to support dynamic shapes by avoiding Python typecasts and ensuring the correct branch is taken for interpolation. Among other things, this means these vision models can be exported to ONNX with dynamic shapes enabled.
  • Remove the hacky +0.1 offset used to prevent precision issues (see the original issue). This offset was originally added in DINOv1 and later corrected in DINOv2, which this implementation takes advantage of.
  • Fix the missing self.config = config in SwinEmbeddings.
  • Produce more accurate results when running inference at dimensions the models weren't trained on, since the +0.1 offset combined with scale_factor produced slightly off-center values.
  • Remove the unnecessary assertion checking the size after interpolation.
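
As a reference for the pattern described above, here is a minimal sketch of the size-based interpolation, assuming a ViT-style layout of one class token followed by a square grid of patch position embeddings; the traceable_int helper and all names are illustrative, not the verbatim transformers code:

```python
import torch
import torch.nn as nn


def traceable_int(x):
    # Under torch.jit.trace, sizes are traced as tensors; keep them that way so
    # the exported graph stays shape-dynamic. Eagerly, fall back to a plain int.
    if torch.jit.is_tracing() and isinstance(x, torch.Tensor):
        return x.to(torch.int64)
    return int(x)


def interpolate_pos_encoding(
    position_embeddings: torch.Tensor,  # (1, 1 + num_positions, dim), class token first
    height: int,
    width: int,
    patch_size: int,
) -> torch.Tensor:
    """Resize learned position embeddings to a new input resolution."""
    num_positions = position_embeddings.shape[1] - 1
    class_pos_embed = position_embeddings[:, :1]
    patch_pos_embed = position_embeddings[:, 1:]
    dim = position_embeddings.shape[-1]

    new_height = height // patch_size
    new_width = width // patch_size
    grid_size = traceable_int(num_positions**0.5)  # side of the original square grid

    patch_pos_embed = patch_pos_embed.reshape(1, grid_size, grid_size, dim)
    patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2)

    # The key change: request the target grid with `size=` instead of a float
    # `scale_factor`, so the output shape is exact and trace-friendly.
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed,
        size=(new_height, new_width),
        mode="bicubic",
        align_corners=False,
    )

    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)
```

Tracing this with torch.jit.trace keeps the target grid dynamic instead of baking the traced resolution into the graph; the real models additionally return the embeddings unchanged when no resizing is needed, skipping that shortcut under tracing.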

Fixes #33181, #32410

Linked PRs in Optimum: huggingface/optimum#2001

Overview of models

| model_type | num_class_embeds | fixed | ONNX exportable | model_id | comments |
|---|---|---|---|---|---|
| beit | 1 | | | microsoft/beit-base-patch16-224-pt22k-ft22k | |
| blip | 1 | | | | ONNX export not supported by Optimum |
| blip_2 | 1 | | | | ONNX export not supported by Optimum |
| data2vec | 1 | | | hf-internal-testing/tiny-random-Data2VecVisionModel | |
| deit | 2 | | | facebook/deit-tiny-distilled-patch16-224 | |
| vit_hybrid | 1 | | | | deprecated |
| dinov2 | 1 | | | facebook/dinov2-small-imagenet1k-1-layer | |
| donut | 1 | | | hf-internal-testing/tiny-random-DonutSwinModel | |
| dpt | 1 | | | Intel/dpt-hybrid-midas | has separate _resize_pos_embed method |
| flava | 1 | | | | ONNX export not supported by Optimum |
| groupvit | 0 | | | hf-internal-testing/tiny-random-GroupViTModel | |
| hiera | 0 | | | facebook/hiera-tiny-224-in1k-hf | |
| idefics | n/a | n/a | n/a | | relies on vision_model's forward call |
| instructblip | 1 | | | | ONNX export not supported by Optimum |
| instructblipvideo | 1 | | | | ONNX export not supported by Optimum |
| maskformer | 1 | | | facebook/maskformer-swin-tiny-coco | |
| perceiver | 0 | | | | custom interpolate function |
| pvt | 0 | | | | custom interpolate function |
| seggpt | 0 | | | | ONNX export not supported by Optimum; custom interpolate function |
| siglip | 0 | | | google/siglip-base-patch16-224 | |
| swin | 1 | | | microsoft/swin-tiny-patch4-window7-224 | SwinEmbeddings was missing self.config = config |
| swinv2 | 1 | | | microsoft/swinv2-tiny-patch4-window16-256 | |
| tvp | 0 | | | | custom implementation; suffers from the precision issue that DINO fixed |
| vit | 1 | | | google/vit-base-patch16-224 | |
| vit_mae | 1 | | | | ONNX export not supported by Optimum |
| vit_msn | 1 | | | hf-internal-testing/tiny-random-ViTMSNForImageClassification | |
| vivit | 1 | | | | ONNX export not supported by Optimum |
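
With the fix in place, any fixed model above can run at resolutions it wasn't trained on by passing interpolate_pos_encoding=True. A quick sanity check with the vit row (480x480 chosen arbitrarily):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("google/vit-base-patch16-224")  # trained at 224x224
pixel_values = torch.randn(1, 3, 480, 480)

# The position embeddings are resized on the fly to the 30x30 patch grid.
outputs = model(pixel_values, interpolate_pos_encoding=True)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 901, 768]): 1 CLS + 30*30 patches
```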

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@amyeroberts @NielsRogge @merveenoyan @qubvel

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@xenova (Contributor, Author) commented Aug 30, 2024

Failing tests seem unrelated (beam search; nothing to do with this PR).

Edit: For 100% backwards compatibility, perhaps we should add some "legacy" config value and default to the previous version (with the +0.1 offset), as done here?

@amyeroberts (Collaborator) left a comment

Thanks for opening this PR, for the detailed write-up, and for fixing this across all of these models!

Overall the change looks good and it's great to have this finally fixed ❤️

Two comments:

  • I'm not sure about abstracting the logic into a modeling_vision_utils.py module. "modeling_vision_utils" isn't a very well-defined utility module: utils have a tendency to gather code-dust, and the line between a vision utility and a common modeling pattern isn't clear-cut, unlike utilities such as attention masks or RoPE, which are more model-independent and where it's clear what belongs in the file. It's more in line with transformers to update the logic in all of these models using # Copied from statements.
  • We shouldn't remove the docstrings from the public methods.

@amyeroberts (Collaborator) commented

> Edit: For 100% backwards compatibility, perhaps we should add some "legacy" config value and default to the previous version (with the +0.1 offset), as done here?

In this case, since dropping scale_factor and the 0.1 offset is better and more correct, I'd say we can just consider this a fix and update without having to handle backwards compatibility. We should mark the PR with a 🚨 prefix to flag it for when we're preparing the release notes.
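
To see why size= is the safer call, a standalone sketch (not code from the PR): with scale_factor, the output size is computed as floor(input_size * scale), so float rounding can land one pixel short of the intended grid, which is exactly what the +0.1 fudge papered over.

```python
import torch
import torch.nn.functional as F

# A 14x14 grid of 8-dim patch position embeddings, as in a 224/16 ViT.
pos = torch.randn(1, 8, 14, 14)
target = 17  # e.g. running inference at 272x272 with 16px patches

# Old pattern: output size is floor(14 * scale). Whether this hits 17 depends
# on float rounding, hence the historical `+ 0.1` fudge on the target size.
scaled = F.interpolate(pos, scale_factor=(target + 0.1) / 14, mode="bicubic", align_corners=False)

# New pattern: ask for the target grid directly; the shape is exact by construction.
exact = F.interpolate(pos, size=(target, target), mode="bicubic", align_corners=False)

print(scaled.shape)  # depends on rounding; the fudge usually forces (1, 8, 17, 17)
print(exact.shape)   # always (1, 8, 17, 17)
```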

@SangbumChoi (Contributor) commented Sep 3, 2024

Hi @xenova. I'm highly interested in converting and double-checking ONNX inference for maskformer, mask2former, etc. (vision-specific models). Could you share the CLI command you used to convert maskformer here?

@xenova (Contributor, Author) commented Sep 3, 2024

@amyeroberts:

> It's more in line with transformers to update the logic in all of these models using # Copied from statements.

Sure I can do that! I will choose dino as the source for when num_class_embeds=1.

> We shouldn't remove the docstrings from the public methods.

Agreed - will add back!
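
For reference, # Copied from is the repo convention where the annotated function must stay identical to its source, enforced by the repo's consistency checks; a sketch of what the annotation looks like, with ViT as an illustrative source path (the source chosen in the final PR is per the comment above):

```python
# Copied from transformers.models.vit.modeling_vit.ViTEmbeddings.interpolate_pos_encoding
def interpolate_pos_encoding(self, embeddings: torch.Tensor, height: int, width: int) -> torch.Tensor:
    ...
```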


@SangbumChoi Using huggingface/optimum#2001:

```sh
optimum-cli export onnx --model MODEL_ID_GOES_HERE
```

should work.
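
For a concrete invocation with the maskformer checkpoint from the table above (the trailing output directory is required by the CLI; its name here is illustrative):

```sh
optimum-cli export onnx --model facebook/maskformer-swin-tiny-coco maskformer_onnx/
```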

@xenova (Contributor, Author) commented Sep 4, 2024

Addressed the comments, @amyeroberts, and tests pass now (the previous run's failure was flaky).

@amyeroberts (Collaborator) left a comment

Beautiful - thanks for fixing this!

@xenova xenova changed the title Fix torch.jit.trace for interpolate_pos_encoding in all vision models 🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models Sep 5, 2024
@xenova (Contributor, Author) commented Sep 5, 2024

Added 🚨 to PR! Merging!

@xenova merged commit c6d2848 into main on Sep 5, 2024; 23 checks passed.
@xenova deleted the fix-interpolate-pos-encoding branch on September 5, 2024 at 14:17.
@xenova mentioned this pull request on Sep 5, 2024.
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Sep 6, 2024
🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models (huggingface#33226)

* Fix `torch.jit.tracing` for `interpolate_pos_encoding` in all vision models

* Apply formatting

* Add missing `self.config = config`

* Fix copies

* Fix hiera interpolation unit test

* Formatting

* Update `_import_structure`

* make style

* Fix docstring

* Use `# Copied from` instead of utils

* DeiT variable renaming (`class_and_dist_pos_embed`)

* Fix Hiera `interpolate_pos_encoding`
@100 commented Sep 6, 2024

Hi @xenova, thanks for making this PR. Would it be possible to also add this fix for the models added in #29261 (i.e. Blip2VisionModelWithProjection)? Or, if it's already supported, let me know; it seems the forward method of that model currently does not support the interpolate_pos_encoding argument even though the other BLIP-2 models do.

itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models (huggingface#33226)
dataKim1201 pushed a commit to dataKim1201/transformers that referenced this pull request Oct 7, 2024
🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models (huggingface#33226)
Successfully merging this pull request may close these issues:

  • ONNX export may fail for Hiera due to math.sqrt and python type int casts