Add Vision Transformer + ViTFeatureExtractor #10513
Conversation
Force-pushed from 392546f to 7b4f1c7
Force-pushed from dfc6660 to 7d3fff0
Hey @NielsRogge
Some common modeling tests depend on the specific parameter names (…). Also, the tests for … You could use the modeling tests of …

I like the overall design. We should also add the …
Force-pushed from a830014 to de78130
Force-pushed from de78130 to e01294c
Thanks a lot for adding this model! The main problem I have is with the self.self for the self-attention. It's there in BERT and there is nothing we can do about it now, but we can still make sure to use another name in newer models!
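For illustration, a minimal sketch of the naming difference being discussed (the class and attribute names here are illustrative placeholders, not the literal code of this PR):

```python
import torch.nn as nn


class SelfAttention(nn.Module):
    """Stand-in for an actual self-attention implementation."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)


class BertStyleAttention(nn.Module):
    # BERT-era naming: the sub-module is stored as `self.self`,
    # so downstream code reads `self.self.query`, which is confusing.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.self = SelfAttention(hidden_size)


class NewStyleAttention(nn.Module):
    # Preferred naming for newer models: a descriptive attribute name.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.attention = SelfAttention(hidden_size)
```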
"norm.weight", | ||
"norm.bias", | ||
"head.weight", | ||
"head.bias", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can all fit in one line.
I know, it's make style that does this.
Just tested locally and it did not change the line
ignore_keys = ["norm.weight", "norm.bias", "head.weight", "head.bias"]
This looks really good, fantastic job @NielsRogge!
Related to the model card:
- The model doesn't have a model card as of now; it would be amazing to have one
- The model configuration on your nielsr/vit-base-patch16-224 repo has all the labels as "LABELS_{i}"; it would be great to have the actual label names!

Other than that, it looks in very good shape! I'm wondering about the feature processor: as I understand it, it's not framework-agnostic. Also, the AutoModel is very low-hanging fruit since we already have the mapping.
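As a hedged sketch of how the label names could be fixed (ViTConfig is the configuration class added in this PR; the list of ImageNet class names is a placeholder that would need to be filled in):

```python
from transformers import ViTConfig  # assumes the config class introduced in this PR

# Placeholder: the full list of 1000 human-readable ImageNet-1k class names,
# e.g. loaded from a JSON file. Only the idea is sketched here.
imagenet_class_names = ["tench", "goldfish", "great white shark"]  # ... and 997 more

config = ViTConfig.from_pretrained("nielsr/vit-base-patch16-224")
config.id2label = {i: name for i, name in enumerate(imagenet_class_names)}
config.label2id = {name: i for i, name in config.id2label.items()}

# Saving and re-uploading config.json would make the hub show real label names
# instead of the auto-generated "LABELS_{i}" placeholders.
config.save_pretrained("./vit-base-patch16-224")
```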
@@ -319,6 +323,8 @@ TensorFlow and/or Flax.
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| Transformer-XL | ✅ | ❌ | ✅ | ✅ | ❌ |
+-----------------------------+----------------+----------------+-----------------+--------------------+--------------+
| ViT | ❌ | ❌ | ✅ | ❌ | ❌ |
Seeing this, it looks like the ViT support is quite incomplete, even though that's not the case. I think we should eventually rethink how this is designed so that feature processors are highlighted here. Maybe by modifying "Tokenizer slow" to be "Pre-processor" and "Tokenizer fast" to be "Performance-optimized pre-processor". Let's think about it cc @sgugger
This is for a further PR though ;-) But yes, definitely worth a look!
super().__init__()
image_size = to_2tuple(image_size)
patch_size = to_2tuple(patch_size)
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
A comment here would be helpful
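For instance, just as a sketch of the kind of comment that could be added to the snippet above:

```python
super().__init__()
# image_size and patch_size may be passed as a single int or a (height, width) tuple;
# to_2tuple normalizes both to a tuple.
image_size = to_2tuple(image_size)
patch_size = to_2tuple(patch_size)
# Number of non-overlapping patches the image is split into:
# (image width // patch width) * (image height // patch height).
num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
```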
Thanks for the reviews, I've addressed most of the comments. To do: …
There are multiple instances of weird styling. In general we use the 119-char line to its maximum (you can add a ruler in your IDE to see where it is). Sadly, make style does not put code that was split into several lines back on one line if you are using code copied from another part of the lib as a base (where the split might be justified because there were more objects or longer names in the original), so it has to be done by hand.
embedding_output = self.embeddings(
    pixel_values,
)

Suggested change:

embedding_output = self.embeddings(pixel_values)
""" | ||
Decorator marking a test that requires torchvision. | ||
|
||
These tests are skipped when torchvision isn't installed. | ||
|
||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
Decorator marking a test that requires torchvision. | |
These tests are skipped when torchvision isn't installed. | |
""" | |
""" | |
Decorator marking a test that requires torchvision. These tests are skipped when torchvision isn't installed. | |
""" |
(
    config,
    pixel_values,
    labels,
) = config_and_inputs

Suggested change:

config, pixel_values, labels = config_and_inputs
Cool, this looks great! Looking forward to seeing @sgugger's take on the feature processor.
Played around with the model a bit, it's fun! Great job on the implementation @NielsRogge!
I've addressed all comments. Most important updates: …

The remaining comments which are still open have to do with styling. I seem to have some issues with …
What does this PR do?
This PR includes 2 things:
it adds the Vision Transformer (ViT) by Google Brain. ViT is a Transformer encoder trained on ImageNet. It is capable of classifying images by placing a linear classification head on top of the final hidden state of the [CLS] token. I converted the weights from the timm repository, which already took care of converting the weights of the original implementation (which is written in JAX) into PyTorch. Once this model is added, we can also add DeiT (Data-efficient Image Transformers) by Facebook AI, which improves upon ViT.
it provides a design for the ViTFeatureExtractor class, which can be used to prepare images for the model. It inherits from FeatureExtractionMixin and defines a __call__ method. It currently accepts 3 types of inputs: PIL images, NumPy arrays and PyTorch tensors. It defines 2 transformations using torchvision: resizing + normalization. It then returns a BatchFeature object with 1 key, namely pixel_values.

Demo notebook of the combination of ViTForImageClassification + ViTFeatureExtractor: https://colab.research.google.com/drive/16TCM-tJ1Mfhs00Qas063kWZmAtVJcOeP?usp=sharing
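A minimal sketch of the intended usage (mirroring the demo notebook; the exact call signature, argument names and checkpoint contents are assumptions based on this PR's design and may differ from the final API):

```python
import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTForImageClassification

# Example image of two cats from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = ViTFeatureExtractor.from_pretrained("nielsr/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("nielsr/vit-base-patch16-224")

# Resize + normalize; returns a BatchFeature with a single key, "pixel_values".
inputs = feature_extractor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(pixel_values=inputs["pixel_values"]).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```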
Compared to NLP models (which accept input_ids, attention_mask and token_type_ids), this model only accepts pixel_values. The model itself then converts these pixel values into patches (in the case of ViT) in the ViTEmbeddings class.
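For intuition, a sketch of how pixel values are typically turned into patch embeddings in ViT-style models: a convolution whose kernel size and stride both equal the patch size, as in the timm implementation (shown here as a standalone sketch with the base-model defaults, not the literal code of this PR):

```python
import torch
import torch.nn as nn

batch_size, num_channels, image_size, patch_size, hidden_size = 2, 3, 224, 16, 768

# One strided convolution projects each 16x16 patch to a hidden_size-dim vector.
projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)

pixel_values = torch.randn(batch_size, num_channels, image_size, image_size)
patches = projection(pixel_values)                      # (2, 768, 14, 14)
patch_embeddings = patches.flatten(2).transpose(1, 2)   # (2, 196, 768): one row per patch
```

A [CLS] token embedding and position embeddings are then added before the sequence goes through the Transformer encoder.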
Help needed
Would be great if you can help me with the following tasks:
- test_modeling_vit.py and test_feature_extraction_vit.py. However, for the former, since ViT does not use input_ids/input_embeds, some tests are failing, so I wonder whether it should use all tests defined in test_modeling_common.py. For the latter, I also need some help in creating random inputs to test the feature extractor on (one possible sketch is shown after this list).
- head_mask in the forward of ViTModel. Possibly remove attention_mask?
- make fix-copies (doesn't work right now for me on Windows)
- Remove the is_decoder logic from modeling_vit.py (since the model was created using the CookieCutter template). I assume that things such as past_key_values are not required for an encoder-only model.
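One possible sketch for the random image inputs mentioned in the first item (the helper name and default sizes are made up for illustration):

```python
import numpy as np
import torch
from PIL import Image


def prepare_random_image_inputs(batch_size=4, num_channels=3, height=224, width=224, kind="pil"):
    """Create a batch of random images as PIL images, NumPy arrays or PyTorch tensors."""
    arrays = [
        np.random.randint(0, 256, (height, width, num_channels), dtype=np.uint8)
        for _ in range(batch_size)
    ]
    if kind == "pil":
        return [Image.fromarray(arr) for arr in arrays]
    if kind == "numpy":
        return arrays
    # PyTorch tensors in channels-first format, as the model expects.
    return [torch.from_numpy(arr).permute(2, 0, 1) for arr in arrays]
```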
Who can review?

@patrickvonplaten @LysandreJik @sgugger