Add ViLT #14895
Conversation
This looks good, thanks for working on it @NielsRogge!
I left a few comments, and would love @sgugger's review before this is merged.
@@ -102,6 +102,9 @@
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    # models to ignore for model xxx mapping
    "ViltForMaskedLM",
AutoModelForMaskedLM should correctly return this, no?
I asked @Narsil about this, but AutoModelForMaskedLM doesn't currently accept models that take several modalities as input. ViLT takes in both pixel_values and input_ids, and you can mask out several input_ids, which the model then needs to predict. However, the "fill-mask" pipeline currently only works for models that take input_ids alone as input.
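For context, here is a rough sketch of what a masked-LM forward pass looks like for ViLT, where both modalities are required. The checkpoint name and the ViltProcessor workflow are assumptions about how the model will be published, not details stated in this thread:

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForMaskedLM

# Assumed checkpoint name; any ViLT masked-LM checkpoint would do.
checkpoint = "dandelin/vilt-b32-mlm"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForMaskedLM.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
text = "a bunch of [MASK] laying on a [MASK]."

# The processor produces both modalities; the model predicts the masked input_ids
# conditioned on the image, so a text-only pipeline can't drive it as-is.
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, text sequence length, vocab_size)
```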
The fill-mask pipeline doesn't work with multiple modalities, and it assumes every AutoModelForMaskedLM is for filling text-only masks. As I mentioned too:
- Either we don't make it AutoMaskedLM,
- Or we make it AutoMaskedLM, but then we need an escape hatch of some kind so that the pipeline can know it's not supposed to work (or it works and simply doesn't use the image, or uses a fully padded image, or something along those lines).
AutoModel should work nonetheless (I assume this discards that).
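Purely as an illustration of what such an escape hatch could look like (this is a hypothetical sketch, not the actual FillMaskPipeline code), the pipeline could inspect the model's forward signature:

```python
import inspect


def accepts_text_only_masking(model) -> bool:
    """Hypothetical check: could the fill-mask pipeline drive this model with text alone?"""
    forward_params = inspect.signature(model.forward).parameters
    # A text-only masked-LM can be fed input_ids alone; a multi-modal model
    # such as ViLT also expects pixel_values, which the pipeline can't provide.
    return "pixel_values" not in forward_params
```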
I'd like the PR to be green (or mostly green) before reviewing.
(force-pushed from 18c9637 to 4742d49)
@sgugger should be mostly green now.
(force-pushed from afc75d2 to d66f5bf)
Thanks a lot for adding this model!
Regarding your question about the tests, I think the easiest would be to define a new Tester and Test class for ViltForNaturalLanguageVisualReasoning that inherits from the main Tester and Test class this PR adds, then override the method that gets the config, so that you don't have to rewrite all the tests.
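A minimal sketch of that suggestion, assuming the usual transformers test layout (the class names here are illustrative, not this PR's exact ones):

```python
from transformers import ViltConfig


class ViltModelTester:
    """Stand-in for the main tester class this PR adds."""

    def get_config(self):
        return ViltConfig(modality_type_vocab_size=2)


class ViltForNaturalLanguageVisualReasoningModelTester(ViltModelTester):
    """Inherits all test inputs and only overrides the config."""

    def get_config(self):
        config = super().get_config()
        # NLVR2 pairs two images with one sentence, hence the third modality type.
        config.modality_type_vocab_size = 3
        return config
```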
docs/source/model_doc/vilt.mdx (outdated)

[[autodoc]] ViltForVisualQuestionAnswering
    - forward

## ViltForNaturalLanguageVisualReasoning
I have no idea what Natural Language Visual Reasoning means, so there is probably a better name to find here.
It's because this model was fine-tuned on NLVR: https://lil.nlp.cornell.edu/nlvr/
It probably deserves a nice introduction in the docstring of that model.
@@ -326,12 +326,6 @@ def forward(self, hidden_states, head_mask=None, output_attentions=False):

    # in ViT, layernorm is also applied after self-attention
    layer_output = self.layernorm_after(hidden_states)
Same comment as for Beit above.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
(force-pushed from ef6b57c to 7a3b2bb)
Note: with the new build dev job merged, you can preview the doc here :-)
Great job merging this PR! The documentation will now be removed from the staging environment.
What does this PR do?
This PR adds ViLT (Vision and Language Transformer).
It's a very nice, minimal multi-modal model, as it only adds a text embedding layer to an existing ViT.
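To illustrate that point, here is a conceptual sketch, not the PR's actual module code (position embeddings and the CLS token are omitted): text tokens and image patches are embedded separately, tagged with a modality-type embedding, concatenated, and then processed jointly by a single ViT encoder.

```python
import torch
import torch.nn as nn


class TinyViltEmbeddings(nn.Module):
    """Toy illustration of ViLT's input construction, not the real implementation."""

    def __init__(self, vocab_size=30522, hidden_size=768, patch_dim=3 * 32 * 32, modality_types=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)    # the "added" text embedding layer
        self.patch_projection = nn.Linear(patch_dim, hidden_size)       # ViT-style flattened 32x32 patches
        self.token_type_embeddings = nn.Embedding(modality_types, hidden_size)

    def forward(self, input_ids, patches):
        text = self.word_embeddings(input_ids)
        text = text + self.token_type_embeddings(torch.zeros_like(input_ids))
        image = self.patch_projection(patches)
        image_types = torch.ones(patches.shape[:2], dtype=torch.long, device=patches.device)
        image = image + self.token_type_embeddings(image_types)
        # One sequence for a standard ViT encoder to process jointly.
        return torch.cat([text, image], dim=1)
```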
I've defined the following head models (a usage sketch for one of them follows at the end of this description):
- ViltForMaskedLM
- ViltForVisualQuestionAnswering
- ViltForNaturalLanguageVisualReasoning
- ViltForImageRetrievalTextRetrieval (CLIP-like model)

To do:
- Add ViltForNaturalLanguageVisualReasoning to the tests. However, I do have a question here: it's the only model that requires config.modality_type_vocab_size = 3 instead of 2. How can I handle this exception in the tests? I could do it like this, but that's not ideal, as it would require overwriting each individual test.
Update: fixed by creating a separate ModelTester for this particular model that overrides get_config.
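To make the head models listed above a bit more concrete, here is a rough usage sketch for ViltForVisualQuestionAnswering; the checkpoint name and the processor workflow are assumptions about how the fine-tuned model will be published, not details given in this PR:

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForVisualQuestionAnswering

# Assumed checkpoint name for a VQAv2-fine-tuned ViLT.
checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForVisualQuestionAnswering.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
question = "How many cats are there?"

# Single forward pass over the (image, question) pair; the head is a
# classifier over the VQA answer vocabulary.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
predicted_idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_idx])
```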