
Add ViLT #14895

Merged
merged 59 commits into from
Jan 19, 2022
Conversation

@NielsRogge (Contributor) commented Dec 23, 2021:

What does this PR do?

This PR adds ViLT (Vision and Language Transformer).

It's a very nice, minimal multi-modal model, as it only adds a text embedding layer to an existing ViT.
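For readers unfamiliar with the model, the fusion it performs can be sketched at the shape level: text token embeddings and image patch embeddings share the same hidden size, each gets a learned modality-type embedding added, and the two sequences are simply concatenated and run through an unmodified ViT encoder. A toy, plain-Python illustration (all variable names are made up for this sketch, not taken from the actual implementation):

```python
# Shape-level sketch of ViLT's input fusion (illustrative only).

hidden = 768        # hidden size shared by both modalities
num_text = 16       # number of text tokens (length of input_ids)
num_patches = 197   # 14x14 patches + [CLS] for a 224x224 image, patch size 16

text_emb = [[0.0] * hidden for _ in range(num_text)]
image_emb = [[0.0] * hidden for _ in range(num_patches)]

# modality-type embeddings: one row per modality (vocab size 2 here;
# NLVR2 needs 3 because each example contains two images)
modality_type = [[0.1] * hidden, [0.2] * hidden]
text_emb = [[v + m for v, m in zip(row, modality_type[0])] for row in text_emb]
image_emb = [[v + m for v, m in zip(row, modality_type[1])] for row in image_emb]

# the fused sequence is what feeds a standard ViT encoder
fused = text_emb + image_emb
print(len(fused), len(fused[0]))  # 213 768
```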

I've defined the following head models:

  • ViltForMaskedLM
  • ViltForVisualQuestionAnswering
  • ViltForNaturalLanguageVisualReasoning
  • ViltForImageRetrievalTextRetrieval (CLIP-like model).

To do:

  • add ViltForNaturalLanguageVisualReasoning to the tests. However, I do have a question here: it's the only model that requires config.modality_type_vocab_size = 3 instead of 2. How can I handle this exception in the tests? I could do it like this:
for model_class in self.all_model_classes:
    config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
    config.return_dict = True

    if model_class.__name__ == "ViltForNaturalLanguageVisualReasoning":
        config.modality_type_vocab_size = 3

But that's not ideal, as it would require overwriting each individual test.

Update: fixed by creating a separate ModelTester for this particular model that overrides get_config.
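The fix follows the pattern of subclassing the tester and overriding only the config getter, so no individual test needs an if/else on the model class. A minimal illustrative sketch (simplified dict-based config and hypothetical class shapes, not the actual test code in the PR):

```python
class ViltModelTester:
    def get_config(self):
        # hypothetical minimal config; the real tests build a full ViltConfig
        return {"modality_type_vocab_size": 2, "hidden_size": 32}


class ViltForNaturalLanguageVisualReasoningModelTester(ViltModelTester):
    # inherits every prepare/check method; only the config differs
    def get_config(self):
        config = super().get_config()
        config["modality_type_vocab_size"] = 3  # NLVR2 sees two images per example
        return config


base = ViltModelTester().get_config()
nlvr = ViltForNaturalLanguageVisualReasoningModelTester().get_config()
print(base["modality_type_vocab_size"], nlvr["modality_type_vocab_size"])  # 2 3
```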

@NielsRogge NielsRogge requested a review from sgugger December 23, 2021 11:11
@NielsRogge NielsRogge mentioned this pull request Jan 8, 2022
@LysandreJik (Member) left a comment:

This looks good, thanks for working on it @NielsRogge!

I left a few comments, and would love @sgugger's review before this is merged.

Resolved review threads:

  • docs/source/index.rst
  • docs/source/model_doc/vilt.mdx (×3)
  • src/transformers/models/vilt/configuration_vilt.py
  • src/transformers/models/vilt/modeling_vilt.py (×3)
@@ -102,6 +102,9 @@
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
# models to ignore for model xxx mapping
"ViltForMaskedLM",
A reviewer (Member) commented:
AutoModelForMaskedLM should correctly return this, no?

@NielsRogge (Contributor, author) commented Jan 11, 2022:

I asked @Narsil about this, but the AutoModelForMaskedLM doesn't currently accept models that take several modalities as input. ViLT can take in pixel_values and input_ids, and you can mask out several input_ids, which the model needs to predict. However, the "fill-mask" pipeline currently only works for models that only take input_ids as input.

@Narsil (Contributor) commented Jan 11, 2022:

The fill-mask pipeline doesn't work with multiple modalities, and it assumes every AutoModelForMaskedLM is for filling text-only masks.
As I mentioned, too:

  • Either we don't make it AutoModelForMaskedLM
  • Or we make it AutoModelForMaskedLM, but we need an escape hatch of some kind so that the pipeline can know it's not supposed to work (or it works and simply doesn't use the image, or uses a purely padded image, something along those lines)
  • AutoModel should work nonetheless (I assume this discards that)
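One possible shape for such an escape hatch (purely illustrative; this is not an API that exists in transformers): the pipeline could inspect the model's forward signature and detect inputs beyond text, e.g. pixel_values. Sketch with dummy stand-in classes:

```python
import inspect

# Dummy stand-ins for a text-only and a ViLT-like multi-modal model
# (hypothetical classes, defined only for this sketch).

class TextOnlyModel:
    def forward(self, input_ids, attention_mask=None):
        ...

class ViltLikeModel:
    def forward(self, input_ids, pixel_values, attention_mask=None):
        ...

def is_text_only(model) -> bool:
    # bound method, so `self` is already excluded from the signature
    return "pixel_values" not in inspect.signature(model.forward).parameters

print(is_text_only(TextOnlyModel()))  # True
print(is_text_only(ViltLikeModel()))  # False
```

A pipeline could use such a check to raise a clear error (or substitute a padded image) instead of failing deep inside the forward pass.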

Resolved review thread: tests/test_modeling_vilt.py
@sgugger (Collaborator) commented Jan 10, 2022:

I'd like the PR to be green (or mostly green) before reviewing.

@NielsRogge force-pushed the modeling_vilt branch 2 times, most recently from 18c9637 to 4742d49 (January 11, 2022 13:43)
@NielsRogge (Contributor, author) commented:
@sgugger should be mostly green now.

@sgugger (Collaborator) left a comment:

Thanks a lot for adding this model!

Regarding your question for the tests, I think the easiest would be to define a new Tester and Test class for ViltForNaturalLanguageVisualReasoning that inherits from the main Tester and Test class this PR adds, then overrides the method to get the config, so that you don't have to rewrite all the tests.

Resolved review threads:

  • docs/source/_toctree.yml (×2)
  • docs/source/model_doc/vilt.mdx
[[autodoc]] ViltForVisualQuestionAnswering
- forward

## ViltForNaturalLanguageVisualReasoning
@sgugger (Collaborator) commented:

I have no idea what Natural Language Visual Reasoning means, so there is probably a better name to find here.

@NielsRogge (Contributor, author) commented:

It's because this model was fine-tuned on NLVR: https://lil.nlp.cornell.edu/nlvr/

@sgugger (Collaborator) commented:

It probably deserves a nice introduction in the docstring of that model.

Resolved review threads:

  • docs/source/model_doc/vilt.mdx
  • src/transformers/models/vilt/processing_vilt.py
@@ -326,12 +326,6 @@ def forward(self, hidden_states, head_mask=None, output_attentions=False):

# in ViT, layernorm is also applied after self-attention
layer_output = self.layernorm_after(hidden_states)

@sgugger (Collaborator) commented:

Same comment as for Beit above.

Resolved review threads:

  • tests/test_feature_extraction_vilt.py
  • tests/test_modeling_vilt.py (×2)
@sgugger (Collaborator) commented Jan 19, 2022:

Note: with the new build dev job merged, you can preview the doc here :-)

@NielsRogge NielsRogge merged commit ac22709 into huggingface:master Jan 19, 2022
@HuggingFaceDocBuilder commented:

Great job merging this PR! The documentation will now be removed from the staging environment.

5 participants