Add ViLT #14895
Conversation
This looks good, thanks for working on it @NielsRogge!
I left a few comments, and would love @sgugger's review before this is merged.
@@ -102,6 +102,9 @@
# should **not** be the rule.
IGNORE_NON_AUTO_CONFIGURED = PRIVATE_MODELS.copy() + [
    # models to ignore for model xxx mapping
    "ViltForMaskedLM",
AutoModelForMaskedLM should correctly return this, no?
I asked @Narsil about this, but AutoModelForMaskedLM doesn't currently accept models that take several modalities as input. ViLT takes in both pixel_values and input_ids, and you can mask out several input_ids, which the model then needs to predict. However, the "fill-mask" pipeline currently only works for models that take input_ids alone as input.
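For context, here is a rough sketch of what a masked-LM forward pass looks like for ViLT, where both modalities are required. The checkpoint name and the ViltProcessor workflow are assumptions about how the model will be published, not details stated in this thread:

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForMaskedLM

# Assumed checkpoint name; any ViLT masked-LM checkpoint would do.
checkpoint = "dandelin/vilt-b32-mlm"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForMaskedLM.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
text = "a bunch of [MASK] laying on a [MASK]."

# The processor produces both modalities; the model predicts the masked input_ids
# conditioned on the image, so a text-only pipeline can't drive it as-is.
encoding = processor(image, text, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, text sequence length, vocab_size)
```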
The fill-mask pipeline doesn't work with multiple modalities, and it assumes every AutoModelForMaskedLM is for filling text-only masks. As I mentioned too:
- Either we don't make it AutoMaskedLM,
- Or we make it AutoMaskedLM, but then we need an escape hatch of some kind so that the pipeline can know it's not supposed to work (or it works and simply doesn't use the image, or uses a fully padded image, or something along those lines).
AutoModel should work nonetheless (I assume this discards that).
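Purely as an illustration of what such an escape hatch could look like (this is a hypothetical sketch, not the actual FillMaskPipeline code), the pipeline could inspect the model's forward signature:

```python
import inspect


def accepts_text_only_masking(model) -> bool:
    """Hypothetical check: could the fill-mask pipeline drive this model with text alone?"""
    forward_params = inspect.signature(model.forward).parameters
    # A text-only masked-LM can be fed input_ids alone; a multi-modal model
    # such as ViLT also expects pixel_values, which the pipeline can't provide.
    return "pixel_values" not in forward_params
```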
I'd like the PR to be green (or mostly green) before reviewing.
(force-pushed from 18c9637 to 4742d49)
@sgugger should be mostly green now.
(force-pushed from afc75d2 to d66f5bf)
Thanks a lot for adding this model!
Regarding your question about the tests, I think the easiest would be to define a new Tester and Test class for ViltForNaturalLanguageVisualReasoning that inherits from the main Tester and Test class this PR adds, then override the method that gets the config, so that you don't have to rewrite all the tests.
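A minimal sketch of that suggestion, assuming the usual transformers test layout (the class names here are illustrative, not this PR's exact ones):

```python
from transformers import ViltConfig


class ViltModelTester:
    """Stand-in for the main tester class this PR adds."""

    def get_config(self):
        return ViltConfig(modality_type_vocab_size=2)


class ViltForNaturalLanguageVisualReasoningModelTester(ViltModelTester):
    """Inherits all test inputs and only overrides the config."""

    def get_config(self):
        config = super().get_config()
        # NLVR2 pairs two images with one sentence, hence the third modality type.
        config.modality_type_vocab_size = 3
        return config
```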
docs/source/model_doc/vilt.mdx (outdated)

[[autodoc]] ViltForVisualQuestionAnswering
    - forward

## ViltForNaturalLanguageVisualReasoning
I have no idea what Natural Language Visual Reasoning means, so there is probably a better name to find here.
It's because this model was fine-tuned on NLVR: https://lil.nlp.cornell.edu/nlvr/
It probably deserves a nice introduction in the docstring of that model.
@@ -326,12 +326,6 @@ def forward(self, hidden_states, head_mask=None, output_attentions=False):

    # in ViT, layernorm is also applied after self-attention
    layer_output = self.layernorm_after(hidden_states)
Same comment as for Beit above.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
(force-pushed from ef6b57c to 7a3b2bb)
Note: with the new build dev job merged, you can preview the doc here :-)
Great job merging this PR! The documentation will now be removed from the staging environment.
What does this PR do?
This PR adds ViLT (Vision and Language Transformer).
It's a very nice, minimal multi-modal model, as it only adds a text embedding layer to an existing ViT.
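To illustrate that point, here is a conceptual sketch, not the PR's actual module code (position embeddings and the CLS token are omitted): text tokens and image patches are embedded separately, tagged with a modality-type embedding, concatenated, and then processed jointly by a single ViT encoder.

```python
import torch
import torch.nn as nn


class TinyViltEmbeddings(nn.Module):
    """Toy illustration of ViLT's input construction, not the real implementation."""

    def __init__(self, vocab_size=30522, hidden_size=768, patch_dim=3 * 32 * 32, modality_types=2):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)    # the "added" text embedding layer
        self.patch_projection = nn.Linear(patch_dim, hidden_size)       # ViT-style flattened 32x32 patches
        self.token_type_embeddings = nn.Embedding(modality_types, hidden_size)

    def forward(self, input_ids, patches):
        text = self.word_embeddings(input_ids)
        text = text + self.token_type_embeddings(torch.zeros_like(input_ids))
        image = self.patch_projection(patches)
        image_types = torch.ones(patches.shape[:2], dtype=torch.long, device=patches.device)
        image = image + self.token_type_embeddings(image_types)
        # One sequence for a standard ViT encoder to process jointly.
        return torch.cat([text, image], dim=1)
```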
I've defined the following head models (a usage sketch for one of them follows at the end of this description):
- ViltForMaskedLM
- ViltForVisualQuestionAnswering
- ViltForNaturalLanguageVisualReasoning
- ViltForImageRetrievalTextRetrieval (CLIP-like model)

To do:
- Add ViltForNaturalLanguageVisualReasoning to the tests. However, I do have a question here: it's the only model that requires config.modality_type_vocab_size = 3 instead of 2. How can I handle this exception in the tests? I could do it like this, but that's not ideal, as it would require overwriting each individual test.
Update: fixed by creating a separate ModelTester for this particular model that overrides get_config.
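To make the head models listed above a bit more concrete, here is a rough usage sketch for ViltForVisualQuestionAnswering; the checkpoint name and the processor workflow are assumptions about how the fine-tuned model will be published, not details given in this PR:

```python
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForVisualQuestionAnswering

# Assumed checkpoint name for a VQAv2-fine-tuned ViLT.
checkpoint = "dandelin/vilt-b32-finetuned-vqa"
processor = ViltProcessor.from_pretrained(checkpoint)
model = ViltForVisualQuestionAnswering.from_pretrained(checkpoint)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
question = "How many cats are there?"

# Single forward pass over the (image, question) pair; the head is a
# classifier over the VQA answer vocabulary.
encoding = processor(image, question, return_tensors="pt")
outputs = model(**encoding)
predicted_idx = outputs.logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[predicted_idx])
```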