Split LMBert model in two #4874
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #4874      +/-   ##
==========================================
- Coverage   77.11%   76.43%   -0.69%
==========================================
  Files         128      128
  Lines       21651    21671      +20
==========================================
- Hits        16697    16564     -133
- Misses       4954     5107     +153
```
Continue to review full report at Codecov.
I think this is cool: it simplifies the `BertForMaskedLM` model and makes the API more coherent. However, this is a breaking change. The model now behaves differently, with part of its functionality removed. I'm not sure that breaking change is worth it.
It's possible to do it in a non-breaking way, with a deprecation warning if a not-None `lm_labels` is passed.
I'm fine with this PR. In my opinion, this change is necessary to have a clean separation between masked LM and causal LM (Reformer and Longformer will eventually run into the same issue). One thing which is problematic, though, is that the heavily used `AutoModelWithLMHead` currently maps each model to a single LM class. Now since Bert has both a causal model and a masked encoder model, we need two mappings. I would suggest here to create two new mappings, one for causal LM and one for masked LM.
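For illustration, a rough sketch of the two-mapping idea; the mapping names below are assumptions rather than names taken from this PR, and `BertLMHeadModel` is assumed to be exported from `transformers` as this PR does:

```python
from collections import OrderedDict

from transformers import BertConfig, BertForMaskedLM, BertLMHeadModel

# One mapping per objective, so a BERT config can resolve to either head.
MODEL_FOR_CAUSAL_LM_MAPPING = OrderedDict([(BertConfig, BertLMHeadModel)])
MODEL_FOR_MASKED_LM_MAPPING = OrderedDict([(BertConfig, BertForMaskedLM)])
```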
```python
@add_start_docstrings(
    """Bert Model with a `language modeling` head on top for CLM fine-tuning. """, BERT_START_DOCSTRING
)
class BertLMHeadModel(BertPreTrainedModel):
```
I think the `__init__` should now have an assert statement to make sure `config.is_decoder=True`, or should even set `config.is_decoder=True` itself, with a logging message saying that it does so. Using `BertLMHeadModel` without a causal mask does not make much sense IMO.

Also, `BertLayer` currently always initializes the cross-attention layer if `config.is_decoder=True`, but doesn't use it if no `encoder_hidden_states` are passed to `forward`, so there is not really a problem at the moment when using the model on its own (not in an encoder-decoder framework). BUT, this should be corrected with an additional check on `self.config.is_encoder_decoder` before adding the cross-attention layers.
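A minimal sketch of the first suggestion, assuming the pre-4.x module layout (`transformers.modeling_bert`) and treating the exact log message as a placeholder:

```python
import logging

from transformers.modeling_bert import BertModel, BertOnlyMLMHead, BertPreTrainedModel

logger = logging.getLogger(__name__)


class BertLMHeadModel(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Sketch: either assert decoder mode...
        #   assert config.is_decoder, "BertLMHeadModel needs config.is_decoder=True"
        # ...or set it and log that we did, since a causal mask is required for CLM.
        if not config.is_decoder:
            logger.info("Setting `config.is_decoder=True` so that a causal mask is used.")
            config.is_decoder = True

        self.bert = BertModel(config)
        self.cls = BertOnlyMLMHead(config)

        self.init_weights()
```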
Added an assert.
```python
        super().__init__(config)

        self.bert = BertModel(config)
        self.cls = BertOnlyMLMHead(config)
```
I think we could also do a check here that `self.config.is_decoder == False`.
We can't raise an exception here before the automodel logic in the encoder-decoder framework picks the right LM model (or the tests will fail).
OK, we could do this in a separate PR.
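A sketch of what that deferred check could look like, warning rather than raising so the current encoder-decoder automodel path keeps working; the helper name and message are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_decoder_config(config):
    # Hypothetical helper for BertForMaskedLM.__init__: only warn for now, since
    # raising would break the encoder-decoder tests mentioned above.
    if getattr(config, "is_decoder", False):
        logger.warning(
            "`BertForMaskedLM` expects `config.is_decoder=False` (bi-directional "
            "self-attention); use `BertLMHeadModel` for causal language modeling."
        )
```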
```python
        **kwargs
    ):
        encoder_model = BertModel(config)
        decoder_model = BertForMaskedLM(decoder_config)
```
Can we change to `BertLMHeadModel` in all the other tests here? I'm fine with this one being deleted, since I don't really see an application where one would train an encoder-decoder model on the masked LM objective.
Done.
Note that the tests using automodel pick the wrong decoder for now.
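For reference, a small sketch of the updated pairing, mirroring the quoted hunk; the configs are filled in as assumptions, and the API is the one in place at the time of this PR:

```python
from transformers import BertConfig, BertModel, BertLMHeadModel, EncoderDecoderModel

config = BertConfig()                          # encoder: plain bi-directional BERT
decoder_config = BertConfig(is_decoder=True)   # decoder: causal mask enabled

encoder_model = BertModel(config)
decoder_model = BertLMHeadModel(decoder_config)  # instead of BertForMaskedLM

model = EncoderDecoderModel(encoder=encoder_model, decoder=decoder_model)
```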
I agree with @patrickvonplaten on the need to split `BertForMaskedLM`.
I agree that having two additional mappings makes sense. I don't agree with renaming all causal models with language modeling heads, though.
In that case I would advocate to keep ...
In the short term, I would advocate only exposing the classical "masked-lm" flavour of BERT through `AutoModelWithLMHead` (as is done in this PR), and not even documenting/adding `BertLMHeadModel` to the automodel mapping. In the longer term, I'd be ok with creating the dedicated auto classes/mappings discussed above.
@julien-c as long as we do a major release for the AutoModel renaming, I'm all for this!
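Concretely, under that short-term proposal a BERT checkpoint keeps resolving to the masked-LM class through the existing auto class. A quick sketch (the checkpoint name is just an example):

```python
from transformers import AutoModelWithLMHead

model = AutoModelWithLMHead.from_pretrained("bert-base-uncased")
print(type(model).__name__)  # expected: BertForMaskedLM, not BertLMHeadModel
```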
Read this, very nice! For the encoder-decoder models, I think we need `BertWithCausalLM`.
I'm fine either way, I think you guys got all the important issues (backward compatibility versus cleanly building the future). I like what @patrickvonplaten and @julien-c are proposing.
Fixed conflicts and followed @julien-c's advice. @LysandreJik or @patrickvonplaten, could you do one final review just to make sure everything is fine to merge?
Let's not forget to write an explanation of why that breaking change was necessary in the release notes.
This currently breaks the encoder-decoder framework.
As discussed in #4711, the `BertForMaskedLM` model should be split in two to avoid having two different labels arguments: one model for causal LM, one for masked LM. This PR follows up on that and does the split.

It introduces a new `BertLMHeadModel` (also added to the `__init__` and the docs) with a test. As discussed, there is no deprecation warning if someone tries to use `lm_labels` in `BertForMaskedLM` (since it was experimental), but an error message telling the user to use `BertLMHeadModel`.

I did not add `BertLMHeadModel` to the automodel logic, since we probably want users to use causal models for this? Let me know if I should add it even if it's not the best model for that task.

I also removed `lm_labels` in the `EncoderDecoderModel`, since it was only there to support that argument in `BertForMaskedLM` (which then removes the corresponding test).
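To make the split concrete, a hedged usage sketch of the two resulting classes (toy input ids, randomly initialized models; tuple-style indexing of the outputs is assumed):

```python
import torch

from transformers import BertConfig, BertForMaskedLM, BertLMHeadModel

input_ids = torch.tensor([[101, 7592, 2088, 102]])  # toy token ids

# Masked LM: bi-directional encoder, used with the masked-token objective.
mlm = BertForMaskedLM(BertConfig())
mlm_scores = mlm(input_ids)[0]

# Causal LM: the new class, meant to be used with a causal mask (is_decoder=True).
clm = BertLMHeadModel(BertConfig(is_decoder=True))
clm_scores = clm(input_ids)[0]

print(mlm_scores.shape, clm_scores.shape)  # both: (1, 4, vocab_size)
```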