[EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer #6411
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #6411      +/-   ##
==========================================
- Coverage   79.77%   77.84%     -1.94%
==========================================
  Files         150      150
  Lines       27789    27826        +37
==========================================
- Hits        22170    21660       -510
- Misses       5619     6166       +547
Continue to review full report at Codecov.
Great to see the encoder/decoder framework expanded! Thanks for all the work!
src/transformers/modeling_roberta.py (outdated)
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
    **kwargs
Are you using those kwargs? If so, change the docstrings, since there are no legacy arguments here.
Removed them from the corresponding BERT model as well.
LGTM. Have people had good ROUGE scores with the approach of gluing two pretrained models together and fine-tuning for summarization?
Hmm, I think it's very new, so I'm not sure many people have tried out the framework yet. @patil-suraj - do you know if people work a lot with EncoderDecoder, by chance?
Seems like it - I've seen quite a few issues and questions (on the forum as well) regarding EncoderDecoder, but no one has reported good results yet.
Looks great. Thanks, @patrickvonplaten.
@sshleifer, I was thinking about the same thing. My guess is that the numbers won't be great because the cross-attention is randomly initialized?
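For context, here is a rough sketch of what that "glue two pretrained models and fine-tune" recipe might look like with the EncoderDecoder framework. The checkpoint name roberta-base, the toy texts, and the config fields set below are illustrative assumptions, not anything prescribed by this PR:

```python
from transformers import EncoderDecoderModel, RobertaTokenizer

# Warm-start a Roberta2Roberta model: encoder and decoder weights come from
# roberta-base, but the decoder's cross-attention layers are newly (randomly) initialized.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("roberta-base", "roberta-base")

# Seq2seq training/generation needs these special token ids on the shared config.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One toy training step on a (document, summary) pair.
model.train()
inputs = tokenizer("A very long article ...", return_tensors="pt", truncation=True)
labels = tokenizer("A short summary.", return_tensors="pt", truncation=True)["input_ids"]

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=labels,
    labels=labels,
)
loss = outputs[0]  # cross-entropy loss over the summary tokens
loss.backward()
```

As noted above, the cross-attention weights are newly initialized by from_encoder_decoder_pretrained, which would explain weak ROUGE scores before fine-tuning.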
Great, LGTM!
@add_start_docstrings(
    """RoBERTa Model with a `language modeling` head on top for CLM fine-tuning. """, ROBERTA_START_DOCSTRING
)
class RobertaForCausalLM(BertPreTrainedModel):
Wouldn't it be more coherent to have it as RobertaLMHeadModel?
(But I do prefer RobertaForCausalLM.)
The naming is the same as for BERT.
Following internal discussion, we will leave the name as is, since it is more precise, and BertLMHeadModel should change in the future.
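For reference, here is roughly how the new RobertaForCausalLM would be used on its own, as a sketch along the lines of the existing BertLMHeadModel usage; the checkpoint name and example text are placeholders:

```python
import torch
from transformers import RobertaConfig, RobertaForCausalLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# RobertaForCausalLM is meant to run as a decoder, so the config needs is_decoder=True
# (this enables the causal mask on self-attention).
config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True
model = RobertaForCausalLM.from_pretrained("roberta-base", config=config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
prediction_logits = outputs[0]  # next-token prediction scores: (batch, seq_len, vocab_size)
```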
    labels=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    labels=None,
good catch
src/transformers/modeling_roberta.py (outdated)
    kwargs (:obj:`Dict[str, any]`, optional, defaults to `{}`):
        Used to hide legacy arguments that have been deprecated.
No need for this I think
Btw, this paper does some great analysis on reusing checkpoints for Seq2Seq models: https://arxiv.org/pdf/1907.12461.pdf
Revert "[EncoderDecoder] Add encoder-decoder for roberta/ vanilla longformer (huggingface#6411)" - This reverts commit 8b8f41f.
This PR adds RoBERTa to the EncoderDecoder framework. It thus automatically becomes possible to use both Roberta2Roberta models and Longformer2Roberta models (see the sketch below).
Also pinging @ibeltagy and @patil-suraj
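A minimal sketch of the two combinations this enables, assuming the public roberta-base and allenai/longformer-base-4096 checkpoints (any compatible checkpoints can be substituted):

```python
from transformers import EncoderDecoderModel

# RoBERTa encoder + RoBERTa decoder (the decoder's cross-attention layers are newly initialized)
roberta2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "roberta-base", "roberta-base"
)

# Longformer encoder for long inputs + RoBERTa decoder
longformer2roberta = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "allenai/longformer-base-4096", "roberta-base"
)
```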