Add RemBERT model code to huggingface #10692
Conversation
@LysandreJik I've been mostly following: https://huggingface.co/transformers/add_new_model.html so far. I also see that the doc mentions that there is no fast version for sentencepiece, which this model uses. Is that the case, given that T5 seems to have one? Edit: I seem to have found a way to add a FastTokenizer version; the doc still seems out of sync.
For the TF code, I'm struggling a bit to initialize an output embedding layer in TFRemBertLMPredictionHead.
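For context, here is a minimal, hedged sketch (not the PR's actual code) of how an untied output-embedding layer can be built lazily in Keras: RemBERT up-projects from a small input embedding size, so the decoder weight cannot simply reuse the input embedding matrix. The config attributes referenced (vocab_size, output_embedding_size) are assumptions for illustration.

```python
import tensorflow as tf

class OutputEmbeddingHeadSketch(tf.keras.layers.Layer):
    """Illustrative untied output-embedding head; not the final TFRemBertLMPredictionHead."""

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        # Assumed config attributes, for illustration only.
        self.vocab_size = config.vocab_size
        self.output_embedding_size = config.output_embedding_size

    def build(self, input_shape):
        # Create a separate decoder matrix the first time the layer is called,
        # instead of tying it to the (smaller) input embedding weights.
        self.decoder = self.add_weight(
            name="decoder/kernel",
            shape=(self.output_embedding_size, self.vocab_size),
            initializer="glorot_uniform",
        )
        self.bias = self.add_weight(name="bias", shape=(self.vocab_size,), initializer="zeros")
        super().build(input_shape)

    def call(self, hidden_states):
        # hidden_states are assumed to already be projected to output_embedding_size.
        logits = tf.matmul(hidden_states, self.decoder)
        return tf.nn.bias_add(logits, self.bias)
```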
@LysandreJik: There are still some things to iron out, but I think this is ready for a first look. What is missing:
What I'd like some input on:
Top-level models were for legacy (historical) integrations and we now namespace all models. If this work was conducted at Google, then yes, google is the right namespace! Do you want us to add you to the org?
Hi @Iwontbecreative, thanks a lot for your contribution - In addition to what Julien said:
I'm having some issues with the TFRemBertLMPredictionHead implementation.
We'd be happy to help you on that front; we'll take a look ASAP.
I'm finding a discrepancy between this implementation and the original tf one.
A difference on the order of 1e-3 doesn't look too bad, but looking at the integration test you have provided, it seems that the difference is noticeable. Is it possible that a bias is missing, or something to do with attention masks?
If it proves impossible to get the two implementations closer to each other, then we'll rely on fine-tuning tests: if we can obtain similar results on the same dataset with the two implementations, then we'll be good to go.
Not impossible, but given that the transformer section is simply BERT, I doubt it. Also, it seems like that would change the results more.
I've tried to do that for a bit; unfortunately it is hard to fine-tune this model on XNLI in a Colab (training gets interrupted too early on). I will try to see if I can get a better fine-tuning setup.
That would be helpful, though I'm no longer affiliated with Google, so I'm not sure what the usual policy is there. If it is OK, that will be easier than having to send the checkpoints to @hwchung so that he can upload them.
Ultimately the org admins should decide, but for now I think it's perfectly acceptable if you're a part of the org. I added you manually.
@Iwontbecreative I opened a PR on your branch that should fix all the failing tests here: Iwontbecreative#1. I've separated each test suite (TF, style, docs) into three commits if you want to have a look at smaller portions at a time.
Thanks Lysandre. I have not forgotten about this issue; I just need one more approval from Google to open-source the checkpoint, so I am waiting for that.
Sure @Iwontbecreative, let us know when all is good!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Reopen: this is still in progress; I just got the open-sourcing done from Google last Friday. I am travelling for a few days and then can tackle this. See: https://github.com/google-research/google-research/tree/master/rembert
That's great news @Iwontbecreative!
(Finally) some updates on this now that the Google open-sourcing is done.
Main issue that still exists: the discrepancy between my implementation and the TF one. Welcoming ideas on this one. The code should be easier for everyone to test now that the TF version is available and mine is uploaded to the model hub. @LysandreJik, I think this is ready for another look. Thanks for the patience here!
Welcome back! We'll take a look shortly. Do you have an example of code to run to quickly compare this version with the TF one?
Sadly, the TensorFlow code to run the model is not available externally. The main change to the modelling code is here: I do, however, have example inputs and outputs run by my co-author, covering model outputs (example modelling outputs at several layers for different input_ids) and tokenization outputs.
Thanks! In that case, we'll try to reproduce the fine-tuning results to ensure that the implementation is indeed correct.
There are a few # Copied from ... statements missing; could you add them? I've put some proposals where some of them should be; there should be more in the PyTorch/TensorFlow implementations.
Thank you!
Regarding fine-tuning, here is what I was able to run, comparing performance on XNLI: https://docs.google.com/spreadsheets/d/1gWWSLo7XxEZkXpX272tQoZBXTgs96IFvh-fwqVqihM0/edit#gid=0 Performance matches in English but does seem to be lower on other languages. We used more hyperparameter tuning at Google, but I do not think that explains the whole difference for those languages. I think there might be a subtle difference that is causing both the slightly different results and the worse fine-tuning outcomes. The model is still much better than random, so most of it should be there.
Performance does look pretty similar, and good enough for me to merge it. There are a few …
Hi @LysandreJik, seems like it is mostly ready, though tests fail at the …
Actually, I managed to find the issue.
I also renamed rembert-large to rembert since this is the only version we are open-sourcing at this time. Edit: Not sure why the model templates check is failing, but I think this should be ready for merge with one last review.
Fantastic, thanks a lot @Iwontbecreative! I'll take a final look, fix the model templates issue and ping another reviewer.
This looks great! Thanks for adding all the # Copied from statements. I just checked locally and I believe the model templates error is unrelated to your PR - I'll have to check, but running the model templates locally yields no error, so I think we can merge this PR without this test passing and we'll make sure it does pass on master (or fix it if it doesn't).
Pinging other reviewers to give your PR a second look.
Thanks for the great work!
src/transformers/models/rembert/convert_rembert_tf_checkpoint_to_pytorch.py
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
(nit) This also applies to other models, but I'm not really a fan of registering the position ids here as a parameter. I think if they aren't passed they should be generated on the fly. Logically, it just doesn't make too much sense to have them in the weight dictionary in PyTorch IMO. In Flax, one would never register this input as a parameter, so at the moment these parameters are reported missing when loading such a model in Flax. cc @LysandreJik @sgugger
For newer additions, this buffer could be registered as a non-persistent buffer like it was done in BERT (only for torch 1.6+):
transformers/src/transformers/models/bert/modeling_bert.py
Lines 183 to 188 in 546dc24
if version.parse(torch.__version__) > version.parse("1.6.0"):
    self.register_buffer(
        "token_type_ids",
        torch.zeros(self.position_ids.size(), dtype=torch.long, device=self.position_ids.device),
        persistent=False,
    )
Without this the models cannot be traced with torchscript, and I believe cannot be used for other inference engines such as onnxruntime, if I recall (cc @mfuntowicz).
Just making sure, is @LysandreJik's pointer what I should do for the code (replacing with torch.arange)?
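For reference, a hedged sketch of the alternative Patrick describes (generating the position ids on the fly instead of registering them as a buffer); the helper name and signature below are illustrative, not part of the PR.

```python
import torch

def make_position_ids(input_ids: torch.Tensor, past_key_values_length: int = 0) -> torch.Tensor:
    # Build position ids at call time so they never appear in the state dict.
    batch_size, seq_length = input_ids.shape
    position_ids = torch.arange(
        past_key_values_length,
        past_key_values_length + seq_length,
        dtype=torch.long,
        device=input_ids.device,
    )
    return position_ids.unsqueeze(0).expand(batch_size, -1)
```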
def __init__(self, config):
    super().__init__(config)

    config.num_labels = 2
Think it's better to not change the config here silently, but rather just set self.num_labels = 2. A model that is loaded with this class and then saved again will have its config changed silently, no?
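As an illustration of this suggestion, here is a minimal sketch (not the PR code) of keeping the label count on the module instead of mutating the user-supplied config:

```python
import torch.nn as nn

class QuestionAnsweringHeadSketch(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Span-extraction QA always predicts start/end logits, so fix the value
        # locally rather than writing config.num_labels = 2, which would also be
        # written back out if the model's config is saved again.
        self.num_labels = 2
        self.qa_outputs = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, sequence_output):
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```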
This is a good point, and I have removed this line.
However, I think the issue comes from the cookiecutter template (see modeling_{{cookiecutter.lowercase_modelname}}.py, lines 1467 and 3010). As such, mbart, roformer, led, bat, big_bird, and big_bird_pegasus are all impacted by this.
    ]
)

# Running on the original tf implementation gives slightly different results here.
Hmm, it would be nice to have an integration test with 1e-3 tolerance only. IMO, it would be important to find out what the difference is before merging.
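For illustration, a hedged sketch of the kind of integration test being asked for: run the PyTorch port on fixed input ids and compare a slice of the hidden states against reference values recorded from the original TF implementation, with a 1e-3 tolerance. The input ids below are arbitrary placeholders, and the reference slice would come from the TF dumps.

```python
import torch
from transformers import RemBertModel

def matches_tf_reference(expected_slice: torch.Tensor, atol: float = 1e-3) -> bool:
    # expected_slice: a (3, 3) tensor of values dumped from the original TF model.
    model = RemBertModel.from_pretrained("google/rembert")
    model.eval()
    input_ids = torch.tensor([[312, 56498, 313, 2961, 313]])  # placeholder ids
    with torch.no_grad():
        hidden = model(input_ids).last_hidden_state
    return torch.allclose(hidden[0, :3, :3], expected_slice, atol=atol)
```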
Yes, I agree and have tried a few things to bridge the gap, including running things layer by layer.
We know that the discrepancy starts from the first layer onwards. Results are equal for the TF and HF versions after we up-project the embeddings, but differ after the first layer.
Unfortunately, I no longer have access to the TF codebase inside Google, and it has been hard to get sublayer outputs to understand which op exactly explains the difference. I also tried to run a home-baked version of the TF codebase but had trouble installing an old enough version of TF and running it. Ultimately, I mostly looked at the fine-tuning results, which are good, though there is still a gap there. @hwchung might be able to help, but he has been pretty busy lately.
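A hedged sketch of the layer-by-layer comparison described above: collect every intermediate hidden state from the PyTorch port so each layer's output can be diffed against whatever per-layer dumps are available from the TF side (the input ids here are placeholders).

```python
import torch
from transformers import RemBertModel

model = RemBertModel.from_pretrained("google/rembert")
model.eval()
input_ids = torch.tensor([[312, 56498, 313, 2961, 313]])  # placeholder ids
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
# outputs.hidden_states holds the embedding output followed by each encoder
# layer's output; printing a small slice per layer makes it easy to spot the
# first layer at which the two implementations diverge.
for layer_index, hidden in enumerate(outputs.hidden_states):
    print(layer_index, hidden[0, 0, :5])
```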
Looks very good to me already! Thanks a lot for your work @Iwontbecreative!
I'm a bit concerned about the integration test not giving identical results, and IMO it would be worth finding out exactly why before merging.
One other thing that might be cleaner would be to remove the position_embedding_type logic in all modules, at the cost of having to remove # Copied from statements (interested in hearing @LysandreJik and @sgugger's opinion on this).
Thanks a lot for adding those new models. It looks like the template you used was a bit outdated; I left quite a few suggestions to fix everything and have it up to date :-)
Also, the checkpoint used everywhere was rembert when it should be google/rembert.
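For readers following along, a short illustrative snippet of loading the namespaced checkpoint discussed above (assuming the SentencePiece-based tokenizer dependencies are installed):

```python
from transformers import AutoTokenizer, RemBertModel

tokenizer = AutoTokenizer.from_pretrained("google/rembert")
model = RemBertModel.from_pretrained("google/rembert")

inputs = tokenizer("RemBERT decouples input and output embedding sizes.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```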
@patrickvonplaten Thanks for the helpful feedback. Incorporated most of it. See the comment on possible needed changes to the cookiecutter templates to address one of your comments in the future. @sgugger Regarding the older template: yes, this PR ended up being delayed due to the slow open-sourcing process at Google, so the templates were a bit out of date. Thanks for catching most of the mistakes.
Hi @LysandreJik, any last remaining steps before this can be merged? Would like to get this in to avoid further rebases if possible.
I think this can be merged - thanks for your effort @Iwontbecreative, fantastic addition!
I'm not entirely sure why there were 88 authors involved or 250 commits squashed into a single one - but I did verify that only your changes were merged. Could you let me know how you handled the merge/rebasing of this branch so that I may understand what happened w.r.t. the number of commits included?
I think I just merged master's changes into my branch to ensure it was up to date with upstream. Maybe I needed to rebase?
Hi @Iwontbecreative, thanks for adding the RemBERT model! Do you have a list of the 110 languages used in the pretraining of the model?
Sure, here's the list: ['af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'bs', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Hans', 'zh-Hant', 'zh-Latn', 'zu'] cf. https://github.com/google-research/google-research/tree/master/rembert
Add RemBERT model to Huggingface (https://arxiv.org/abs/2010.12821).
This adds code to support the RemBERT model in Huggingface.
In terms of implementation, this is roughly a scaled-up version of mBERT with ALBERT-like factorized embeddings and tokenizer.
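As a rough illustration of the ALBERT-like factorization mentioned above, here is a minimal sketch: a small input embedding dimension that is linearly up-projected to the model's hidden size before the Transformer stack (the dimensions below are illustrative, not the exact RemBERT configuration).

```python
import torch
import torch.nn as nn

class FactorizedEmbeddingSketch(nn.Module):
    def __init__(self, vocab_size=250_000, input_embedding_size=256, hidden_size=1152):
        super().__init__()
        # Factorization: a narrow embedding table plus an up-projection keeps the
        # embedding parameter count small even with a very large vocabulary.
        self.word_embeddings = nn.Embedding(vocab_size, input_embedding_size)
        self.up_projection = nn.Linear(input_embedding_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.up_projection(self.word_embeddings(input_ids))
```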
Still needs to be done:
Fixes #9711
Who can review?
@LysandreJik seems appropriate here.