Add RemBERT model code to huggingface #10692
Conversation
@LysandreJik I've been mostly following: https://huggingface.co/transformers/add_new_model.html so far. I also see that the doc mentions that there is no fast version for sentencepiece, which this model uses. Is that the case, given that T5 seems to have one? Edit: I seem to have found a way to add a FastTokenizer version; the doc still seems out of sync.
For the TF code, I'm struggling a bit to initialize an output embedding layer in TFRemBertLMPredictionHead.
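For context, here is a minimal, hedged sketch (not the PR's actual code) of how an untied output-embedding layer can be built lazily in Keras: RemBERT up-projects from a small input embedding size, so the decoder weight cannot simply reuse the input embedding matrix. The config attributes referenced (vocab_size, output_embedding_size) are assumptions for illustration.

```python
import tensorflow as tf

class OutputEmbeddingHeadSketch(tf.keras.layers.Layer):
    """Illustrative untied output-embedding head; not the final TFRemBertLMPredictionHead."""

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        # Assumed config attributes, for illustration only.
        self.vocab_size = config.vocab_size
        self.output_embedding_size = config.output_embedding_size

    def build(self, input_shape):
        # Create a separate decoder matrix the first time the layer is called,
        # instead of tying it to the (smaller) input embedding weights.
        self.decoder = self.add_weight(
            name="decoder/kernel",
            shape=(self.output_embedding_size, self.vocab_size),
            initializer="glorot_uniform",
        )
        self.bias = self.add_weight(name="bias", shape=(self.vocab_size,), initializer="zeros")
        super().build(input_shape)

    def call(self, hidden_states):
        # hidden_states are assumed to already be projected to output_embedding_size.
        logits = tf.matmul(hidden_states, self.decoder)
        return tf.nn.bias_add(logits, self.bias)
```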
@LysandreJik: There are still some things to iron out, but I think this is ready for a first look. What is missing:
What I'd like some input on:
Top-level models were for legacy (historical) integrations and we now namespace all models. If this work was conducted at Google, then yes, google is the right namespace! Do you want us to add you to the org?
Hi @Iwontbecreative, thanks a lot for your contribution - In addition to what Julien said:
I'm having some issues with the TFRemBertLMPredictionHead implementation.
We'd be happy to help you on that front; we'll take a look ASAP.
I'm finding a discrepancy between this implementation and the original tf one.
A difference on the order of 1e-3 doesn't look too bad, but looking at the integration test you have provided, it seems that the difference is noticeable. Is it possible that a bias is missing, or something to do with attention masks?
If it proves impossible to get the two implementations closer to each other, then we'll rely on fine-tuning tests: if we can obtain similar results on the same dataset with the two implementations, then we'll be good to go.
Not impossible, but given that the transformer section is simply BERT, I doubt it. Also, it seems like that would change the results more.
I've tried to do that for a bit; unfortunately it is hard to fine-tune this model on XNLI in a Colab (training gets interrupted too early on). I will try to see if I can get a better fine-tuning setup.
That would be helpful, though I'm no longer affiliated with Google, so I'm not sure what the usual policy is there. If it is OK, that will be easier than having to send the checkpoints to @hwchung so that he can upload them.
Ultimately the org admins should decide, but for now I think it's perfectly acceptable if you're a part of the org. I added you manually.
@Iwontbecreative I opened a PR on your branch that should fix all the failing tests here: Iwontbecreative#1. I've separated each test suite (TF, style, docs) into three commits if you want to have a look at smaller portions at a time.
Thanks Lysandre. I have not forgotten about this issue; I just need one more approval from Google to open-source the checkpoint, so I am waiting for that.
Sure @Iwontbecreative, let us know when all is good!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Reopen: this is still in progress; I just got the open-sourcing done from Google last Friday. I am travelling for a few days and then can tackle this. See: https://github.com/google-research/google-research/tree/master/rembert
That's great news @Iwontbecreative!
(Finally) some updates on this now that the Google open-sourcing is done.
Main issue that still exists: the discrepancy between my implementation and the TF one. Welcoming ideas on this one. The code should be easier for everyone to test now that the TF version is available and mine is uploaded to the model hub. @LysandreJik, I think this is ready for another look. Thanks for the patience here!
Welcome back! We'll take a look shortly. Do you have an example of code to run to quickly compare this version with the TF one?
Sadly, the TensorFlow code to run the model is not available externally. The main change to the modelling code is here: I do, however, have example inputs and outputs run by my co-author, covering model outputs (example modelling outputs at several layers for different input_ids) and tokenization outputs.
Thanks! In that case, we'll try to reproduce the fine-tuning results to ensure that the implementation is indeed correct.
There are a few # Copied from ... statements missing; could you add them? I've put some proposals where some of them should be; there should be more in the PyTorch/TensorFlow implementations.
Thank you!
Regarding fine-tuning, here is what I was able to run, comparing performance on XNLI: https://docs.google.com/spreadsheets/d/1gWWSLo7XxEZkXpX272tQoZBXTgs96IFvh-fwqVqihM0/edit#gid=0 Performance matches in English but does seem to be lower on other languages. We used more hyperparameter tuning at Google, but I do not think that explains the whole difference for those languages. I think there might be a subtle difference that is causing both the slightly different results and the worse fine-tuning outcomes. The model is still much better than random, so most of it should be there.
Performance does look pretty similar, and good enough for me to merge it. There are a few …
Hi @LysandreJik, seems like it is mostly ready, though tests fail at the …
Actually, I managed to find the issue.
I also renamed rembert-large to rembert since this is the only version we are open-sourcing at this time. Edit: Not sure why the model templates check is failing, but I think this should be ready for merge with one last review.
Fantastic, thanks a lot @Iwontbecreative! I'll take a final look, fix the model templates issue and ping another reviewer.
This looks great! Thanks for adding all the # Copied from statements. I just checked locally and I believe the model templates error is unrelated to your PR - I'll have to check, but running the model templates locally yields no error, so I think we can merge this PR without this test passing and we'll make sure it does pass on master (or fix it if it doesn't).
Pinging other reviewers to give your PR a second look.
Thanks for the great work!
src/transformers/models/rembert/convert_rembert_tf_checkpoint_to_pytorch.py
self.dropout = nn.Dropout(config.hidden_dropout_prob)

# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
(nit) This also applies to other models, but I'm not really a fan of registering the position ids here as a parameter. I think if they aren't passed they should be generated on the fly. Logically, it just doesn't make too much sense to have them in the weight dictionary in PyTorch IMO. In Flax, one would never register this input as a parameter, so at the moment these parameters are reported missing when loading such a model in Flax. cc @LysandreJik @sgugger
For newer additions, this buffer could be registered as a non-persistent buffer like it was done in BERT (only for torch 1.6+):
transformers/src/transformers/models/bert/modeling_bert.py
Lines 183 to 188 in 546dc24
if version.parse(torch.__version__) > version.parse("1.6.0"):
    self.register_buffer(
        "token_type_ids",
        torch.zeros(self.position_ids.size(), dtype=torch.long, device=self.position_ids.device),
        persistent=False,
    )
Without this the models cannot be traced with torchscript, and I believe cannot be used for other inference engines such as onnxruntime, if I recall (cc @mfuntowicz).
Just making sure, is @LysandreJik's pointer what I should do for the code (replacing with torch.arange)?
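For reference, a hedged sketch of the alternative Patrick describes (generating the position ids on the fly instead of registering them as a buffer); the helper name and signature below are illustrative, not part of the PR.

```python
import torch

def make_position_ids(input_ids: torch.Tensor, past_key_values_length: int = 0) -> torch.Tensor:
    # Build position ids at call time so they never appear in the state dict.
    batch_size, seq_length = input_ids.shape
    position_ids = torch.arange(
        past_key_values_length,
        past_key_values_length + seq_length,
        dtype=torch.long,
        device=input_ids.device,
    )
    return position_ids.unsqueeze(0).expand(batch_size, -1)
```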
def __init__(self, config):
    super().__init__(config)

    config.num_labels = 2
Think it's better to not change the config here silently, but rather just set self.num_labels = 2. A model that is loaded with this class and then saved again will have its config changed silently, no?
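As an illustration of this suggestion, here is a minimal sketch (not the PR code) of keeping the label count on the module instead of mutating the user-supplied config:

```python
import torch.nn as nn

class QuestionAnsweringHeadSketch(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Span-extraction QA always predicts start/end logits, so fix the value
        # locally rather than writing config.num_labels = 2, which would also be
        # written back out if the model's config is saved again.
        self.num_labels = 2
        self.qa_outputs = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, sequence_output):
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```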
This is a good point, and I have removed this line.
However, I think the issue comes from the cookiecutter template (see modeling_{{cookiecutter.lowercase_modelname}}.py, lines 1467 and 3010). As such, mbart, roformer, led, bat, big_bird, and big_bird_pegasus are all impacted by this.
    ]
)

# Running on the original tf implementation gives slightly different results here.
Hmm, it would be nice to have an integration test with 1e-3 tolerance only. IMO, it would be important to find out what the difference is before merging.
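For illustration, a hedged sketch of the kind of integration test being asked for: run the PyTorch port on fixed input ids and compare a slice of the hidden states against reference values recorded from the original TF implementation, with a 1e-3 tolerance. The input ids below are arbitrary placeholders, and the reference slice would come from the TF dumps.

```python
import torch
from transformers import RemBertModel

def matches_tf_reference(expected_slice: torch.Tensor, atol: float = 1e-3) -> bool:
    # expected_slice: a (3, 3) tensor of values dumped from the original TF model.
    model = RemBertModel.from_pretrained("google/rembert")
    model.eval()
    input_ids = torch.tensor([[312, 56498, 313, 2961, 313]])  # placeholder ids
    with torch.no_grad():
        hidden = model(input_ids).last_hidden_state
    return torch.allclose(hidden[0, :3, :3], expected_slice, atol=atol)
```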
Yes, I agree and have tried a few things to bridge the gap, including running things layer by layer.
We know that the discrepancy starts from the first layer onwards. Results are equal for the TF and HF versions after we up-project the embeddings, but differ after the first layer.
Unfortunately, I no longer have access to the TF codebase inside Google, and it has been hard to get sublayer outputs to understand which op exactly explains the difference. I also tried to run a home-baked version of the TF codebase but had trouble installing an old enough version of TF and running it. Ultimately, I mostly looked at the fine-tuning results, which are good, though there is still a gap there. @hwchung might be able to help, but he has been pretty busy lately.
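A hedged sketch of the layer-by-layer comparison described above: collect every intermediate hidden state from the PyTorch port so each layer's output can be diffed against whatever per-layer dumps are available from the TF side (the input ids here are placeholders).

```python
import torch
from transformers import RemBertModel

model = RemBertModel.from_pretrained("google/rembert")
model.eval()
input_ids = torch.tensor([[312, 56498, 313, 2961, 313]])  # placeholder ids
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
# outputs.hidden_states holds the embedding output followed by each encoder
# layer's output; printing a small slice per layer makes it easy to spot the
# first layer at which the two implementations diverge.
for layer_index, hidden in enumerate(outputs.hidden_states):
    print(layer_index, hidden[0, 0, :5])
```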
Looks very good to me already! Thanks a lot for your work @Iwontbecreative!
I'm a bit concerned about the integration test not giving identical results, and IMO it would be worth finding out exactly why before merging.
One other thing that might be cleaner would be to remove the position_embedding_type logic in all modules, at the cost of having to remove # Copied from statements (interested in hearing @LysandreJik and @sgugger's opinion on this).
Thanks a lot for adding those new models. It looks like the template you used was a bit outdated; I left quite a few suggestions to fix everything and have it up to date :-)
Also, the checkpoint used everywhere was rembert when it should be google/rembert.
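For readers following along, a short illustrative snippet of loading the namespaced checkpoint discussed above (assuming the SentencePiece-based tokenizer dependencies are installed):

```python
from transformers import AutoTokenizer, RemBertModel

tokenizer = AutoTokenizer.from_pretrained("google/rembert")
model = RemBertModel.from_pretrained("google/rembert")

inputs = tokenizer("RemBERT decouples input and output embedding sizes.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```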
@patrickvonplaten Thanks for the helpful feedback. Incorporated most of it. See the comment on possible needed changes to the cookiecutter templates to address one of your comments in the future. @sgugger Regarding the older template: yes, this PR ended up being delayed due to the slow open-sourcing process at Google, so the templates were a bit out of date. Thanks for catching most of the mistakes.
Hi @LysandreJik, any last remaining steps before this can be merged? Would like to get this in to avoid further rebases if possible.
I think this can be merged - thanks for your effort @Iwontbecreative, fantastic addition!
I'm not entirely sure why there were 88 authors involved or 250 commits squashed into a single one - but I did verify that only your changes were merged. Could you let me know how you handled the merge/rebasing of this branch so that I may understand what happened w.r.t. the number of commits included?
I think I just merged master's changes into my branch to ensure it was up to date with upstream. Maybe I needed to rebase?
Hi @Iwontbecreative, thanks for adding the RemBERT model! Do you have a list of the 110 languages used in the pretraining of the model?
Sure, here's the list: ['af', 'am', 'ar', 'az', 'be', 'bg', 'bg-Latn', 'bn', 'bs', 'ca', 'ceb', 'co', 'cs', 'cy', 'da', 'de', 'el', 'el-Latn', 'en', 'eo', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'fy', 'ga', 'gd', 'gl', 'gu', 'ha', 'haw', 'hi', 'hi-Latn', 'hmn', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'iw', 'ja', 'ja-Latn', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'ku', 'ky', 'la', 'lb', 'lo', 'lt', 'lv', 'mg', 'mi', 'mk', 'ml', 'mn', 'mr', 'ms', 'mt', 'my', 'ne', 'nl', 'no', 'ny', 'pa', 'pl', 'ps', 'pt', 'ro', 'ru', 'ru-Latn', 'sd', 'si', 'sk', 'sl', 'sm', 'sn', 'so', 'sq', 'sr', 'st', 'su', 'sv', 'sw', 'ta', 'te', 'tg', 'th', 'tr', 'uk', 'ur', 'uz', 'vi', 'xh', 'yi', 'yo', 'zh', 'zh-Hans', 'zh-Hant', 'zh-Latn', 'zu'] cf. https://github.com/google-research/google-research/tree/master/rembert
Add RemBERT model to Huggingface (https://arxiv.org/abs/2010.12821).
This adds code to support the RemBERT model in Huggingface.
In terms of implementation, this is roughly a scaled-up version of mBERT with ALBERT-like factorized embeddings and tokenizer.
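As a rough illustration of the ALBERT-like factorization mentioned above, here is a minimal sketch: a small input embedding dimension that is linearly up-projected to the model's hidden size before the Transformer stack (the dimensions below are illustrative, not the exact RemBERT configuration).

```python
import torch
import torch.nn as nn

class FactorizedEmbeddingSketch(nn.Module):
    def __init__(self, vocab_size=250_000, input_embedding_size=256, hidden_size=1152):
        super().__init__()
        # Factorization: a narrow embedding table plus an up-projection keeps the
        # embedding parameter count small even with a very large vocabulary.
        self.word_embeddings = nn.Embedding(vocab_size, input_embedding_size)
        self.up_projection = nn.Linear(input_embedding_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.up_projection(self.word_embeddings(input_ids))
```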
Still needs to be done:
Fixes #9711
Who can review?
@LysandreJik seems appropriate here.