Full rework of the TF input/output embeddings and bias resizing #9193
Conversation
- def get_input_embeddings(self) -> tf.keras.layers.Layer:
+ def get_input_embeddings(self) -> tf.Variable:
That seems like a breaking change, doesn't it? We're not returning the layer anymore, but the weights of that layer. Why is this change necessary?
And the `get_output_embeddings` method still returns a `tf.keras.layers.Layer`.
They all return a `tf.Variable` that holds the embeddings/bias weights. This is now necessary in order to be sure to have the proper attribute instead of guessing whether it is `word_embeddings`, `weights`, `bias`, `decoder_bias`, etc., and to make this as generic as possible, including for the naming.
By "naming" I mean that now, we don't need anymore to manually build the name of each resized weights as thanks to these changes, they are always the same.
`get_input_embeddings` still returns a layer since the last commit 👍
I haven't reviewed in detail yet, but just looking at the API, the number of things to change for ALBERT (and in terms of lines of code) is a hard pass for me. Overriding the resize method as was done before was way easier; this adds too much complexity.
I understand that it is a big update. Nevertheless, the way it was done before didn't work and was quite buggy (the tests were basically testing almost nothing), and these changes are necessary to make the resizing work properly.
In any case, I'm open to any suggestion that will reduce the number of changes :)
@sgugger @LysandreJik I tried a new approach for the resizing that greatly reduces the changes in each model implementation; it is even much shorter than what we currently have in master. I have only tested it on ALBERT for now, so could you recheck that file and let me know what you think about it?
Ok I will clarify this a bit more:
As stated in #8657, that was just a temporary fix taking the quickest path while waiting for the real rework. This PR aims to solve all these issues and bring something more generic and less error-prone.
I personally never understood that #8657 was a quick fix that would need another PR afterwards. We cannot operate by adding new methods in one release and then breaking or deleting them in the next, so the work that was done in #8657 needs to be built upon, not destroyed (and please, say in bold next time you are just making a quick fix, as I would never have approved #8657 to be merged had I known...). So before we review this, the following needs to be addressed:
This is annoying, but this is why we usually don't merge a half-baked fix introducing new APIs: we can't break them afterwards.
We can keep this for the next major release. What you're asking for is doable but will make the codebase more complicated. I will rework this.
I have just done the following restore:
The old and new approaches were much more compatible than I thought, so it was easier to restore what @sgugger asked for, and now there should be zero breaking changes. Really sorry for the misunderstanding, I will be clearer next time.
It's better without the breaking changes, thanks for adapting.
Reviewing this, I see a lot of:
def get_output_embeddings(self):
    return self.input_embeddings

def set_output_embeddings(self, value):
    self.input_embeddings.weight = value
    self.input_embeddings.vocab_size = shape_list(value)[0]

def get_bias(self):
    return {"bias": self.bias}

def set_bias(self, value):
    self.bias = value["bias"]
    self.vocab_size = shape_list(value["bias"])[0]
which makes me think there is a better way to code the default in `modeling_tf_utils`. At the same time, I have a bad feeling this is currently breaking the weight-tying, which is currently untested (cf. my comment in `_resize_token_embeddings`, L803), so I think we should first have a test of the weight-tying before merging.
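As a rough illustration only, a base-class default along these lines might avoid the repetition (a sketch under the assumption that tied models can fall back to the input embeddings; this is not the merged `modeling_tf_utils` code):

import tensorflow as tf


class TFModelWithDefaults:
    # Sketch of possible base-class defaults; names and bodies are illustrative.

    def get_input_embeddings(self) -> tf.keras.layers.Layer:
        raise NotImplementedError  # provided by each concrete model

    def get_output_embeddings(self):
        # Default for tied weights: reuse the input embeddings, so most
        # models would not need to override this.
        return self.get_input_embeddings()

    def get_bias(self):
        # Models without an LM-head bias return None; models with one would
        # override this and return e.g. {"bias": self.bias}.
        return None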
@@ -534,6 +534,34 @@ def load_tf_weights(model, resolved_archive_file):
    return missing_layers, unexpected_layers


def init_copy_embeddings(old_embeddings, new_num_tokens):
    old_num_tokens, old_embedding_dim = shape_list(old_embeddings)
A docstring explaining what this function does would be great!
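For instance, a docstring plus a simplified version of the copy logic could look like this (only a sketch of the idea; the merged helper's exact signature and return values may differ):

import tensorflow as tf


def init_copy_embeddings(old_embeddings, new_num_tokens):
    """Build the initial value of a resized embedding matrix.

    Copies the first ``min(old_num_tokens, new_num_tokens)`` rows of
    ``old_embeddings`` and pads with zeros when the vocabulary grows, so the
    caller can re-initialize only the newly added rows.
    """
    old_num_tokens, old_embedding_dim = old_embeddings.shape
    num_tokens_to_copy = min(old_num_tokens, new_num_tokens)

    # Rows kept from the old matrix.
    copied_rows = old_embeddings[:num_tokens_to_copy]

    if new_num_tokens <= old_num_tokens:
        return copied_rows

    # Vocabulary grew: append zero rows for the new tokens.
    padding = tf.zeros(
        (new_num_tokens - old_num_tokens, old_embedding_dim),
        dtype=old_embeddings.dtype,
    )
    return tf.concat([copied_rows, padding], axis=0)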
)
return None
This should return `self.get_lm_head()`.
old_embeddings = self._find_weights(self.get_input_embeddings())
new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
# if word embeddings are not tied, make sure that lm head bias is resized as well |
This comment needs updating; it doesn't apply here.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
This shouldn't be here anymore. By letting the base method in `modeling_tf_utils` call `self.get_lm_head`, we can remove all the overrides of `get_output_layer_with_bias` in the modeling files.
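For example, the base-class default being suggested could look something like this (a sketch only; the warning text is copied from the overrides quoted above, while the base-class wiring is an assumption):

import warnings


class TFModelWithLMHead:
    # Sketch of a base-class default so the per-model overrides can be dropped.

    def get_lm_head(self):
        # Each model returns its LM head layer here (or None if it has none).
        return None

    def get_output_layer_with_bias(self):
        warnings.warn(
            "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.",
            FutureWarning,
        )
        return self.get_lm_head()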
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_head
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.pred_layer
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_loss
Same for this one.
self.set_bias(new_lm_head_bias)

# if word embeddings are not tied, make sure that lm head decoder is resized as well
if self.get_output_embeddings() is not None:
In all BERT-like models with tied embeddings, this function returns `self.input_embeddings`, so not `None`. This could then be breaking the tying of the weights (though it's probably just resizing a second time).
If you continue to follow the logic in this `if`, you can see that in `_get_resized_lm_head_decoder()` there is a test checking whether `input_embeddings` and `output_embeddings` are equal, with `if old_lm_head_decoder is not None and not is_input_output_equals`. So yes, I confirm that the tying is well tested 👍
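Roughly, the guard being described works like this (a simplified sketch; `_find_weights` and `_get_resized_embeddings` follow the snippets quoted in this review, and the final resizing call is an assumption):

import tensorflow as tf


def resize_decoder_if_untied(model, old_lm_head_decoder, new_num_tokens):
    # Sketch: only resize the LM head decoder when it is not tied to the
    # input embeddings.
    if old_lm_head_decoder is None:
        return None

    input_embeddings = model._find_weights(model.get_input_embeddings())

    # Tying check: if the decoder holds the same values as the input
    # embeddings, resizing the embeddings already covered the decoder.
    is_input_output_equals = bool(
        tf.reduce_all(tf.equal(input_embeddings, old_lm_head_decoder))
    )
    if not is_input_output_equals:
        # Untied decoder: give it its own resized variable.
        return model._get_resized_embeddings(old_lm_head_decoder, new_num_tokens)

    return old_lm_head_decoder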
@sgugger I should have addressed all your comments :) Share your thoughts 😉
This looks better than the previous version, and the tests are extensive enough. I find it a bit weird that `get_bias` returns a dict, but I understand why that's necessary. Also, it feels like a big chunk of `test_model_common_attributes` could be refactored into `test_modeling_tf_common.py`, but it's okay to leave it like this right now as it's explicit enough.
I may be missing something, but as Sylvain said, I'm not seeing any tests related to the tying itself.
def _find_weights(self, embedding_layer):
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    elif hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    elif hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
We should keep this method in mind when defining new LM head layers, otherwise this would grow to be unmaintainable. It's unfortunate that we need to have such a mapping here, but I understand why it's necessary, cf. your comment here.
I'm currently rethinking the way we implement the embeddings, and part of this "rethinking" is how to get rid of this attribute detection, or at least minimize the side effects.
Can we change the naming a bit maybe? Just `_find_weights` is a bit too generic IMO => `_get_word_embedding_weight`? Also (not sure here), maybe raise if nothing is found instead of silently returning None?
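For instance (a sketch of the suggested rename with an explicit error instead of a silent None; the exact error message is made up):

def _get_word_embedding_weight(self, embedding_layer):
    # Same attribute detection as the current _find_weights, but under a more
    # descriptive name and failing loudly when nothing matches.
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    if hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    if hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
    raise AttributeError(
        f"Could not find the word embedding weight on {embedding_layer.__class__.__name__}: "
        "expected one of `word_embeddings`, `weight` or `decoder`."
    )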
Thanks @LysandreJik! For the tying test, look in the
I mean I'm not seeing a test that checks it. I know these two generally point to the same tensors, but not always, do they?
Yes, there is a test for this, line 884 in
Yes, they always point to the same tensor when they are equal, 100% sure.
@patrickvonplaten the test
Good to merge for me now :)
Just fixed it: #9459
@sgugger any objection to merging this PR?
Looks good to me!
…ingface#9193)

* Start rework resizing
* Rework bias/decoder resizing
* Full resizing rework
* Full resizing rework
* Start to update the models with the new approach
* Finish to update the models
* Update all the tests
* Update the template
* Fix tests
* Fix tests
* Test a new approach
* Refactoring
* Refactoring
* Refactoring
* New rework
* Rework BART
* Rework bert+blenderbot
* Rework CTRL
* Rework Distilbert
* Rework DPR
* Rework Electra
* Rework Flaubert
* Rework Funnel
* Rework GPT2
* Rework Longformer
* Rework Lxmert
* Rework marian+mbart
* Rework mobilebert
* Rework mpnet
* Rework openai
* Rework pegasus
* Rework Roberta
* Rework T5
* Rework xlm+xlnet
* Rework template
* Fix TFT5EncoderOnly + DPRs
* Restore previous methods
* Fix Funnel
* Fix CTRL and TransforXL
* Apply style
* Apply Sylvain's comments
* Restore a test in DPR
* Address the comments
* Fix bug
* Apply style
* remove unused import
* Fix test
* Forgot a method
* missing test
* Trigger CI
* naming update
* Rebase
* Trigger CI
What does this PR do?
This PR completely reworks the process of input/output embedding and bias resizing. Exceptions are now better handled, including the weight names, which are now always consistent. The corresponding tests have also been entirely reworked and now give better coverage of this feature.

This PR adds a small breaking change: the `get_input_embeddings` method now returns the weights and no longer the embedding layer.
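As a minimal illustration of the call-site difference being described (`model` stands in for any TF model of the library; note that, per the conversation above, a later commit restored returning the layer):

# As described in this PR description: the weights come back directly as a
# tf.Variable, so no attribute detection is needed at the call site.
input_embedding_weights = model.get_input_embeddings()
vocab_size, hidden_size = input_embedding_weights.shape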