
Full rework of the TF input/output embeddings and bias resizing #9193

Merged (53 commits) on Jan 11, 2021

Conversation

jplu (Contributor) commented Dec 18, 2020

What does this PR do?

This PR completely reworks the input/output embedding and bias resizing process. Exceptions are now better handled, and the resized weight names are always consistent. The corresponding tests have also been entirely reworked and now give better coverage of this feature.

This PR introduces a small breaking change: the get_input_embeddings method now returns the weights instead of the embedding layer.

- def get_input_embeddings(self) -> tf.keras.layers.Layer:
+ def get_input_embeddings(self) -> tf.Variable:
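For illustration, a minimal check of what a caller receives under each convention (the model class and checkpoint below are only examples, not code from this PR):

import tensorflow as tf
from transformers import TFBertForMaskedLM  # any TF model class works for this check

model = TFBertForMaskedLM.from_pretrained("bert-base-uncased")
emb = model.get_input_embeddings()

# Before this PR: `emb` is a tf.keras.layers.Layer wrapping the embedding matrix.
# With the change described above: `emb` is the tf.Variable holding the weights.
print(isinstance(emb, tf.keras.layers.Layer), isinstance(emb, tf.Variable))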
Member:

That seems like a breaking change, doesn't it? We're not returning the layer anymore, but the weights of that layer. Why is this change necessary?

Member:

And the get_output_embeddings method still returns a tf.keras.layers.Layer

jplu (Contributor Author):

They all return a tf.Variable holding the embedding/bias weights. This is now necessary to be sure we get the proper attribute without guessing whether it is word_embeddings, weight, bias, decoder_bias, etc., and to make the process as generic as possible, including for the naming.

jplu (Contributor Author):

By "naming" I mean that we no longer need to manually build the name of each resized weight; thanks to these changes, the names are always the same.

jplu (Contributor Author):

get_input_embeddings still returns a layer since the last commit 👍

sgugger (Collaborator) commented Dec 18, 2020

I haven't reviewed in detail yet, but just looking at the API, the number of things to change for ALBERT (and the number of lines of code) is a hard pass for me. Overriding the resize method as was done before was way easier; this adds too much complexity.

jplu (Contributor Author) commented Dec 18, 2020

I understand that it is a big update. Nevertheless, the way it was done before didn't work and was quite buggy (the tests were basically testing almost nothing), and these changes are necessary to make the resizing work properly.

jplu (Contributor Author) commented Dec 18, 2020

In any case, I'm open to any suggestion that reduces the number of changes :)

@jplu jplu marked this pull request as draft December 18, 2020 22:54
jplu (Contributor Author) commented Dec 18, 2020

@sgugger @LysandreJik I tried a new approach for the resizing that greatly reduces the changes in each model implementation; it is even shorter than what we currently have in master. I have applied it only to ALBERT for now; can you recheck that file and let me know what you think?

patrickvonplaten (Contributor) left a comment:

@jplu - can you clarify a bit why we need this new "TF input/output embeddings resizing" PR and why the previous PR (#8657) was unsatisfactory?

jplu (Contributor Author) commented Dec 21, 2020

Ok, I will clarify this a bit more:

  1. The first and most important issue with the current implementation is that the resizing is not compliant with graph compilation/execution, because of the use of numpy and tensor.numpy() calls, so it cannot be used in those contexts (a minimal sketch of a graph-friendly alternative is included at the end of this comment).
  2. The naming depended on where the build was coming from; that's why we needed get_prefix_bias_name for the bias and the manual construction of the embedding names. That was a temporary fix and very error prone, because the naming depends on several other things that the manual construction does not take into account.
  3. Resizing was not working for some models, such as BART, which raises an error (proof that the tests were not covering everything).
  4. The current resizing has two issues when resizing a model instantiated from scratch: it either raises an attribute error because the model is not fully built (weights not instantiated), or it produces a wrong naming; and if we save the model with that wrong naming, we cannot save/load it properly because the naming doesn't correspond to the current architecture.
  5. The weight names are not consistent across models: sometimes embeddings.word_embeddings, sometimes shared.weight, sometimes lm_head.bias, sometimes lm_loss.bias, sometimes mlm.bias, sometimes mlm.predictions.bias, and many other variants.

As stated in #8657, that was just a quick temporary fix while waiting for the real rework.

This PR aims to solve all these issues and bring something more generic and less error prone.
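To illustrate point 1 above, here is a minimal sketch of copying old embedding values into a resized matrix using only TF ops, so the operation stays usable under graph compilation. This is only an illustrative sketch, not the PR's actual implementation; the helper name and signature are made up.

import tensorflow as tf

def copy_embeddings_graph_friendly(old_embeddings: tf.Variable, new_num_tokens: int) -> tf.Tensor:
    # Grow or shrink the vocabulary dimension without ever calling .numpy(),
    # which would fail inside a tf.function / compiled graph.
    old_num_tokens, embedding_dim = old_embeddings.shape
    size_diff = new_num_tokens - old_num_tokens
    if size_diff >= 0:
        # Growing: keep all old rows and zero-pad the new ones.
        return tf.pad(old_embeddings.value(), [[0, size_diff], [0, 0]])
    # Shrinking: keep only the first new_num_tokens rows.
    return tf.slice(old_embeddings.value(), [0, 0], [new_num_tokens, embedding_dim])

old = tf.Variable(tf.random.normal((10, 8)), name="word_embeddings")
resized = copy_embeddings_graph_friendly(old, 12)  # shape (12, 8)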

@jplu jplu marked this pull request as ready for review December 21, 2020 14:20
sgugger (Collaborator) commented Dec 21, 2020

I personally never understood that #8657 was a quick fix that would need another PR afterwards. We cannot operate by adding new methods in one release then breaking or deleting them in the next, so the work that was done in #8657 needs to be built upon, not destroyed (and please, say in bold next time you are just making a quick fix, as I would never have approved #8657 for merging had I known...).

So before we review this, the following need to be addressed:

  • get_input_embeddings needs to keep the same return type
  • get_output_embeddings needs to keep the same return type
  • get_output_layer_with_bias can't disappear
  • get_prefix_bias_name can't disappear

This is annoying, but this is why we usually don't merge a half-baked fix introducing new APIs: we can't break them afterwards.

jplu (Contributor Author) commented Dec 21, 2020

We can keep this for the next major release.

What you ask is doable but will make the codebase more complicated. I will rework this.

jplu (Contributor Author) commented Dec 21, 2020

I have just restored the following:

  • get_input_embeddings still returns a layer
  • get_output_embeddings still returns a layer
  • get_output_layer_with_bias is back
  • get_prefix_bias_name is back

The old and new approaches were much more compatible than I thought, so it was easier than expected to restore what @sgugger asked, and now there should be zero breaking changes. Really sorry for the misunderstanding; I will be clearer next time.

@jplu jplu force-pushed the rework-embedding-bias-resize branch from 9d52f5e to 657b1cb Compare December 21, 2020 17:37
sgugger (Collaborator) left a comment:

It's better without the breaking changes, thanks for adapting.

Reviewing this, I see a lot of:

def get_output_embeddings(self):
    return self.input_embeddings

def set_output_embeddings(self, value):
    self.input_embeddings.weight = value
    self.input_embeddings.vocab_size = shape_list(value)[0]

def get_bias(self):
    return {"bias": self.bias}

def set_bias(self, value):
    self.bias = value["bias"]
    self.vocab_size = shape_list(value["bias"])[0]

which makes me think there is a better way to code the default in modeling_tf_utils.

At the same time, I have a bad feeling this is currently breaking the weight-tying, which is currently untested (cf. my comment in _resize_token_embeddings, L803), so I think we should first have a test of the weight-tying before merging.
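One possible shape for such a shared default, purely as a sketch (this is not the code that was merged; the mixin name is made up, and shape_list is assumed to be the existing helper from modeling_tf_utils):

from transformers.modeling_tf_utils import shape_list

class TFLMHeadDefaultsMixin:
    # Hypothetical mixin hosting the boilerplate repeated in each model's LM head;
    # subclasses only need to expose `input_embeddings` and `bias` attributes.
    def get_output_embeddings(self):
        return self.input_embeddings

    def set_output_embeddings(self, value):
        self.input_embeddings.weight = value
        self.input_embeddings.vocab_size = shape_list(value)[0]

    def get_bias(self):
        return {"bias": self.bias}

    def set_bias(self, value):
        self.bias = value["bias"]
        self.vocab_size = shape_list(value["bias"])[0]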

@@ -534,6 +534,34 @@ def load_tf_weights(model, resolved_archive_file):
return missing_layers, unexpected_layers


def init_copy_embeddings(old_embeddings, new_num_tokens):
    old_num_tokens, old_embedding_dim = shape_list(old_embeddings)
Collaborator:

A docstring explaining what this function does would be great!

Comment on lines 689 to 690
)
return None
Collaborator:

This should return self.get_lm_head().

old_embeddings = self._find_weights(self.get_input_embeddings())
new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)

# if word embeddings are not tied, make sure that lm head bias is resized as well
Collaborator:

This comment needs updating, it doesn't apply here.

Comment on lines 928 to 932
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
Collaborator:

This shouldn't be here anymore. By letting the base method in modeling_tf_utils call self.get_lm_head, we can remove all the overrides of get_output_layer_with_bias in the modeling files.

Comment on lines 1047 to 1051
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
Collaborator:

Same for this one.

Comment on lines 858 to 861
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
Collaborator:

Same for this one.

Comment on lines 846 to 850
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_head
Collaborator:

Same for this one.

Comment on lines 818 to 822
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.pred_layer
Collaborator:

Same for this one.

Comment on lines 1236 to 1240
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_loss
Collaborator:

Same for this one.

self.set_bias(new_lm_head_bias)

# if word embeddings are not tied, make sure that lm head decoder is resized as well
if self.get_output_embeddings() is not None:
Collaborator:

In all BERT-like models with tied embeddings, this function returns self.input_embeddings, so not None. This could then be breaking the tying of the weights (though it's probably just resizing a second time).

jplu (Contributor Author):

If you continue to follow the logic in this if, you can see that _get_resized_lm_head_decoder() contains a test checking whether the input and output embeddings are equal:

if old_lm_head_decoder is not None and not is_input_output_equals

So yes, I confirm that the tying is well tested 👍
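A minimal sketch of the idea behind that check (the names follow the discussion above; the exact comparison and signature in the library may differ):

def resize_decoder_if_untied(old_lm_head_decoder, old_embeddings, new_lm_head_decoder):
    # When the decoder weights are the very same variable as the input embeddings
    # (tied weights), they were already resized along with the input embeddings
    # and must not be replaced a second time.
    is_input_output_equals = old_lm_head_decoder is old_embeddings
    if old_lm_head_decoder is not None and not is_input_output_equals:
        return new_lm_head_decoder  # untied: use the separately resized decoder
    return old_lm_head_decoder      # tied or absent: nothing to do here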

jplu (Contributor Author) commented Dec 24, 2020

@sgugger I should have addressed all your comments :)

which makes me think there is a better way to code the default in modeling_tf_utils.

Share your thoughts 😉

LysandreJik (Member) left a comment:

This looks better than the previous version, and the tests are extensive enough. I find it a bit weird that get_bias returns a dict, but I understand why that's necessary.

Also, it feels like a big chunk of test_model_common_attributes could be refactored into test_modeling_tf_common.py, but it's okay to leave it like this for now as it's explicit enough.

I may be missing something, but as Sylvain said, I'm not seeing any tests related to the tying itself.

tests/test_modeling_tf_dpr.py (outdated review thread, resolved)
Comment on lines 782 to 818
def _find_weights(self, embedding_layer):
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    elif hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    elif hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
Member:

We should keep this method in mind when defining new LM head layers, otherwise this would grow to be unmaintainable. It's unfortunate that we need to have such a mapping here, but I understand why it's necessary (cf. your comment here).

jplu (Contributor Author) commented Jan 5, 2021:

I'm currently rethinking the way we implement the embeddings, and part of this "rethinking" is how to get rid of this attribute detection, or at least minimize its side effects.

Contributor:

Can we maybe change the naming a bit? _find_weights alone is a bit too generic IMO => _get_word_embedding_weight? Also (not sure here), maybe raise if nothing is found instead of silently returning None?
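A sketch of what that suggestion could look like (hypothetical, not the merged code):

def _get_word_embedding_weight(self, embedding_layer):
    # Same attribute lookup as _find_weights above, but with a more specific name
    # and a loud failure instead of silently returning None.
    for attr in ("word_embeddings", "weight", "decoder"):
        if hasattr(embedding_layer, attr):
            return getattr(embedding_layer, attr)
    raise AttributeError(
        f"Could not find the word embedding weight on {embedding_layer.__class__.__name__}."
    )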

jplu (Contributor Author) commented Jan 5, 2021

Thanks @LysandreJik! For the tying test, look in the _get_resized_lm_head_decoder() method. Unless you mean adding a test in test_modeling_tf_common?

LysandreJik (Member):

I mean I'm not seeing a test that checks get_input_embeddings() == get_output_embeddings() when weights are tied, but I may be missing something here.

I know these two generally point to the same tensors, but not always, do they?

jplu (Contributor Author) commented Jan 6, 2021

I mean I'm not seeing a test that checks get_input_embeddings() == get_output_embeddings() when weights are tied, but I may be missing something here.

Yes, there is a test for this, line 884 in modeling_tf_utils.

I know these two generally point to the same tensors, but not always, do they?

Yes, they always point to the same tensor when they are equal, 100% sure.
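A tiny runnable illustration of that point: when tied, the input and output embeddings are the same tf.Variable, so an update through one is visible through the other (variable name below is only illustrative).

import tensorflow as tf

shared = tf.Variable(tf.zeros((4, 2)), name="shared_word_embeddings")
input_embeddings, output_embeddings = shared, shared  # tied: one underlying tensor

input_embeddings.assign_add(tf.ones((4, 2)))
assert bool(tf.reduce_all(output_embeddings == 1.0))  # same values seen through both names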

@jplu jplu force-pushed the rework-embedding-bias-resize branch from fa46ce8 to 5137cb3 Compare January 7, 2021 11:04
jplu (Contributor Author) commented Jan 7, 2021

@patrickvonplaten the test TFLEDModelTest::test_pt_tf_model_equivalence seems very flaky; it looks like it randomly passes/fails.

jplu (Contributor Author) commented Jan 7, 2021

Good to merge for me now :)

patrickvonplaten (Contributor):

@patrickvonplaten the test TFLEDModelTest::test_pt_tf_model_equivalence seems very flaky; it looks like it randomly passes/fails.

Just fixed it: #9459

jplu (Contributor Author) commented Jan 8, 2021

@sgugger any objection to merging this PR?

sgugger (Collaborator) left a comment:

Looks good to me!

@LysandreJik LysandreJik merged commit 1243ee7 into huggingface:master Jan 11, 2021
@jplu jplu deleted the rework-embedding-bias-resize branch January 11, 2021 11:43
guyrosin pushed a commit to guyrosin/transformers that referenced this pull request Jan 15, 2021
Full rework of the TF input/output embeddings and bias resizing (huggingface#9193)

* Start rework resizing

* Rework bias/decoder resizing

* Full resizing rework

* Full resizing rework

* Start to update the models with the new approach

* Finish to update the models

* Update all the tests

* Update the template

* Fix tests

* Fix tests

* Test a new approach

* Refactoring

* Refactoring

* Refactoring

* New rework

* Rework BART

* Rework bert+blenderbot

* Rework CTRL

* Rework Distilbert

* Rework DPR

* Rework Electra

* Rework Flaubert

* Rework Funnel

* Rework GPT2

* Rework Longformer

* Rework Lxmert

* Rework marian+mbart

* Rework mobilebert

* Rework mpnet

* Rework openai

* Rework pegasus

* Rework Roberta

* Rework T5

* Rework xlm+xlnet

* Rework template

* Fix TFT5EncoderOnly + DPRs

* Restore previous methods

* Fix Funnel

* Fix CTRL and TransforXL

* Apply style

* Apply Sylvain's comments

* Restore a test in DPR

* Address the comments

* Fix bug

* Apply style

* remove unused import

* Fix test

* Forgot a method

* missing test

* Trigger CI

* naming update

* Rebase

* Trigger CI