Full rework of the TF input/output embeddings and bias resizing #9193
Conversation
- def get_input_embeddings(self) -> tf.keras.layers.Layer:
+ def get_input_embeddings(self) -> tf.Variable:
That seems like a breaking change, doesn't it? We're not returning the layer anymore, but the weights of that layer. Why is this change necessary?
And the `get_output_embeddings` method still returns a `tf.keras.layers.Layer`.
They all return a `tf.Variable` that holds the embeddings/bias weights. This is now necessary in order to be sure to have the proper attribute instead of guessing whether it is `word_embeddings`, `weights`, `bias`, `decoder_bias`, etc., and to make this as generic as possible, including for the naming.
By "naming" I mean that now, we don't need anymore to manually build the name of each resized weights as thanks to these changes, they are always the same.
`get_input_embeddings` still returns a layer since the last commit 👍
I haven't reviewed in detail yet, but just looking at the API, the number of things to change for ALBERT (and in terms of lines of code) is a hard pass for me. Overriding the resize method as was done before was way easier; this adds too much complexity.
I understand that it is a big update. Nevertheless, the way it was done before didn't work and was quite buggy (the tests were basically testing almost nothing), and these changes are necessary to make the resizing work properly.
In any case, I'm open to any suggestion that will reduce the number of changes :)
@sgugger @LysandreJik I tried a new approach for the resizing that greatly reduces the changes in each model implementation; it is even much shorter than what we currently have in master. I have only tested it on ALBERT for now, so could you recheck that file and let me know what you think about it?
Ok I will clarify this a bit more:
As stated in #8657, that was just a temporary fix taking the quickest path while waiting for the real rework. This PR aims to solve all these issues and bring something more generic and less error-prone.
I personally never understood that #8657 was a quick fix that would need another PR afterwards. We cannot operate by adding new methods in one release and then breaking or deleting them in the next, so the work that was done in #8657 needs to be built upon, not destroyed (and please, say in bold next time you are just making a quick fix, as I would never have approved #8657 to be merged had I known...). So before we review this, the following needs to be addressed:
This is annoying, but this is why we usually don't merge a half-baked fix introducing new APIs: we can't break them afterwards.
We can keep this for the next major release. What you're asking for is doable but will make the codebase more complicated. I will rework this.
I have just done the following restore:
The old and new approaches were much more compatible than I thought, so it was easier to restore what @sgugger asked for, and now there should be zero breaking changes. Really sorry for the misunderstanding, I will be clearer next time.
It's better without the breaking changes, thanks for adapting.
Reviewing this, I see a lot of:
def get_output_embeddings(self):
    return self.input_embeddings

def set_output_embeddings(self, value):
    self.input_embeddings.weight = value
    self.input_embeddings.vocab_size = shape_list(value)[0]

def get_bias(self):
    return {"bias": self.bias}

def set_bias(self, value):
    self.bias = value["bias"]
    self.vocab_size = shape_list(value["bias"])[0]
which makes me think there is a better way to code the default in `modeling_tf_utils`. At the same time, I have a bad feeling this is currently breaking the weight-tying, which is currently untested (cf. my comment in `_resize_token_embeddings`, L803), so I think we should first have a test of the weight-tying before merging.
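As a rough illustration only, a base-class default along these lines might avoid the repetition (a sketch under the assumption that tied models can fall back to the input embeddings; this is not the merged `modeling_tf_utils` code):

import tensorflow as tf


class TFModelWithDefaults:
    # Sketch of possible base-class defaults; names and bodies are illustrative.

    def get_input_embeddings(self) -> tf.keras.layers.Layer:
        raise NotImplementedError  # provided by each concrete model

    def get_output_embeddings(self):
        # Default for tied weights: reuse the input embeddings, so most
        # models would not need to override this.
        return self.get_input_embeddings()

    def get_bias(self):
        # Models without an LM-head bias return None; models with one would
        # override this and return e.g. {"bias": self.bias}.
        return None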
@@ -534,6 +534,34 @@ def load_tf_weights(model, resolved_archive_file):
    return missing_layers, unexpected_layers


def init_copy_embeddings(old_embeddings, new_num_tokens):
    old_num_tokens, old_embedding_dim = shape_list(old_embeddings)
A docstring explaining what this function does would be great!
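For instance, a docstring plus a simplified version of the copy logic could look like this (only a sketch of the idea; the merged helper's exact signature and return values may differ):

import tensorflow as tf


def init_copy_embeddings(old_embeddings, new_num_tokens):
    """Build the initial value of a resized embedding matrix.

    Copies the first ``min(old_num_tokens, new_num_tokens)`` rows of
    ``old_embeddings`` and pads with zeros when the vocabulary grows, so the
    caller can re-initialize only the newly added rows.
    """
    old_num_tokens, old_embedding_dim = old_embeddings.shape
    num_tokens_to_copy = min(old_num_tokens, new_num_tokens)

    # Rows kept from the old matrix.
    copied_rows = old_embeddings[:num_tokens_to_copy]

    if new_num_tokens <= old_num_tokens:
        return copied_rows

    # Vocabulary grew: append zero rows for the new tokens.
    padding = tf.zeros(
        (new_num_tokens - old_num_tokens, old_embedding_dim),
        dtype=old_embeddings.dtype,
    )
    return tf.concat([copied_rows, padding], axis=0)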
)
return None
This should return `self.get_lm_head()`.
old_embeddings = self._find_weights(self.get_input_embeddings())
new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens)
# if word embeddings are not tied, make sure that lm head bias is resized as well |
This comment needs updating; it doesn't apply here.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
This shouldn't be here anymore. By letting the base method in `modeling_tf_utils` call `self.get_lm_head`, we can remove all the overrides of `get_output_layer_with_bias` in the modeling files.
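For example, the base-class default being suggested could look something like this (a sketch only; the warning text is copied from the overrides quoted above, while the base-class wiring is an assumption):

import warnings


class TFModelWithLMHead:
    # Sketch of a base-class default so the per-model overrides can be dropped.

    def get_lm_head(self):
        # Each model returns its LM head layer here (or None if it has none).
        return None

    def get_output_layer_with_bias(self):
        warnings.warn(
            "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.",
            FutureWarning,
        )
        return self.get_lm_head()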
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.mlm.predictions
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_head
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.pred_layer
Same for this one.
def get_output_layer_with_bias(self):
    warnings.warn(
        "The method get_output_layer_with_bias is deprecated. Please use `get_lm_head` instead.", FutureWarning
    )
    return self.lm_loss
Same for this one.
self.set_bias(new_lm_head_bias)

# if word embeddings are not tied, make sure that lm head decoder is resized as well
if self.get_output_embeddings() is not None:
In all BERT-like models with tied embeddings, this function returns `self.input_embeddings`, so not `None`. This could then be breaking the tying of the weights (though it's probably just resizing a second time).
If you continue to follow the logic in this `if`, you can see that in `_get_resized_lm_head_decoder()` there is a test checking whether `input_embeddings` and `output_embeddings` are equal, with `if old_lm_head_decoder is not None and not is_input_output_equals`. So yes, I confirm that the tying is well tested 👍
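Roughly, the guard being described works like this (a simplified sketch; `_find_weights` and `_get_resized_embeddings` follow the snippets quoted in this review, and the final resizing call is an assumption):

import tensorflow as tf


def resize_decoder_if_untied(model, old_lm_head_decoder, new_num_tokens):
    # Sketch: only resize the LM head decoder when it is not tied to the
    # input embeddings.
    if old_lm_head_decoder is None:
        return None

    input_embeddings = model._find_weights(model.get_input_embeddings())

    # Tying check: if the decoder holds the same values as the input
    # embeddings, resizing the embeddings already covered the decoder.
    is_input_output_equals = bool(
        tf.reduce_all(tf.equal(input_embeddings, old_lm_head_decoder))
    )
    if not is_input_output_equals:
        # Untied decoder: give it its own resized variable.
        return model._get_resized_embeddings(old_lm_head_decoder, new_num_tokens)

    return old_lm_head_decoder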
@sgugger I should have addressed all your comments :) Share your thoughts 😉
This looks better than the previous version, and the tests are extensive enough. I find it a bit weird that `get_bias` returns a dict, but I understand why that's necessary. Also, it feels like a big chunk of `test_model_common_attributes` could be refactored into `test_modeling_tf_common.py`, but it's okay to leave it like this right now as it's explicit enough.
I may be missing something, but as Sylvain said, I'm not seeing any tests related to the tying itself.
def _find_weights(self, embedding_layer):
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    elif hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    elif hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
We should keep this method in mind when defining new LM head layers, otherwise this would grow to be unmaintainable. It's unfortunate that we need to have such a mapping here, but I understand why it's necessary, cf. your comment here.
I'm currently rethinking the way we implement the embeddings, and part of this "rethinking" is how to get rid of this attribute detection, or at least minimize the side effects.
Can we change the naming a bit maybe? Just `_find_weights` is a bit too generic IMO => `_get_word_embedding_weight`? Also (not sure here), maybe raise if nothing is found instead of silently returning None?
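For instance (a sketch of the suggested rename with an explicit error instead of a silent None; the exact error message is made up):

def _get_word_embedding_weight(self, embedding_layer):
    # Same attribute detection as the current _find_weights, but under a more
    # descriptive name and failing loudly when nothing matches.
    if hasattr(embedding_layer, "word_embeddings"):
        return embedding_layer.word_embeddings
    if hasattr(embedding_layer, "weight"):
        return embedding_layer.weight
    if hasattr(embedding_layer, "decoder"):
        return embedding_layer.decoder
    raise AttributeError(
        f"Could not find the word embedding weight on {embedding_layer.__class__.__name__}: "
        "expected one of `word_embeddings`, `weight` or `decoder`."
    )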
Thanks @LysandreJik! For the tying test, look in the
I mean I'm not seeing a test that checks it. I know these two generally point to the same tensors, but not always, do they?
Yes, there is a test for this, line 884 in
Yes, they always point to the same tensor when they are equal, 100% sure.
@patrickvonplaten the test
Good to merge for me now :)
Just fixed it: #9459
@sgugger any objection to merging this PR?
Looks good to me!
…ingface#9193)

* Start rework resizing
* Rework bias/decoder resizing
* Full resizing rework
* Full resizing rework
* Start to update the models with the new approach
* Finish to update the models
* Update all the tests
* Update the template
* Fix tests
* Fix tests
* Test a new approach
* Refactoring
* Refactoring
* Refactoring
* New rework
* Rework BART
* Rework bert+blenderbot
* Rework CTRL
* Rework Distilbert
* Rework DPR
* Rework Electra
* Rework Flaubert
* Rework Funnel
* Rework GPT2
* Rework Longformer
* Rework Lxmert
* Rework marian+mbart
* Rework mobilebert
* Rework mpnet
* Rework openai
* Rework pegasus
* Rework Roberta
* Rework T5
* Rework xlm+xlnet
* Rework template
* Fix TFT5EncoderOnly + DPRs
* Restore previous methods
* Fix Funnel
* Fix CTRL and TransforXL
* Apply style
* Apply Sylvain's comments
* Restore a test in DPR
* Address the comments
* Fix bug
* Apply style
* remove unused import
* Fix test
* Forgot a method
* missing test
* Trigger CI
* naming update
* Rebase
* Trigger CI
What does this PR do?
This PR completely reworks the process of input/output embedding and bias resizing. Exceptions are now better handled, including the weight names, which are now always consistent. The corresponding tests have also been entirely reworked and now give better coverage of this feature.

This PR adds a small breaking change: the `get_input_embeddings` method now returns the weights and no longer the embedding layer.
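As a minimal illustration of the call-site difference being described (`model` stands in for any TF model of the library; note that, per the conversation above, a later commit restored returning the layer):

# As described in this PR description: the weights come back directly as a
# tf.Variable, so no attribute detection is needed at the call site.
input_embedding_weights = model.get_input_embeddings()
vocab_size, hidden_size = input_embedding_weights.shape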