Integrate DeBERTa v2 (the 1.5B model surpassed human performance on Su… #10018
Conversation
Force-pushed from 26c07e1 to 0f394a2
Hi @BigBird01, thank you for opening the PR! Can you let me know once you're satisfied with your changes so that we can take a look? Thank you!
I see, thanks. As mentioned by e-mail, I think the correct approach here is to create a separate folder for deberta-v2. Can I handle that for you?
But I think the current implementation is better. First, the current changes not only contain the new features of v2 but also some improvements to v1. Second, the change between v1 and v2 is small. I also tested all the models with the current implementation and didn't find any regression. Third, and most important, by creating another folder for deberta-v2 we would need to add redundant code and tests to cover v2, which may introduce additional maintenance effort in the future. Let me know your thoughts.
The issues with modifying the code of the first version are:
- We might inadvertently modify some of the behavior of the past model
- We don't know what the differences are between the first and second versions
For example, here the DisentangledSelfAttention layer is changed quite radically, with some layer renames, which makes me doubt that first-version checkpoints can still be loaded into it.
Finally, you make a good point regarding maintainability. However, we can still enforce this by building tools which ensure that the code does not diverge. We have this setup for a multitude of models; for example, BART is very similar to mBART, Pegasus, and Marian.
Please take a look at the mBART code and look for the "# Copied from ..." comments, such as the following:
transformers/src/transformers/models/mbart/modeling_mbart.py
Lines 96 to 108 in 3be965c
# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted_mask = 1.0 - expanded_mask
    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
This ensures that the two implementations do not diverge, it helps identify where the code is different, and it is what we've chosen to go through in order to keep readability to a maximum.
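For instance, with the split in place, a marker on the v2 module could look roughly like the following. This is only a hypothetical illustration (the class body is simplified), with consistency enforced by the repository's copy-checking utility (utils/check_copies.py):

import torch.nn as nn

# Hypothetical, simplified illustration of the marker on a v2 class; the real class body
# would be an exact copy of the v1 implementation with the names substituted.
# Copied from transformers.models.deberta.modeling_deberta.ContextPooler with Deberta->DebertaV2
class ContextPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config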
In that case, feel free to take over the change and follow that convention to merge it to master. Please let me know when you finish it and I will test it. Thanks in advance @LysandreJik.
This works for me, thank you for your understanding. I'll ping you once the PR can be reviewed.
def _pre_load_hook(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
    self_state = self.state_dict()
    # Convert v1-style checkpoints (fused q/k/v projection) to the v2 layout (separate projections).
    if ((prefix + "query_proj.weight") not in state_dict) and ((prefix + "in_proj.weight") in state_dict):
        # Split the fused in_proj matrix per attention head into query/key/value parts.
        v1_proj = state_dict[prefix + "in_proj.weight"]
        v1_proj = v1_proj.unsqueeze(0).reshape(self.num_attention_heads, -1, v1_proj.size(-1))
        q, k, v = v1_proj.chunk(3, dim=1)
        state_dict[prefix + "query_proj.weight"] = q.reshape(-1, v1_proj.size(-1))
        state_dict[prefix + "key_proj.weight"] = k.reshape(-1, v1_proj.size(-1))
        state_dict[prefix + "key_proj.bias"] = self_state["key_proj.bias"]
        state_dict[prefix + "value_proj.weight"] = v.reshape(-1, v1_proj.size(-1))
        v1_query_bias = state_dict[prefix + "q_bias"]
        state_dict[prefix + "query_proj.bias"] = v1_query_bias
        v1_value_bias = state_dict[prefix + "v_bias"]
        state_dict[prefix + "value_proj.bias"] = v1_value_bias

        # Rename the relative-position projection weights to their v2 names.
        v1_pos_key_proj = state_dict[prefix + "pos_proj.weight"]
        state_dict[prefix + "pos_key_proj.weight"] = v1_pos_key_proj
        v1_pos_query_proj = state_dict[prefix + "pos_q_proj.weight"]
        state_dict[prefix + "pos_query_proj.weight"] = v1_pos_query_proj
        v1_pos_query_proj_bias = state_dict[prefix + "pos_q_proj.bias"]
        state_dict[prefix + "pos_query_proj.bias"] = v1_pos_query_proj_bias
        state_dict[prefix + "pos_key_proj.bias"] = self_state["pos_key_proj.bias"]

        # Drop the v1-only entries so they don't show up as unexpected keys.
        del state_dict[prefix + "in_proj.weight"]
        del state_dict[prefix + "q_bias"]
        del state_dict[prefix + "v_bias"]
        del state_dict[prefix + "pos_proj.weight"]
        del state_dict[prefix + "pos_q_proj.weight"]
        del state_dict[prefix + "pos_q_proj.bias"]
@BigBird01, could you comment on what this is needed for?
Sure. In v2 we used a different format to store the attention projection matrices, i.e. q, k, v. In v1, we concatenated them together to benefit from large batched matrix multiplication. However, we found this inconvenient when we wanted to make changes to it, so in v2 we fell back to the original design that uses separate projection matrices for q, k, v. This piece of code converts the v1 projection matrix to the v2 format.
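For illustration, a minimal standalone sketch of that conversion (the dimensions here are made up; the hook above performs the same reshape/chunk directly on the state dict entries):

import torch

hidden_size, num_attention_heads = 1536, 24  # hypothetical dimensions, for illustration only
# v1 stores a single fused projection of shape (3 * hidden_size, hidden_size)
in_proj_weight = torch.randn(3 * hidden_size, hidden_size)
# Group rows per attention head, then split each head's block into q, k, v parts
per_head = in_proj_weight.reshape(num_attention_heads, -1, hidden_size)
q, k, v = per_head.chunk(3, dim=1)
# v2 instead stores three separate (hidden_size, hidden_size) projection matrices
query_proj_weight = q.reshape(-1, hidden_size)
key_proj_weight = k.reshape(-1, hidden_size)
value_proj_weight = v.reshape(-1, hidden_size)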
Since we're splitting v1 and v2, there's no need for this anymore, and I believe this method doesn't make comprehension of the model any easier. Would you be okay with me removing this part and updating the model weights on the hub? The v1 models will still be loadable in DebertaModel, but not in DebertaV2Model. Are you okay with that?
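In practice that would mean something like the following (the model ids are just examples of existing v1/v2 checkpoints, and this is a sketch of the intended usage after the split rather than code from this PR):

from transformers import DebertaModel, DebertaV2Model

# v1 checkpoints keep loading into the v1 classes...
deberta_v1 = DebertaModel.from_pretrained("microsoft/deberta-base")
# ...and v2 checkpoints load into the v2 classes; cross-loading is not supported after the split.
deberta_v2 = DebertaV2Model.from_pretrained("microsoft/deberta-v2-xlarge")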
Yes, we can remove this part of the code since we are going to separate the two models. The model has already been updated, so all we need is to update the code and integrate it into the master branch.
PR to split the two models is here: BigBird01#1
@BigBird01 just wanted to ask whether the new additions include base and large versions of v2 as well (I saw that new base and large DeBERTa models were added), or will those be v1 only?
For v2 we don't have base and large yet. But we will add them in the future.
Are there any bottlenecks preventing this from being merged?
I think @LysandreJik will merge the changes to master soon.
Thanks @LysandreJik. I just reviewed the PR and I'm good with it.
After playing around with the model, I don't think we need pre-load hooks after all. In order to load the MNLI checkpoints, you just need to specify to the model that it needs three labels. It can be done as follows:

from transformers import DebertaV2ForSequenceClassification

model = DebertaV2ForSequenceClassification.from_pretrained("microsoft/deberta-v2-xlarge-mnli", num_labels=3)

But this should be taken care of in the configuration; I believe all your MNLI model configurations should have the number of labels set there. Following this, I found a few issues with the XLARGE MNLI checkpoint when loading it.
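As a rough sketch of handling this on the configuration side (persisting num_labels in the checkpoint config is my assumption of the intended fix, and the local directory name is made up):

from transformers import DebertaV2Config

config = DebertaV2Config.from_pretrained("microsoft/deberta-v2-xlarge-mnli")
config.num_labels = 3  # assumption: bake the MNLI label count into the checkpoint's config
config.save_pretrained("deberta-v2-xlarge-mnli-local")  # hypothetical local directory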
Apart from the two issues mentioned above, the PR looks to be in a good state to me. Would you mind:
I can take care of 2) and 3) if you want.
Force-pushed from 4ceadda to 74923f4 ("Integrate DeBERTa v2 (the 1.5B model surpassed human performance on SuperGLUE); Add DeBERTa v2 900M, 1.5B models")
Thanks @LysandreJik. I just fixed the model issue and resolved the merge conflicts.
Force-pushed from 74923f4 to c4f57ca
Thank you for taking care of those issues. @patrickvonplaten @sgugger, could you give this one a look? The unresolved issue is regarding the pre-load hooks. Loading a pre-trained model that already has a classification head with a different number of labels will not work, as the weights will have the wrong number of parameters. Until now, we've been doing:

from transformers import DebertaV2Model, DebertaV2ForSequenceClassification

seq_model = DebertaV2ForSequenceClassification.from_pretrained("xxx", num_labels=4)
seq_model.save_pretrained(directory)

base = DebertaV2Model.from_pretrained(directory)  # Lose the head
base.save_pretrained(directory)

seq_model = DebertaV2ForSequenceClassification.from_pretrained(directory, num_labels=8)

The pre-load hook that @BigBird01 worked on drops the head instead when it finds it is ill-loaded. I'm okay to merge it like this, and I'll work on a model-agnostic approach this week. Let me know your thoughts.
Thanks a lot for adding this model! Just left a few nits.
With respect to the pre-load hook, I'm fine with merging the functionality for this model before looking at a more general way of having it for all models.
@LysandreJik Thanks for the fix.
@@ -54,7 +58,7 @@ def __init__(self, config):
        self.dropout = StableDropout(config.pooler_dropout)
        self.config = config

    def forward(self, hidden_states, mask=None):
Nice catch
        return self.drop_prob


def MaskedLayerNorm(layerNorm, input, mask=None):
I think this design choice is a bit confusing. The function is only applied once, and it's a bit surprising to me that it's upper-cased. I would simply add the necessary code at the only occurrence of MaskedLayerNorm below; it makes the code more readable IMO.
rmask = (1 - input_mask).bool()
out.masked_fill_(rmask.unsqueeze(-1).expand(out.size()), 0)
out = ACT2FN[self.conv_act](self.dropout(out))
output_states = MaskedLayerNorm(self.LayerNorm, residual_states + out, input_mask)
This is the only time MaskedLayerNorm is used. I would simply add the logic required for MaskedLayerNorm here -> it improves readability a lot.
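For reference, a rough sketch of what inlining it at this call site might look like (assuming input_mask behaves like the (batch, seq_len) mask used elsewhere in the file; this is only an illustration, not the final code):

# Hypothetical inlined form of the single MaskedLayerNorm call site
layer_norm_input = residual_states + out
output_states = self.LayerNorm(layer_norm_input).to(layer_norm_input)
if input_mask is not None:
    if input_mask.dim() != output_states.dim():
        if input_mask.dim() == 4:
            input_mask = input_mask.squeeze(1).squeeze(1)
        input_mask = input_mask.unsqueeze(2)
    # zero out the masked positions after normalization
    output_states = output_states * input_mask.to(output_states.dtype)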
self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=True)

kernel_size = getattr(config, "conv_kernel_size", 0)
self.with_conv = False
Can we maybe just have:

self.conv = ConvLayer(config) if getattr(config, "conv_kernel_size", 0) > 0 else None

and delete self.with_conv? Then below we can just check whether self.conv is None or not -> it saves us a couple of lines and the self.with_conv attribute, which is hardcoded here.
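The check further down could then become something like this (the call arguments are only illustrative of the existing conv call, not a prescribed signature):

# Hypothetical: replace the self.with_conv flag with a None check on self.conv
if self.conv is not None:
    output_states = self.conv(hidden_states, output_states, input_mask)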
self.with_conv = True
self.conv = ConvLayer(config)

def get_rel_embedding(self):
This function seems to compute the same output every time it's called -> can we maybe just save the output in __init__ instead of computing it again?
self,
hidden_states,
attention_mask,
output_hidden_states=True,
This should be set to False by default, I think.
def make_log_bucket_position(relative_pos, bucket_size, max_position):
    sign = np.sign(relative_pos)
Maybe in a follow-up PR we can have this in PyTorch - but OK for me for now!
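For what it's worth, a rough sketch of a pure-PyTorch equivalent (untested here, with shapes and semantics assumed to mirror the numpy version above):

import math

import torch


def make_log_bucket_position_torch(relative_pos, bucket_size, max_position):
    # Positions within +/- bucket_size // 2 keep their linear index; farther positions are
    # mapped onto a log-scaled bucket, and the sign is restored at the end.
    sign = torch.sign(relative_pos)
    mid = bucket_size // 2
    abs_pos = torch.where(
        (relative_pos < mid) & (relative_pos > -mid),
        torch.full_like(relative_pos, mid - 1),
        torch.abs(relative_pos),
    )
    log_pos = torch.ceil(torch.log(abs_pos.float() / mid) / math.log((max_position - 1) / mid) * (mid - 1)) + mid
    return torch.where(abs_pos <= mid, relative_pos.float(), log_pos * sign).long()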
"heads (%d)" % (config.hidden_size, config.num_attention_heads) | ||
) | ||
self.num_attention_heads = config.num_attention_heads | ||
_attention_head_size = int(config.hidden_size / config.num_attention_heads) |
(nit) this is the same as config.hidden_size // config.num_attention_heads
)

# bxhxlxd
_attention_probs = XSoftmax.apply(attention_scores, attention_mask, -1)
Why _attention_probs instead of attention_probs here?
    relative_pos = relative_pos.unsqueeze(0).unsqueeze(0)
elif relative_pos.dim() == 3:
    relative_pos = relative_pos.unsqueeze(1)
# bxhxqxk
(nit) these comments look a bit cryptic -> maybe (bsz x ... x ...) would be better
if not os.path.isfile(vocab_file):
    raise ValueError(
        "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
        "model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
"model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file) | |
"model use `tokenizer = DebertaV2Tokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file) |
@@ -773,6 +775,18 @@ def _init_weights(self, module):
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    def _pre_load_hook(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
Can we add some more explanation here on what this does? Also, are we sure that this doesn't introduce any backward-breaking changes for deberta_v1?
Awesome! Thanks so much for adding this super important model @BigBird01! I left a couple of comments in the modeling_deberta_v2.py file - it would be great if we can make the code a bit cleaner there, e.g.:
- remove the with_conv attribute
- set output_hidden_states=False as a default
- refactor the MaskedLayerNorm function
Those changes should be pretty trivial - thanks so much for all your work!
Thank you @patrickvonplaten! I will take a look at it soon.
As seen with @BigBird01, taking over the PR!
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Force-pushed from 086698e to 8fae021
Thank you @LysandreJik!
My pleasure! Thank you for your work!
What does this PR do?
Integrate DeBERTa v2
The 1.5B XXLarge-V2 model surpasses human performance and T5 11B on the SuperGLUE leaderboard.
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.