
Integrate DeBERTa v2 (the 1.5B model surpassed human performance on Su… #10018

Merged (8 commits, Feb 19, 2021)

Conversation

@BigBird01 (Contributor) commented Feb 4, 2021

What does this PR do?

Integrate DeBERTa v2

  1. Add DeBERTa XLarge model, DeBERTa v2 XLarge model, XXLarge model
| Model | Parameters | MNLI-m/mm |
|---|---|---|
| Base | 140M | 88.8/88.6 |
| Large | 400M | 91.3/91.1 |
| XLarge | 750M | 91.5/91.2 |
| V2-XLarge | 900M | 91.7/91.6 |
| V2-XXLarge | 1.5B | 91.7/91.9 |

The 1.5B XXLarge-V2 model is the one that surpasses human performance and T5 11B on the SuperGLUE leaderboard.

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@BigBird01 force-pushed the penhe/debertav2 branch 3 times, most recently from 26c07e1 to 0f394a2 on February 5, 2021 02:44
@LysandreJik (Member)

Hi @BigBird01, thank you for opening the PR! Can you let me know once you're satisfied with your changes so that we can take a look? Thank you!

@BigBird01 (Contributor, Author) commented Feb 5, 2021 via email

@LysandreJik (Member)

I see, thanks. As mentioned by e-mail, I think the correct approach here is to create a deberta-v2 folder that contains all of the changes, rather than implementing changes in the original deberta folder.

Can I handle that for you?

@BigBird01 (Contributor, Author) commented Feb 5, 2021

> I see, thanks. As mentioned by e-mail, I think the correct approach here is to create a deberta-v2 folder that contains all of the changes, rather than implementing changes in the original deberta folder.
>
> Can I handle that for you?

But I think the current implementation is better. First, the current changes not only contain the new features of v2 but also some improvements to v1. Second, the difference between v2 and v1 is small; I also tested all the models with the current implementation and didn't find any regression. Third, and most important, creating another folder for deberta-v2 means adding redundant code and tests to cover v2, which may introduce additional maintenance effort in the future.

Let me know your thoughts.

@LysandreJik (Member) left a comment:

The issues with modifying the code of the first version are:

  • We might inadvertently modify some of the behavior of the past model
  • We can't tell what the difference is between the first and second versions

For example, here the DisentangledSelfAttention layer is radically changed, including some layer-name changes, which makes me doubt that first-version checkpoints can still be loaded into it.

Finally, you make a good point regarding maintainability. However, we can still enforce this by building tools that ensure the code does not diverge. We have this setup for a multitude of models; for example, BART is very similar to mBART, Pegasus, and Marian.

Please take a look at the mBART code and look for the "# Copied from ..." comments, such as the following:

```python
# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted_mask = 1.0 - expanded_mask
    return inverted_mask.masked_fill(inverted_mask.bool(), torch.finfo(dtype).min)
```

This ensures that the two implementations do not diverge, helps identify where the code differs, and is the approach we've chosen in order to keep readability to a maximum.
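For reference, the expansion that `_expand_mask` performs can be exercised in isolation. Below is a NumPy analogue of the same logic (illustrative only, not the transformers code; the name `expand_mask_np` is made up for this sketch):

```python
import numpy as np

def expand_mask_np(mask, tgt_len=None):
    """NumPy analogue of `_expand_mask`: [bsz, src_len] -> [bsz, 1, tgt_len, src_len],
    with masked (0) positions filled with the dtype's minimum value."""
    bsz, src_len = mask.shape
    tgt_len = tgt_len if tgt_len is not None else src_len
    # Broadcast the per-token mask across the query (target) dimension
    expanded = np.broadcast_to(mask[:, None, None, :], (bsz, 1, tgt_len, src_len)).astype(np.float32)
    inverted = 1.0 - expanded  # 1.0 where the token is masked out
    return np.where(inverted.astype(bool), np.finfo(np.float32).min, inverted)

m = np.array([[1, 1, 0]])  # last token is padding
out = expand_mask_np(m)
print(out.shape)  # (1, 1, 3, 3)
```

Kept positions come out as 0.0 and masked positions as the dtype minimum, so the result can be added directly to attention scores before the softmax.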

@BigBird01 (Contributor, Author) commented Feb 5, 2021 via email

@LysandreJik (Member)

This works for me, thank you for your understanding. I'll ping you once the PR can be reviewed.

@BigBird01 (Contributor, Author) commented Feb 5, 2021 via email

Comment on lines 816 to 845
```python
def _pre_load_hook(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
    self_state = self.state_dict()
    if ((prefix + "query_proj.weight") not in state_dict) and ((prefix + "in_proj.weight") in state_dict):
        v1_proj = state_dict[prefix + "in_proj.weight"]
        v1_proj = v1_proj.unsqueeze(0).reshape(self.num_attention_heads, -1, v1_proj.size(-1))
        q, k, v = v1_proj.chunk(3, dim=1)
        state_dict[prefix + "query_proj.weight"] = q.reshape(-1, v1_proj.size(-1))
        state_dict[prefix + "key_proj.weight"] = k.reshape(-1, v1_proj.size(-1))
        state_dict[prefix + "key_proj.bias"] = self_state["key_proj.bias"]
        state_dict[prefix + "value_proj.weight"] = v.reshape(-1, v1_proj.size(-1))
        v1_query_bias = state_dict[prefix + "q_bias"]
        state_dict[prefix + "query_proj.bias"] = v1_query_bias
        v1_value_bias = state_dict[prefix + "v_bias"]
        state_dict[prefix + "value_proj.bias"] = v1_value_bias

        v1_pos_key_proj = state_dict[prefix + "pos_proj.weight"]
        state_dict[prefix + "pos_key_proj.weight"] = v1_pos_key_proj
        v1_pos_query_proj = state_dict[prefix + "pos_q_proj.weight"]
        state_dict[prefix + "pos_query_proj.weight"] = v1_pos_query_proj
        v1_pos_query_proj_bias = state_dict[prefix + "pos_q_proj.bias"]
        state_dict[prefix + "pos_query_proj.bias"] = v1_pos_query_proj_bias
        state_dict[prefix + "pos_key_proj.bias"] = self_state["pos_key_proj.bias"]

        del state_dict[prefix + "in_proj.weight"]
        del state_dict[prefix + "q_bias"]
        del state_dict[prefix + "v_bias"]
        del state_dict[prefix + "pos_proj.weight"]
        del state_dict[prefix + "pos_q_proj.weight"]
        del state_dict[prefix + "pos_q_proj.bias"]
```

@LysandreJik (Member):

@BigBird01, could you comment on what this is needed for?

@BigBird01 (Contributor, Author):

Sure. In v2 we use a different format to store the attention projection matrices, i.e. q, k, v. In v1 we concatenated them into a single matrix to benefit from one large batched matrix multiplication, but we found that inconvenient when we wanted to make changes, so in v2 we fell back to the original design that uses separate projection matrices for q, k, and v. This piece of code converts a v1 projection matrix into the v2 format.
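The conversion described above (and implemented by the hook's `reshape`/`chunk` calls) can be sketched with a toy example. This is a NumPy stand-in with made-up sizes, not the actual transformers code:

```python
import numpy as np

# Toy sizes (hypothetical). In the v1 layout, q, k, and v rows are
# concatenated per attention head inside one in_proj matrix.
num_heads, head_dim, hidden = 2, 3, 6
in_proj = np.arange(3 * num_heads * head_dim * hidden, dtype=np.float32).reshape(
    num_heads * 3 * head_dim, hidden
)

# Mirror of the hook: view as (num_heads, 3 * head_dim, hidden), then
# split each head's rows into its q, k, and v slices.
per_head = in_proj.reshape(num_heads, -1, hidden)
q, k, v = np.split(per_head, 3, axis=1)

# Flatten back so each matrix has the standard (num_heads * head_dim, hidden) shape.
query_proj = q.reshape(-1, hidden)
key_proj = k.reshape(-1, hidden)
value_proj = v.reshape(-1, hidden)

print(query_proj.shape)  # (6, 6)
```

The per-head reshape is the important step: a plain three-way split of `in_proj` along dim 0 would interleave heads incorrectly, because each head's q/k/v block is stored contiguously.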

@LysandreJik (Member):

Since we're splitting v1 and v2, there's no need for this anymore, and I don't believe this method makes the model easier to understand. Would you be okay with me removing this part and updating the model weights on the hub? The v1 models will still be loadable in DebertaModel, but not in DebertaV2Model.

@BigBird01 (Contributor, Author) commented Feb 8, 2021:

Yes, we can remove this part of the code since we are going to separate the two. The model has already been updated, so all we need is to update the code and integrate it with the master branch.

@LysandreJik (Member)

PR to split the two models is here: BigBird01#1

@Shashi456
@BigBird01, just wanted to ask: do the new additions include base and large versions of v2 as well (I saw that new base and large DeBERTa models were added), or will those be v1 only?

@BigBird01 (Contributor, Author):

> @BigBird01 just wanted to ask if the new additions involve the base and large versions of v2 as well, because i saw that new base and large deberta models were added as well, or will they be just v1?

For v2 we don't have base and large yet, but we will add them in the future.

@hendrycks (Contributor):

Are there any bottlenecks preventing this from being merged?

@BigBird01 (Contributor, Author):

> @BigBird01 just wanted to ask if the new additions involve the base and large versions of v2 as well, because i saw that new base and large deberta models were added as well, or will they be just v1?

I think @LysandreJik will merge the changes to master soon.

> PR to split the two models is here: BigBird01#1

Thanks @LysandreJik. I just reviewed the PR and I'm good with it.

> Are there any bottlenecks preventing this from being merged?

@LysandreJik (Member)

After playing around with the model, I don't think we need pre-load hooks after all. In order to load the MNLI checkpoints, you just need to specify to the model that it needs three labels. It can be done as follows:

```python
from transformers import DebertaV2ForSequenceClassification

model = DebertaV2ForSequenceClassification.from_pretrained("microsoft/deberta-v2-xlarge-mnli", num_labels=3)
```

But this should be taken care of in the configuration. I believe all your MNLI model configurations should have the num_labels field set to 3 in order to be loadable.


Following this, I found a few issues with the XLARGE MNLI checkpoint. When loading it in the DebertaForSequenceClassification model, I get the following messages (the same ten keys repeat for every layer; output abridged here):

Some weights of the model checkpoint at microsoft/deberta-xlarge-mnli were not used when initializing DebertaForSequenceClassification: ['deberta.encoder.layer.0.attention.self.query_proj.weight', 'deberta.encoder.layer.0.attention.self.query_proj.bias', 'deberta.encoder.layer.0.attention.self.key_proj.weight', 'deberta.encoder.layer.0.attention.self.key_proj.bias', 'deberta.encoder.layer.0.attention.self.value_proj.weight', 'deberta.encoder.layer.0.attention.self.value_proj.bias', 'deberta.encoder.layer.0.attention.self.pos_key_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_key_proj.bias', 'deberta.encoder.layer.0.attention.self.pos_query_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_query_proj.bias', ... (same keys repeated for the remaining layers)
'deberta.encoder.layer.45.attention.self.query_proj.weight', 'deberta.encoder.layer.45.attention.self.query_proj.bias', 'deberta.encoder.layer.45.attention.self.key_proj.weight', 'deberta.encoder.layer.45.attention.self.key_proj.bias', 'deberta.encoder.layer.45.attention.self.value_proj.weight', 'deberta.encoder.layer.45.attention.self.value_proj.bias', 'deberta.encoder.layer.45.attention.self.pos_key_proj.weight', 'deberta.encoder.layer.45.attention.self.pos_key_proj.bias', 'deberta.encoder.layer.45.attention.self.pos_query_proj.weight', 'deberta.encoder.layer.45.attention.self.pos_query_proj.bias', 'deberta.encoder.layer.46.attention.self.query_proj.weight', 'deberta.encoder.layer.46.attention.self.query_proj.bias', 'deberta.encoder.layer.46.attention.self.key_proj.weight', 'deberta.encoder.layer.46.attention.self.key_proj.bias', 'deberta.encoder.layer.46.attention.self.value_proj.weight', 'deberta.encoder.layer.46.attention.self.value_proj.bias', 'deberta.encoder.layer.46.attention.self.pos_key_proj.weight', 'deberta.encoder.layer.46.attention.self.pos_key_proj.bias', 'deberta.encoder.layer.46.attention.self.pos_query_proj.weight', 'deberta.encoder.layer.46.attention.self.pos_query_proj.bias', 'deberta.encoder.layer.47.attention.self.query_proj.weight', 'deberta.encoder.layer.47.attention.self.query_proj.bias', 'deberta.encoder.layer.47.attention.self.key_proj.weight', 'deberta.encoder.layer.47.attention.self.key_proj.bias', 'deberta.encoder.layer.47.attention.self.value_proj.weight', 'deberta.encoder.layer.47.attention.self.value_proj.bias', 'deberta.encoder.layer.47.attention.self.pos_key_proj.weight', 'deberta.encoder.layer.47.attention.self.pos_key_proj.bias', 'deberta.encoder.layer.47.attention.self.pos_query_proj.weight', 'deberta.encoder.layer.47.attention.self.pos_query_proj.bias']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-xlarge-mnli and are newly initialized: ['deberta.encoder.layer.0.attention.self.q_bias', 'deberta.encoder.layer.0.attention.self.v_bias', 'deberta.encoder.layer.0.attention.self.in_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.0.attention.self.pos_q_proj.bias', ... (the same six parameters repeat for layers 1 through 47) ..., 'deberta.encoder.layer.47.attention.self.pos_q_proj.weight', 'deberta.encoder.layer.47.attention.self.pos_q_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

@LysandreJik
Member

Apart from the two issues mentioned above, the PR looks in a good state to me. Would you mind:

  • Checking what's wrong with microsoft/deberta-xlarge-mnli
  • Adding the num_labels field to the configuration of your MNLI models and removing the pre-load hooks
  • Rebasing on the current master

I can take care of 2) and 3) if you want.

@BigBird01
Contributor Author

Apart from the two issues mentioned above, the PR looks in a good state to me. Would you mind:

  • Checking what's wrong with microsoft/deberta-xlarge-mnli
  • Adding the num_labels field to the configuration of your MNLI models and removing the pre-load hooks
  • Rebasing on the current master

I can take care of 2) and 3) if you want.

Thanks @LysandreJik. I just fixed the model issue and resolved the merge conflicts.
For the hook issue, adding num_labels will not fix it. In most cases we want to load an MNLI fine-tuned model for another task that has one or two labels, e.g. MRPC, SST-2, STS-B, so we still need the hook unless the loading issue is fixed in the load_pretrained_model method. One possible way is to add an ignore-errors dictionary, just like the ignored unexpected keys. But I think we should fix that in a separate PR.

@LysandreJik
Member

Thank you for taking care of those issues.

@patrickvonplaten @sgugger, could you give this one a look?

The unresolved issue is the pre-load hooks. Loading a pretrained model that already has a classification head with a different number of labels will not work, as the head's weights will have the wrong number of parameters.

Until now, we've been doing:

from transformers import DebertaV2Model, DebertaV2ForSequenceClassification

seq_model = DebertaV2ForSequenceClassification.from_pretrained("xxx", num_labels=4)
seq_model.save_pretrained(directory)

base = DebertaV2Model.from_pretrained(directory)  # Lose the head
base.save_pretrained(directory)

seq_model = DebertaV2ForSequenceClassification.from_pretrained(directory, num_labels=8)

The pre-load hook that @BigBird01 worked on instead drops the head when it finds that its weights can't be loaded. I'm okay with merging it like this, and I'll work on a model-agnostic approach this week. Let me know your thoughts.
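The shape filtering such a hook performs can be sketched independently of any model. The helper below is purely illustrative (filter_incompatible_keys is not part of the transformers API); it compares checkpoint shapes against model shapes and drops the entries that disagree, which is what happens to an MNLI head loaded into a task with a different number of labels:

```python
def filter_incompatible_keys(checkpoint_shapes, model_shapes):
    """Drop checkpoint entries whose shape disagrees with the model's.

    Both arguments map parameter names to shape tuples, mimicking what a
    pre-load hook can inspect in the state dict before weights are copied.
    """
    kept, dropped = {}, []
    for key, shape in checkpoint_shapes.items():
        if key in model_shapes and model_shapes[key] != shape:
            dropped.append(key)  # e.g. a 3-label MNLI head loaded into a 2-label task
        else:
            kept[key] = shape
    return kept, dropped


# An MNLI checkpoint (3 labels) loaded into a binary classifier:
checkpoint = {"classifier.weight": (3, 1024), "classifier.bias": (3,)}
model = {"classifier.weight": (2, 1024), "classifier.bias": (2,)}
kept, dropped = filter_incompatible_keys(checkpoint, model)
```

The dropped head parameters are then left to the usual random initialization, exactly as the "newly initialized" warning above describes.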

Collaborator

@sgugger sgugger left a comment


Thanks a lot for adding this model! Just left a few nits.

With respect to the pre-hook load, I'm fine with merging the functionality for this model before looking at a more general way of having it for all models.

docs/source/model_doc/deberta_v2.rst Outdated Show resolved Hide resolved
src/transformers/models/deberta/modeling_deberta.py Outdated Show resolved Hide resolved
src/transformers/models/deberta_v2/modeling_deberta_v2.py Outdated Show resolved Hide resolved
tests/test_tokenization_deberta_v2.py Outdated Show resolved Hide resolved
tests/test_tokenization_deberta_v2.py Outdated Show resolved Hide resolved
@eyal-str

@LysandreJik Thanks for the fix.
Can you merge this PR, please?

@@ -54,7 +58,7 @@ def __init__(self, config):
self.dropout = StableDropout(config.pooler_dropout)
self.config = config

def forward(self, hidden_states, mask=None):
Contributor


Nice catch

return self.drop_prob


def MaskedLayerNorm(layerNorm, input, mask=None):
Contributor


I think this design choice is a bit confusing. The function is only applied once, and it's a bit surprising to me that its name is upper-cased like a class.

I would simply inline the necessary code at the only occurrence of MaskedLayerNorm below - it makes the code more readable IMO

rmask = (1 - input_mask).bool()
out.masked_fill_(rmask.unsqueeze(-1).expand(out.size()), 0)
out = ACT2FN[self.conv_act](self.dropout(out))
output_states = MaskedLayerNorm(self.LayerNorm, residual_states + out, input_mask)
Contributor


This is the only time MaskedLayerNorm is used. I would simply add the logic required for MaskedLayerNorm here -> it improves readability a lot
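For reference, what the helper computes can be sketched in plain Python. This is an illustrative re-implementation, not the transformers code, and it elides LayerNorm's learnable scale and bias: each position's hidden vector is normalized, and positions the mask excludes are zeroed out.

```python
import math

def masked_layer_norm(states, mask, eps=1e-7):
    # states: one hidden vector per position; mask: 1 = keep, 0 = zero out
    out = []
    for vec, keep in zip(states, mask):
        if not keep:
            out.append([0.0] * len(vec))
            continue
        mean = sum(vec) / len(vec)
        var = sum((x - mean) ** 2 for x in vec) / len(vec)
        out.append([(x - mean) / math.sqrt(var + eps) for x in vec])
    return out


# second position is masked (e.g. padding), so it comes back as zeros
normed = masked_layer_norm([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], mask=[1, 0])
```

Inlining this small amount of logic at the single call site, as suggested, keeps the control flow visible without a separately named helper.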

self.LayerNorm = LayerNorm(config.hidden_size, config.layer_norm_eps, elementwise_affine=True)

kernel_size = getattr(config, "conv_kernel_size", 0)
self.with_conv = False
Contributor


Can we maybe just have:

self.conv = ConvLayer(config) if getattr(config, "conv_kernel_size", 0) > 0 else None

and delete self.with_conv? Then below we can just check whether self.conv is None or not -> it saves us a couple of lines and the attribute self.with_conv which is hardcoded here

self.with_conv = True
self.conv = ConvLayer(config)

def get_rel_embedding(self):
Contributor


This function seems to compute the same output every time it's called -> can we maybe just save the output in __init__ instead of recomputing it?

self,
hidden_states,
attention_mask,
output_hidden_states=True,
Contributor


this should be set to False by default I think



def make_log_bucket_position(relative_pos, bucket_size, max_position):
sign = np.sign(relative_pos)
Contributor


maybe in a follow-up PR we can have this in PyTorch - but ok for me for now!
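The bucketing itself is simple enough to sketch with scalars. The version below is an illustrative reading of the numpy helper under discussion (exact edge-case behavior in the library may differ): relative positions inside the bucket pass through unchanged, while larger distances are compressed into log-spaced buckets.

```python
import math

def log_bucket_position(rel_pos, bucket_size, max_position):
    # small relative positions are kept exactly; large ones are squashed
    # logarithmically so that distant tokens share coarser buckets
    mid = bucket_size // 2
    if -mid < rel_pos < mid:
        return rel_pos
    abs_pos = abs(rel_pos)
    log_pos = math.ceil(
        math.log(abs_pos / mid) / math.log((max_position - 1) / mid) * (mid - 1)
    ) + mid
    return log_pos if rel_pos > 0 else -log_pos
```

With bucket_size=8 and max_position=64, a nearby token keeps its exact offset while distant offsets like 10 or -50 collapse to small bucket indices, which is what lets the model cover long ranges with few relative-position embeddings.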

"heads (%d)" % (config.hidden_size, config.num_attention_heads)
)
self.num_attention_heads = config.num_attention_heads
_attention_head_size = int(config.hidden_size / config.num_attention_heads)
Contributor


(nit) this is the same as config.hidden_size // config.num_attention_heads

)

# bxhxlxd
_attention_probs = XSoftmax.apply(attention_scores, attention_mask, -1)
Contributor


why _attention_probs instead of attention_probs here?
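For context, the masked softmax that XSoftmax applies here can be sketched in plain Python. This is illustrative only; the real op is a custom autograd function that also handles the backward pass. It assumes at least one unmasked position.

```python
import math

def masked_softmax(scores, mask):
    # mask: 1 = attend, 0 = ignore; masked positions get probability 0
    mx = max(s for s, m in zip(scores, mask) if m)  # subtract max for stability
    exps = [math.exp(s - mx) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# the masked (e.g. padding) position contributes nothing to attention
probs = masked_softmax([1.0, 2.0, 3.0], mask=[1, 1, 0])
```

The unmasked probabilities still sum to one, so padding tokens are excluded from attention without renormalizing afterwards.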

relative_pos = relative_pos.unsqueeze(0).unsqueeze(0)
elif relative_pos.dim() == 3:
relative_pos = relative_pos.unsqueeze(1)
# bxhxqxk
Contributor


(nit) these comments look a bit cryptic -> maybe (bsz x ... x ...) would be better

if not os.path.isfile(vocab_file):
raise ValueError(
"Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "
"model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
Contributor


Suggested change
"model use `tokenizer = XxxTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)
"model use `tokenizer = DebertaV2Tokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`".format(vocab_file)

@@ -773,6 +775,18 @@ def _init_weights(self, module):
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()

def _pre_load_hook(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
Contributor


Can we add some more explanation here on what this does? Also, are we sure that this doesn't have any backward-breaking changes for deberta_v1?

Contributor

@patrickvonplaten patrickvonplaten left a comment


Awesome! Thanks so much for adding this super important model @BigBird01 ! I left a couple of comments in the modeling_deberta_v2.py file - it would be great if we can make the code a bit cleaner there, e.g.:

  • remove the use_conv attribute
  • set output_hidden_states=False as a default
  • refactor the MaskLayerNorm class

Those changes should be pretty trivial - thanks so much for all your work!

@BigBird01
Contributor Author

Awesome! Thanks so much for adding this super important model @BigBird01 ! I left a couple of comments in the modeling_deberta_v2.py file - it would be great if we can make the code a bit cleaner there, e.g.:

  • remove the use_conv attribute
  • set output_hidden_states=False as a default
  • refactor the MaskLayerNorm class

Those changes should be pretty trivial - thanks so much for all your work!

Thank you @patrickvonplaten! I will take a look at it soon.

@LysandreJik
Member

As seen with @BigBird01, taking over the PR!

LysandreJik and others added 4 commits February 19, 2021 16:33
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@LysandreJik LysandreJik merged commit 9a7e637 into huggingface:master Feb 19, 2021
@BigBird01
Contributor Author

As seen with @BigBird01, taking over the PR!

Thank you @LysandreJik !

@LysandreJik
Member

My pleasure! Thank you for your work!
