[Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM #22105
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hey @ArthurZucker, any updates on this? ETA for when it will be merged into main?
Hey! Just got back from holidays, this should be my main focus in the coming days!
Sorry! Seems like I had to postpone this! If anyone wants to take over, feel free to do it; otherwise it will be my priority once #23909 is merged!
Regarding the following code from the original implementation:

```python
if attention_mask.dim() <= 2:
    extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
    att_mask = extended_attention_mask.byte()
    attention_mask = att_mask * att_mask.squeeze(-2).unsqueeze(-1)
elif attention_mask.dim() == 3:
    attention_mask = attention_mask.unsqueeze(1)

target_mask = target_ids > 0
hidden_states = encoder_layers[-2]
if not self.position_biased_input:
    layers = [encoder.layer[-1] for _ in range(2)]
    z_states += hidden_states
    query_states = z_states
    query_mask = attention_mask
    outputs = []
    rel_embeddings = encoder.get_rel_embedding()
    for layer in layers:
        # TODO: pass relative pos ids
        output = layer(hidden_states, query_mask, return_att=False, query_states=query_states,
                       relative_pos=relative_pos, rel_embeddings=rel_embeddings)
        query_states = output
        outputs.append(query_states)
else:
    outputs = [encoder_layers[-1]]
```

As far as I can tell, they hardcoded z_steps to 2 here, although it should still be left as 0 for the discriminator. Adding z_steps to the config seems like a good idea.
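A minimal sketch of that suggestion, assuming z_steps is simply forwarded through the config (DebertaV2Config has no such argument today, and the subclass name below is made up purely for illustration):

```python
from transformers import DebertaV2Config, DebertaV2Model


class DebertaV2ModelWithZSteps(DebertaV2Model):
    """Hypothetical subclass: read z_steps from the config instead of hardcoding it."""

    def __init__(self, config: DebertaV2Config):
        super().__init__(config)
        # 2 for the generator / MLM setting, 0 (the default) for the discriminator.
        self.z_steps = getattr(config, "z_steps", 0)


# Unknown kwargs are stored on the config as plain attributes, so `z_steps` round-trips.
# Small sizes keep this quick to instantiate; they are otherwise arbitrary.
config = DebertaV2Config(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64, z_steps=2,
)
model = DebertaV2ModelWithZSteps(config)
```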
```python
class DebertaV2Embeddings(nn.Module):
    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None):
        ...
        return embeddings, position_embeddings
```

and implement the `DebertaV2Model` as:

```python
class DebertaV2Model(DebertaV2PreTrainedModel):
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutput]:
        ...
        embedding_output, position_embedding_output = self.embeddings(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        ...
        if self.z_steps > 0:
            hidden_states = encoded_layers[-2]
            layers = [self.encoder.layer[-1] for _ in range(self.z_steps)]
            position_embedding_output += hidden_states
            query_states = position_embedding_output
            query_mask = self.encoder.get_attention_mask(attention_mask)
            rel_embeddings = self.encoder.get_rel_embedding()
            rel_pos = self.encoder.get_rel_pos(embedding_output)
            for layer in layers:
                query_states = layer(
                    hidden_states,
                    query_mask,
                    output_attentions=False,
                    query_states=query_states,
                    relative_pos=rel_pos,
                    rel_embeddings=rel_embeddings,
                )
            encoded_layers = encoded_layers + (query_states,)
```
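For context, a quick way to see the effect of the extra decoding pass would be to compare the number of returned hidden states with and without it (a hypothetical test sketch against a tiny randomly initialized model, not part of the proposal above):

```python
import torch
from transformers import DebertaV2Config, DebertaV2Model

# Tiny config so the check runs instantly; the sizes are arbitrary.
config = DebertaV2Config(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64,
)
model = DebertaV2Model(config)

input_ids = torch.randint(0, config.vocab_size, (1, 8))
out = model(input_ids, output_hidden_states=True)

# With z_steps == 0 (the current default), hidden_states holds the embedding output
# plus one entry per encoder layer; the proposal above would append one more entry
# per z_step taken by the enhanced decoding loop.
print(len(out.hidden_states))  # num_hidden_layers + 1
```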
What is the status?
#27734 should help with some of the issues in the meantime.
Any update on this? We're blocked by some of the issues that this is meant to fix. Specifically:
Hi there, just checking in @ArthurZucker on whether there's any progress here, please?
Hey hey! Sorry I ended up dropping this, let me get back to you next week!
I reviewed #27734, will take it over this weekend if possible.
Anyone who is up for the task can try to do it as well 🤗
Ok! Can you document on DeBERTa's documentation page the evolution that the integration has had? I think it's important that users have easily accessible information about the initial contribution and how this refactor improves every aspect of DeBERTa.
Thanks!
Had to skip some pt-tf equivalence tests. The slow tests ran for me and are passing.
Nice one, thanks @ArthurZucker!
Description
After a recent change in transformers (huggingface/transformers#22105), PEFT could no longer determine the word embeddings from Deberta. This PR provides a very minimal fix that correctly determines the word embeddings again.
Details
Previously, the word embeddings were determined in the following manner:
1. Find the transformers_backbone by checking the base model's children for PreTrainedModel instances.
2. If not found, the model itself is considered the transformers backbone.
3. On the backbone, check for modules whose weight has the same size as the vocab size. This module is then assumed to be the word embeddings.
Before the mentioned transformers PR, step 1 did not find anything, so step 2 was applied. After the PR, however, the DebertaEncoder is an instance of PreTrainedModel (asked internally, this is intended). Therefore, the encoder is now considered the transformers backbone. But the encoder does not have the word embeddings attribute, so step 3 fails.
The fix in this PR is to first explicitly check for model.embeddings.word_embeddings and, if this attribute is found, use it as the word embeddings. Only when it is not found do we fall back to the method described above. This way, we can successfully determine the word embeddings on models like Deberta.
This whole code is a bit messy and could probably be improved. However, changing the logic too much could inadvertently break some existing models that are not included in the tests. Therefore, I chose this approach, which leaves the existing logic mostly intact.
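A minimal sketch of the lookup order described above (the helper name `get_word_embeddings` is hypothetical; the actual PEFT code differs):

```python
import torch.nn as nn
from transformers import PreTrainedModel


def get_word_embeddings(model: nn.Module, vocab_size: int) -> nn.Module:
    # Hypothetical sketch of the lookup order described above, not the actual PEFT code.
    # 1) Prefer the explicit `embeddings.word_embeddings` attribute if it exists (the fix).
    word_embeddings = getattr(getattr(model, "embeddings", None), "word_embeddings", None)
    if isinstance(word_embeddings, nn.Embedding):
        return word_embeddings

    # 2) Otherwise, locate the transformers backbone: the first child that is a
    #    PreTrainedModel, falling back to the model itself.
    backbone = next(
        (child for child in model.children() if isinstance(child, PreTrainedModel)), model
    )

    # 3) On the backbone, pick the module whose weight matches the vocabulary size.
    for module in backbone.modules():
        weight = getattr(module, "weight", None)
        if weight is not None and weight.shape[0] == vocab_size:
            return module

    raise ValueError("Could not determine the word embeddings.")
```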
… and fix LLM (huggingface#22105) * some modification for roadmap * revert some changes * yups * weird * make it work * sttling * fix-copies * fixup * renaming * more fix-copies * move stuff around * remove torch script warnings * ignore copies * revert bad changes * woops * just styling * nit * revert * style fixup * nits configuration style * fixup * nits * will this fix the tf pt issue? * style * ??????? * update * eval? * update error message * updates * style * grumble grumble * update * style * nit * skip torch fx tests that were failing * style * skip the failing tests * skip another test and make style
What does this PR do?
Refactor both Deberta and DebertaV2 to make them more compatible with the overall transformers library.
Should fix a bunch of issues related to torch-scripting with Deberta; a usage sketch follows the list of fixes below:
fixes #15216
fixes #15673
fixes #16456
fixes #18659
fixes #21300
fixes #20815
fixes #12436
fixes #18674
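As a rough illustration of what "support compile, export" is meant to enable (a minimal sketch; the checkpoint name and export settings are assumptions for illustration, not part of this PR):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any DeBERTa / DeBERTa-v2 checkpoint should behave the same; this one is just an example.
checkpoint = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("Hello, DeBERTa!", return_tensors="pt")

# TorchScript export, which several of the linked issues report as broken before the refactor.
ts_model = AutoModel.from_pretrained(checkpoint, torchscript=True)
traced = torch.jit.trace(ts_model, (inputs["input_ids"], inputs["attention_mask"]))

# torch.compile should also work once the control flow is trace/compile friendly.
model = AutoModel.from_pretrained(checkpoint)
compiled = torch.compile(model)
with torch.no_grad():
    outputs = compiled(**inputs)
```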