[Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM #22105

Merged (43 commits into huggingface:main) Nov 25, 2024

Conversation

@ArthurZucker (Collaborator) commented Mar 11, 2023

What does this PR do?

Refactor both Deberta and DebertaV2 to make them more compatible with the overall transformers library (see the usage sketch after the issue list below).

Should fix a bunch of issues related to torch-scripting with Deberta:
fixes #15216
fixes #15673
fixes #16456
fixes #18659
fixes #21300
fixes #20815
fixes #12436
fixes #18674
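
For illustration only (not part of the original PR description), here is a minimal sketch of the kind of usage this refactor is meant to enable, i.e. running the model under torch.compile; the checkpoint name and inputs are placeholders:

# Illustrative sketch: exercising a DeBERTa checkpoint under torch.compile.
# The checkpoint and example inputs are placeholders, not prescribed by this PR.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")
model.eval()

compiled_model = torch.compile(model)

inputs = tokenizer("DeBERTa refactor smoke test", return_tensors="pt")
with torch.no_grad():
    outputs = compiled_model(**inputs)
print(outputs.last_hidden_state.shape)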

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker changed the title from "[WIP] Refactor" to "[WIP] Refactor Deberta/Deberta-v2" Mar 13, 2023
@ArthurZucker self-assigned this Mar 13, 2023
@huggingface deleted a comment from the github-actions bot Apr 11, 2023
@github-actions bot closed this May 13, 2023
@hriaz17 commented May 22, 2023

Hey @ArthurZucker, any updates on this? Is there an ETA for when it will be merged into main?

@ArthurZucker reopened this May 23, 2023
@ArthurZucker (Collaborator, Author) commented:

Hey! Just got back from holidays, this should be my main focus in the coming days!

@ArthurZucker (Collaborator, Author) commented:

Sorry! Seems like I had to postpone this! If anyone wants to take over, feel free to do it; otherwise it will be my priority once #23909 is merged!

@zynaa commented Jun 29, 2023

Regarding the z_steps in DebertaV2Model: I think this code is relevant for the enhanced mask decoder of the generator model:

# Expand a 2D padding mask to a 4D self-attention mask
if attention_mask.dim() <= 2:
    extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
    att_mask = extended_attention_mask.byte()
    attention_mask = att_mask * att_mask.squeeze(-2).unsqueeze(-1)
elif attention_mask.dim() == 3:
    attention_mask = attention_mask.unsqueeze(1)
target_mask = target_ids > 0
hidden_states = encoder_layers[-2]
if not self.position_biased_input:
    # Enhanced mask decoder: rerun the last encoder layer twice (z_steps hardcoded to 2),
    # seeding the query states with the position embeddings (z_states)
    layers = [encoder.layer[-1] for _ in range(2)]
    z_states += hidden_states
    query_states = z_states
    query_mask = attention_mask
    outputs = []
    rel_embeddings = encoder.get_rel_embedding()

    for layer in layers:
        # TODO: pass relative pos ids
        output = layer(hidden_states, query_mask, return_att=False, query_states=query_states,
                       relative_pos=relative_pos, rel_embeddings=rel_embeddings)
        query_states = output
        outputs.append(query_states)
else:
    outputs = [encoder_layers[-1]]
As far as I can tell, z_steps is hardcoded to 2 here, although it should still be left at 0 for the discriminator. Adding z_steps to the config seems like a good idea.

z_states represents the position embeddings, which are non-zero if position_biased_input is set to True. They are passed from the embedding layer. So in order to properly implement this, I think we need to return the position embeddings here:

class DebertaV2Embeddings(nn.Module):
    def forward(self, input_ids=None, token_type_ids=None, position_ids=None, mask=None, inputs_embeds=None):
        ...

        # also return the position embeddings so DebertaV2Model can reuse them for the z_steps decoding
        return embeddings, position_embeddings

and implement the z_steps like this:

class DebertaV2Model(DebertaV2PreTrainedModel):
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutput]:
        ...

        embedding_output, position_embedding_output = self.embeddings(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )
        ...

        if self.z_steps > 0:
            # Enhanced mask decoder: rerun the last encoder layer z_steps times,
            # seeding the query states with the absolute position embeddings.
            hidden_states = encoded_layers[-2]
            layers = [self.encoder.layer[-1] for _ in range(self.z_steps)]
            position_embedding_output += hidden_states
            query_states = position_embedding_output
            query_mask = self.encoder.get_attention_mask(attention_mask)
            rel_embeddings = self.encoder.get_rel_embedding()
            rel_pos = self.encoder.get_rel_pos(embedding_output)
            for layer in layers:
                query_states = layer(
                    hidden_states,
                    query_mask,
                    output_attentions=False,
                    query_states=query_states,
                    relative_pos=rel_pos,
                    rel_embeddings=rel_embeddings,
                )
                encoded_layers = encoded_layers + (query_states,)
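
To make the config suggestion above concrete, here is a minimal, illustrative sketch of exposing z_steps as an optional config attribute; the default of 0 (discriminator behavior) and the exact way the attribute is read are assumptions, not something settled in this thread:

# Illustrative sketch: exposing z_steps through the config, defaulting to 0 so
# discriminator-style checkpoints keep their current behavior.
from transformers import DebertaV2Config, DebertaV2Model

config = DebertaV2Config.from_pretrained("microsoft/deberta-v2-xlarge")
config.z_steps = 2  # hypothetical attribute for generator-style checkpoints
model = DebertaV2Model(config)

# Inside the model, the attribute would be read with a safe default, e.g.:
# self.z_steps = getattr(config, "z_steps", 0)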

@huggingface deleted a comment from the github-actions bot Jul 24, 2023
@huggingface deleted a comment from the github-actions bot Aug 17, 2023
@huggingface deleted a comment from the github-actions bot Sep 13, 2023
@huggingface deleted a comment from the github-actions bot Oct 13, 2023
@huggingface deleted a comment from the github-actions bot Nov 8, 2023
@huggingface deleted a comment from the github-actions bot Dec 4, 2023
@ArthurZucker added the WIP label Jan 3, 2024
@huggingface deleted a comment from the github-actions bot Jan 3, 2024
@Bachstelze:

What is the status? The logs of the CI checks have expired.

@ArthurZucker (Collaborator, Author) commented:

#27734 should help with some of the issues in the meantime.

@serenachou:

Hi there, just checking in with @ArthurZucker on whether there's any progress here, please?

@ArthurZucker (Collaborator, Author) commented Sep 6, 2024

Hey hey! Sorry, I ended up dropping this; let me get back to you next week!

@ArthurZucker (Collaborator, Author) commented:

I reviewed #27734 and will take it over this weekend if possible.

@ArthurZucker (Collaborator, Author) commented:

Anyone up for the task is welcome to take it on as well 🤗

@LysandreJik (Member) left a comment

Ok! Can you document, on DeBERTa's documentation page, the evolution that the integration has had? I think it's important that users have easily accessible information about the initial contribution and how this refactor contributes to improving every aspect of DeBERTa.

Thanks!

@ArthurZucker (Collaborator, Author) commented:

Had to skip some PT-TF equivalence tests. The slow tests ran for me and are passing.
If anyone has a problem, it will be quick to fix!

@ArthurZucker merged commit 857d46c into huggingface:main Nov 25, 2024
26 checks passed
@ArthurZucker deleted the refactor-deberta branch November 25, 2024 at 09:43
@joshdevins (Contributor):

Nice one, thanks @ArthurZucker!

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Dec 4, 2024
Description

After a recent change in
transformers (huggingface/transformers#22105),
PEFT could no longer determine the word embeddings from Deberta. This PR
provides a very minimal fix that correctly determines the word
embeddings again.

Details

Previously, the word embeddings were determined in the following manner:

1. Find the transformers_backbone by checking the base model's children
for PreTrainedModel instances
2. If not found, the model itself is considered the transformers
backbone.
3. On the backbone, check for modules whose weight has the same size as
the vocab size. This module is now assumed to be the word embeddings.

Before the mentioned transformers PR, step 1 did not find anything, so step 2
was applied. After the PR, however, DebertaEncoder is now an instance of
PreTrainedModel (asked internally; this is intended). Therefore, the encoder
is now considered the transformers backbone. But the encoder does not have the
word embeddings attribute, so step 3 fails.

The fix of this PR is to first explicitly check for
model.embeddings.word_embeddings and, if this attribute is found, use it
as the word embeddings. Only when it's not found do we use the other
method described above. This way, we can successfully determine the word
embeddings on models like Deberta.
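
As an aside, the fallback described above might look roughly like the following; this is only an illustrative sketch, not the actual PEFT code, and the helper name _find_word_embeddings is made up:

# Illustrative sketch of the described fallback, not the actual PEFT implementation.
import torch.nn as nn

def _find_word_embeddings(model, vocab_size):
    # 1) Prefer the explicit attribute if the model exposes it (e.g. Deberta).
    word_embeddings = getattr(getattr(model, "embeddings", None), "word_embeddings", None)
    if isinstance(word_embeddings, nn.Embedding):
        return word_embeddings
    # 2) Otherwise fall back to scanning modules whose weight matches the vocab size.
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is not None and weight.shape[0] == vocab_size:
            return module
    return None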

This whole code is a bit messy and could probably be improved. However,
changing the logic too much could inadvertently break some existing
models that are not covered by the tests. Therefore, I chose this
approach, which leaves the existing logic mostly intact.
BenjaminBossan added a commit to huggingface/peft that referenced this pull request Dec 4, 2024
(same commit message as above)
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
… and fix LLM (huggingface#22105)

* some modification for roadmap

* revert some changes

* yups

* weird

* make it work

* sttling

* fix-copies

* fixup

* renaming

* more fix-copies

* move stuff around

* remove torch script warnings

* ignore copies

* revert bad changes

* woops

* just styling

* nit

* revert

* style fixup

* nits configuration style

* fixup

* nits

* will this fix the tf pt issue?

* style

* ???????

* update

* eval?

* update error message

* updates

* style

* grumble grumble

* update

* style

* nit

* skip torch fx tests that were failing

* style

* skip the failing tests

* skip another test and make style
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
… and fix LLM (huggingface#22105) (same commit message as above)