Add memory-efficient attention and optional features to Llama #22386

Closed · wants to merge 7 commits

Conversation

@s-JoL (Contributor) commented Mar 27, 2023

This PR adds memory-efficient attention to Llama, resulting in a 30% improvement in training efficiency. We also removed some transposes to match the input shapes expected by the memory_efficient_attention operator. Additionally, we added hidden dropout and attention dropout to the model, which helps improve generalization during training.

Furthermore, two optional features have been added: stable embedding, as used in BLOOM, and shared input-output embeddings, as used in PaLM. These features have been tested and found to improve training stability and performance; a minimal sketch of both is shown below.
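
A rough sketch of what these two options look like in a decoder-style model. The module and config attribute names here are assumptions for illustration, not the exact names used in this PR:

import torch.nn as nn

class LlamaDecoderSketch(nn.Module):  # illustrative only, not the PR's exact class
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        # Stable embedding (as in BLOOM): LayerNorm applied right after the token embedding.
        self.embed_layer_norm = (
            nn.LayerNorm(config.hidden_size) if config.use_stable_embedding else None
        )
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        # Shared input-output embeddings (as in PaLM): tie the LM head weight
        # to the input embedding matrix.
        if config.shared_input_output_embedding:
            self.lm_head.weight = self.embed_tokens.weight
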
The main changes are as follows:

# Training path: when xformers is available, use its fused memory-efficient attention
# (it expects (batch, seq_len, num_heads, head_dim) inputs, hence the removed transposes).
if xops is not None and self.training:
    attn_weights = None
    attn_output = xops.memory_efficient_attention(
        query_states, key_states, value_states,
        attn_bias=self.causal_mask, p=self.dropout_prob,
    )

As we use operators from the xformers library, we need to add a dependency on xformers.
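
One way to keep that dependency optional is to guard the import; this is a minimal sketch (the PR may guard it differently), with `xops` matching the name used in the snippet above:

try:
    import xformers.ops as xops  # optional dependency, used only for the fused attention path
except ImportError:
    xops = None  # fall back to the standard Llama attention implementation
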

We implemented pre-training of the Llama model based on transformers + accelerate, incorporating the modifications described above.
https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sgugger (Collaborator) left a comment


Thanks for your PR. Transformers is not meant to be a modular toolbox, so we don't add every feature to every model. Llama was trained without stable embedding or shared input-output vectors, so we won't add them to the modeling code of Llama. Likewise for the dropouts.

Since you are training new models using this code, as soon as you have checkpoints available I would advise making a PR with a new model (mostly copied from Llama), as we have for all the GPT-2 variants, for instance.

@s-JoL (Contributor, Author) commented Mar 27, 2023

> Thanks for your PR. Transformers is not meant to be a modular toolbox, so we don't add every feature to every model. Llama was trained without stable embedding or shared input-output vectors, so we won't add them to the modeling code of Llama. Likewise for the dropouts.
>
> Since you are training new models using this code, as soon as you have checkpoints available I would advise making a PR with a new model (mostly copied from Llama), as we have for all the GPT-2 variants, for instance.

Thank you for your response. The memory_efficient_attention operator from xformers is actually mentioned in the original Llama paper, so it should be possible to integrate this component into the Llama training code.

@abodacs (Contributor) commented Apr 9, 2023

@Bayes-Song
Thanks for the PR. Can we use this once PyTorch 2.0 is supported, like in
https://github.com/huggingface/diffusers/pull/2303/files ?

cc: @sgugger

@sgugger (Collaborator) commented Apr 10, 2023

If it's non-breaking and actually faster on all setups, yes, we can add it. For the time being, this PR makes other modifications that we cannot accept, as mentioned in my comment above.
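
For reference, the PyTorch 2.0 alternative discussed here is torch.nn.functional.scaled_dot_product_attention. A minimal sketch follows; the variable names match the snippet in the PR description and are assumptions, not code from either PR:

import torch.nn.functional as F

# PyTorch >= 2.0 fused attention with a causal mask; inputs are expected as
# (batch, num_heads, seq_len, head_dim), unlike xformers' memory_efficient_attention,
# which takes (batch, seq_len, num_heads, head_dim).
attn_output = F.scaled_dot_product_attention(
    query_states, key_states, value_states,
    dropout_p=0.0, is_causal=True,
)
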

@s-JoL (Contributor, Author) commented Apr 13, 2023

I have now trained a new model based on the above changes, and I am adding it as a new model to the transformers library following @sgugger's suggestion. I will open a new PR once the code is finished.

@s-JoL closed this Apr 21, 2023
@s-JoL deleted the main branch on April 28, 2023 at 14:36