Add memory-efficient attention and optional features to Llama #22386
Conversation
… shared input/output embedding, add hidden/attention dropout
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for your PR. Transformers is not meant to be a modular toolbox, so we don't add every feature to every model. Llama was trained without stable embedding or shared input-output vectors, so we won't add them to the modeling code of Llama. Likewise for the dropouts.
Since you are training new models using this code, as soon as you have checkpoints available, I would advise making a PR with a new model (mostly copied from Llama), as we have for the variants of GPT-2, for instance.
Thank you for your response. The memory_efficient_attention operator in xformers is actually mentioned in the original Llama paper, so it should be possible to integrate this component into the Llama training code.
@Bayes-Song cc: @sgugger
If it's non-breaking and actually faster on all setups, we can add it, yes. For the time being, the PR makes other modifications that we cannot accept, as mentioned in my comment above.
I have now trained a new model based on the above changes and am adding it to the transformers library following @sgugger's suggestion. I will open a new PR once the code is finished.
This PR adds memory-efficient attention to Llama, resulting in a 30% improvement in training efficiency. We also removed some transposes to match the tensor shapes expected by the memory_efficient_attention operator. Additionally, we added hidden dropout and attention dropout to the model, which helps generalization during training.
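For reference, the fused call looks roughly like the sketch below. This is a minimal illustration, not the exact diff from this PR: it assumes a `[batch, seq_len, n_heads, head_dim]` layout (which is what lets the usual transposes be dropped) and uses the public `xformers.ops` API.

```python
# Minimal sketch of a fused attention forward pass with xformers.
# q, k, v are assumed to be [bsz, seq_len, n_heads, head_dim];
# the operator handles the softmax(QK^T)V computation in one kernel.
import xformers.ops as xops

def attention_forward(q, k, v, attention_dropout=0.0, training=True):
    attn_output = xops.memory_efficient_attention(
        q, k, v,
        attn_bias=xops.LowerTriangularMask(),      # causal mask
        p=attention_dropout if training else 0.0,  # attention dropout
    )
    # [bsz, seq_len, n_heads, head_dim] -> [bsz, seq_len, hidden_size]
    return attn_output.reshape(q.shape[0], q.shape[1], -1)
```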
Furthermore, two optional features have been added: stable embedding, as used in BLOOM, and shared input/output embeddings, as used in PaLM. These features have been tested and found to improve training stability and performance.
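As a rough illustration of the two optional features (module and attribute names here are illustrative, not the PR's actual code):

```python
import torch.nn as nn

class StableEmbedding(nn.Module):
    """BLOOM-style 'stable embedding': LayerNorm applied right after
    the token embedding lookup."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids):
        return self.norm(self.embed(input_ids))

def tie_output_to_input(model):
    # PaLM-style shared input/output embedding: the LM head reuses the
    # token embedding matrix instead of learning a separate projection.
    # Attribute names follow the LlamaForCausalLM layout for illustration.
    model.lm_head.weight = model.model.embed_tokens.weight
```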
The main changes are as follows:
- Added a dependency on xformers, since we use operators from that library.
- Implemented pre-training of the Llama model based on transformers + accelerate, incorporating the modifications described above (a rough training-step sketch follows below): https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md
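A rough outline of what a pre-training step with transformers + accelerate looks like, using a tiny placeholder config and random token data purely for illustration:

```python
# Illustrative pre-training loop sketch; real runs use the full-size
# config and a properly tokenized dataset instead of random tokens.
import torch
from torch.utils.data import DataLoader
from transformers import LlamaConfig, LlamaForCausalLM
from accelerate import Accelerator

config = LlamaConfig(hidden_size=128, num_hidden_layers=2, num_attention_heads=4,
                     intermediate_size=256, vocab_size=1000)
model = LlamaForCausalLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(
    [{"input_ids": torch.randint(0, 1000, (128,))} for _ in range(8)],
    batch_size=2,
)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    accelerator.backward(outputs.loss)  # handles mixed precision / distributed
    optimizer.step()
    optimizer.zero_grad()
```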