Add memory-efficient attention and optional features to Llama #22386
Conversation
… shared input/output embedding, add hidden/attention dropout
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for your PR. Transformers is not meant to be a modular toolbox, so we don't add every feature to every model. Llama was trained without stable embedding or shared input-output vectors, so we won't add them to the modeling code of Llama. Likewise for the dropouts.
Since you are training new models using this code, as soon as you have checkpoints available, I would advise making a PR with a new model (mostly copied from Llama), as we have for the variants of GPT-2, for instance.
Thank you for your response. The memory_efficient_attention operator in xformers is actually mentioned in the original Llama paper, so it should be possible to integrate this component into the Llama training code.
@Bayes-Song cc: @sgugger
If it's non-breaking and actually faster on all setups, we can add it, yes. For the time being, the PR makes other modifications that we cannot accept, as mentioned in my comment above.
I have now trained a new model based on the above changes and am adding it to the transformers library following @sgugger's suggestion. I will open a new PR once the code is finished.
This PR adds memory-efficient attention to Llama, resulting in a 30% improvement in training efficiency. We also removed some transposes to match the tensor shapes expected by the memory_efficient_attention operator. Additionally, we added hidden dropout and attention dropout to the model, which helps generalization during training.
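For reference, the fused call looks roughly like the sketch below. This is a minimal illustration, not the exact diff from this PR: it assumes a `[batch, seq_len, n_heads, head_dim]` layout (which is what lets the usual transposes be dropped) and uses the public `xformers.ops` API.

```python
# Minimal sketch of a fused attention forward pass with xformers.
# q, k, v are assumed to be [bsz, seq_len, n_heads, head_dim];
# the operator handles the softmax(QK^T)V computation in one kernel.
import xformers.ops as xops

def attention_forward(q, k, v, attention_dropout=0.0, training=True):
    attn_output = xops.memory_efficient_attention(
        q, k, v,
        attn_bias=xops.LowerTriangularMask(),      # causal mask
        p=attention_dropout if training else 0.0,  # attention dropout
    )
    # [bsz, seq_len, n_heads, head_dim] -> [bsz, seq_len, hidden_size]
    return attn_output.reshape(q.shape[0], q.shape[1], -1)
```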
Furthermore, two optional features have been added: stable embedding, as used in BLOOM, and shared input/output embeddings, as used in PaLM. These features have been tested and found to improve training stability and performance.
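As a rough illustration of the two optional features (module and attribute names here are illustrative, not the PR's actual code):

```python
import torch.nn as nn

class StableEmbedding(nn.Module):
    """BLOOM-style 'stable embedding': LayerNorm applied right after
    the token embedding lookup."""
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids):
        return self.norm(self.embed(input_ids))

def tie_output_to_input(model):
    # PaLM-style shared input/output embedding: the LM head reuses the
    # token embedding matrix instead of learning a separate projection.
    # Attribute names follow the LlamaForCausalLM layout for illustration.
    model.lm_head.weight = model.model.embed_tokens.weight
```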
The main changes are as follows:
- Added a dependency on xformers, since we use operators from that library.
- Implemented pre-training of the Llama model based on transformers + accelerate, incorporating the modifications described above (a rough training-step sketch follows below): https://github.com/Bayes-Song/Open-Llama/blob/main/README_en.md
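A rough outline of what a pre-training step with transformers + accelerate looks like, using a tiny placeholder config and random token data purely for illustration:

```python
# Illustrative pre-training loop sketch; real runs use the full-size
# config and a properly tokenized dataset instead of random tokens.
import torch
from torch.utils.data import DataLoader
from transformers import LlamaConfig, LlamaForCausalLM
from accelerate import Accelerator

config = LlamaConfig(hidden_size=128, num_hidden_layers=2, num_attention_heads=4,
                     intermediate_size=256, vocab_size=1000)
model = LlamaForCausalLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(
    [{"input_ids": torch.randint(0, 1000, (128,))} for _ in range(8)],
    batch_size=2,
)

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    outputs = model(input_ids=batch["input_ids"], labels=batch["input_ids"])
    accelerator.backward(outputs.loss)  # handles mixed precision / distributed
    optimizer.step()
    optimizer.zero_grad()
```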