Add FA2 & SDPA support for RoBERTa & XLM-RoBERTa #30450
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for this PR; using SDPA saved a ton of memory for the GPU poor like me. Super grateful!
@tomaarsen do you need a review on this one?
@ArthurZucker Would be nice, though there are some conflicts now. I'll be off next week, so I'll be able to take care of the conflicts & any comments starting again on the 17th.
Looks pretty clean already, thanks a lot @tomaarsen! Can you make sure to propagate the changes into the encoders that copy from RoBERTa by running `make fix-copies`? You would also need to update this file: https://github.com/huggingface/transformers/blob/main/docs/source/en/perf_infer_gpu_one.md to mention RoBERTa and all the other models that now support FA2 & SDPA.
You also need to fix the merge conflicts, which should be easy to fix! 🙏
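(For readers unfamiliar with the mechanism referenced above: transformers marks duplicated code with a `# Copied from ...` comment, and `make fix-copies` re-synchronises every marked block from its source. A trimmed illustration of what such a marker looks like; the class body here is indicative rather than quoted from the repository:)

```python
import torch.nn as nn


# Copied from transformers.models.bert.modeling_bert.BertSelfOutput with Bert->Roberta
class RobertaSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        # Project, drop out, then add & normalise the residual, mirroring the BERT source.
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```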
There is another similar PR by the way: #30510
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello!
Pull Request overview
Details
The world of embedding models still very much relies on bert, roberta, xlm_roberta, mpnet, etc., but these model architectures have not yet received the benefits of FA2/SDPA. I'd like to make a start with that today.
I recognize that these models are tricky to change, as BERT especially is tangled in a big web of "Copied from" connections. That said, I suspect I've implemented FA2/SDPA in a way that could be extended to a lot of architectures, but I'd like to get reviews on the current implementation before I potentially expand to new ones.
Most of the code is based on the Llama2 FA2/SDPA, so it should be fairly familiar. I want to note some limitations:

- `output_attentions` does not work for FA2/SDPA - this is fairly standard.
- `head_mask` does not work for FA2/SDPA.
- `position_embedding_type` with anything other than `"absolute"` (i.e., the default) does not work for FA2/SDPA.

Additionally, I have yet to write tests & I haven't tested all ways to use these models. Instead, I've only experimented with Sentence Transformers.
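For illustration, here is a minimal sketch of what using the new attention implementations should look like from the user side, assuming the same `attn_implementation` argument that other FA2/SDPA-enabled models in transformers accept (the checkpoint name is just an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "FacebookAI/roberta-base"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# SDPA uses PyTorch's torch.nn.functional.scaled_dot_product_attention and works on CPU or GPU.
model = AutoModel.from_pretrained(model_id, attn_implementation="sdpa")

# FlashAttention-2 additionally requires a supported GPU and fp16/bf16 weights:
# model = AutoModel.from_pretrained(
#     model_id,
#     attn_implementation="flash_attention_2",
#     torch_dtype=torch.float16,
# ).to("cuda")

inputs = tokenizer("FA2/SDPA support for RoBERTa", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```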
For a small RoBERTa-based model (https://huggingface.co/sentence-transformers/all-distilroberta-v1, 82M params), I get about a 10% speedup at one sample and a ~25% speedup at a large batch size with FA2 or SDPA. For a large XLM-RoBERTa-based model (https://huggingface.co/BAAI/bge-m3, 8192 sequence length), the speedup is up to 3x with FA2. Because newer embedding models are using larger sequence lengths, FA2/SDPA will become more important for them.
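A rough sketch of the kind of Sentence Transformers comparison behind those numbers; this is not the exact benchmark script, and passing `attn_implementation` through `model_kwargs` assumes a sentence-transformers version that forwards it to `from_pretrained`:

```python
import time

import torch
from sentence_transformers import SentenceTransformer

sentences = ["This is a test sentence for benchmarking."] * 2048

for attn in ("eager", "sdpa", "flash_attention_2"):
    model = SentenceTransformer(
        "sentence-transformers/all-distilroberta-v1",
        model_kwargs={"attn_implementation": attn, "torch_dtype": torch.float16},
        device="cuda",
    )
    model.encode(sentences[:64], batch_size=64)  # warmup
    start = time.perf_counter()
    model.encode(sentences, batch_size=64)
    print(f"{attn}: {time.perf_counter() - start:.2f}s")
```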
Who can review?
@ArthurZucker @younesbelkada
Once I have a bit of a go-ahead, I can move forward with other architectures. Let me know if you'd like me to work on tests first, though. I'm also aware that the "copies" tests will currently fail due to these changes.