Add support for SDPA to NLLB in Huggingface Transformers #478
Labels: optimization (Model training/inferencing optimization)
Comments
ddaspit added the optimization label and removed the enhancement, pipeline 6: infer, and pipeline 4: train labels on Aug 9, 2024
PR submitted to the transformers library last week, waiting on review.
Here is the PR: huggingface/transformers#33309
Merged!
That is awesome. Good job.
Description
NLLB currently supports FlashAttention in HF Transformers. Unfortunately, FlashAttention degrades translation quality because it does not properly support padding masks. SDPA (PyTorch's torch.nn.functional.scaled_dot_product_attention) provides an alternative route for applying attention optimizations: under the hood it can dispatch to FlashAttention and Memory Efficient Attention, and Memory Efficient Attention supports masking.

There is a tracking issue for adding SDPA support to models in Transformers, and the Transformers documentation lists the currently supported models. A good example to follow would be BART, which has a full encoder-decoder architecture. It might also be useful to look at the PR that adds SDPA support to T5, another encoder-decoder model.
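To illustrate why SDPA is the more promising route, here is a minimal sketch (not the actual Transformers implementation) of an attention call through torch.nn.functional.scaled_dot_product_attention that forwards a boolean padding mask; when a mask is supplied, PyTorch chooses a backend that supports it, such as Memory Efficient Attention:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only; these are not NLLB's real dimensions.
batch, heads, seq_len, head_dim = 2, 16, 10, 64
query = torch.randn(batch, heads, seq_len, head_dim)
key = torch.randn(batch, heads, seq_len, head_dim)
value = torch.randn(batch, heads, seq_len, head_dim)

# Boolean padding mask, broadcastable to (batch, heads, seq_len, seq_len):
# True = attend, False = masked out. Pretend the last two tokens are padding.
attn_mask = torch.ones(batch, 1, seq_len, seq_len, dtype=torch.bool)
attn_mask[..., -2:] = False

# SDPA dispatches to FlashAttention, Memory Efficient Attention, or a math
# fallback. With an explicit attn_mask, it picks a backend that honors the
# mask, which is exactly what the FlashAttention 2 path fails to do for NLLB.
out = F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```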
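Once the change lands, enabling it should look roughly like the following sketch, assuming a Transformers release that includes the merged PR (huggingface/transformers#33309); the checkpoint name is just one of the public NLLB models:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Request the SDPA attention implementation explicitly when loading NLLB.
# Recent Transformers releases also select SDPA by default when available.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
)
```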