[hybrid performance] softmax mask fuse upper triangle #33981
Conversation
Thanks for your contribution!
LGTM
LGTM
LGTM
LGTM
PR types
New features
PR changes
OPs
Describe
Softmax mask fuse, upper-triangle variant.
We observe that, for GPT-like structures, the attention mask is always an upper-triangular matrix that masks out the upper-triangular part of the QK product.
To save the time spent building the mask and the HtoD transfer time for the mask matrix (and possibly even the time spent communicating the mask between different pipeline-parallel stages), we fuse the softmax and the upper-triangle mask into a single op.
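For reference, the unfused pattern that this op replaces looks roughly like the following sketch (the shapes, the `-1e4` additive mask value, and the use of `paddle.triu` are illustrative assumptions, not code from this PR):

```python
import paddle
import paddle.nn.functional as F

# Unfused baseline: build a causal mask, add it to the QK^T scores,
# then run a regular softmax over the last axis.
# Illustrative shape: [batch, heads, seq_len, seq_len].
scores = paddle.randn([8, 16, 128, 128], dtype='float32')

# Large negative values above the diagonal so softmax drives those
# attention weights to ~0.
mask = paddle.triu(paddle.full([128, 128], -1e4), diagonal=1)
probs = F.softmax(scores + mask, axis=-1)
```

The fused op folds the masking into the softmax kernel, so no mask tensor has to be created on the host or copied to the device.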
Without this fusion:
With this fusion:
Performance gain (Static mode)
Precision check
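One way such a check can be performed is to compare the fused op against the unfused reference above (a sketch under the same assumptions; `softmax_mask_fuse_upper_triangle` is assumed to live under `paddle.incubate`, per this PR's naming):

```python
import numpy as np
import paddle
import paddle.nn.functional as F

paddle.set_device('gpu')  # the fused kernel targets GPU
x = paddle.randn([1, 16, 128, 128], dtype='float32')

# Unfused reference: additive mask above the diagonal, then softmax.
mask = paddle.triu(paddle.full([128, 128], -1e4), diagonal=1)
ref = F.softmax(x + mask, axis=-1)

# Fused op under test.
out = paddle.incubate.softmax_mask_fuse_upper_triangle(x)

# Maximum elementwise deviation between fused and unfused results.
print(np.abs(ref.numpy() - out.numpy()).max())
```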
How to use
For dygraph:
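A minimal dygraph sketch, assuming the op is exposed as `paddle.incubate.softmax_mask_fuse_upper_triangle` and takes the 4-D attention-score tensor directly (signature details may differ from the merged code):

```python
import paddle

paddle.set_device('gpu')  # the fused kernel targets GPU

# Attention scores from QK^T: [batch, heads, seq_len, seq_len].
x = paddle.randn([1, 16, 128, 128], dtype='float32')

# One call replaces mask creation, mask addition, and softmax.
out = paddle.incubate.softmax_mask_fuse_upper_triangle(x)
print(out.shape)  # [1, 16, 128, 128]
```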
For static mode:
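A static-mode sketch under the same assumptions:

```python
import numpy as np
import paddle

paddle.enable_static()

main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()
with paddle.static.program_guard(main_prog, startup_prog):
    # Placeholder for the attention scores: [batch, heads, seq_len, seq_len].
    x = paddle.static.data(name='x', shape=[1, 16, 128, 128], dtype='float32')
    out = paddle.incubate.softmax_mask_fuse_upper_triangle(x)

exe = paddle.static.Executor(paddle.CUDAPlace(0))
exe.run(startup_prog)

x_np = np.random.randn(1, 16, 128, 128).astype('float32')
res, = exe.run(main_prog, feed={'x': x_np}, fetch_list=[out])
print(res.shape)
```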