[Attention] Tune CUTLASS MLA num_splits #26846
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Code Review
This pull request introduces a new heuristic for determining num_splits in CUTLASS MLA to improve performance. The new logic is based on the ratio of sequence length to batch size. While this is a reasonable approach for performance tuning, my review has identified a critical concern. The change removes a safeguard that was in place to prevent kernel hangs when the batch size is greater than one. Reintroducing this hang would be a critical issue, and it's not clear from the pull request description if the underlying problem has been resolved. I have left a comment detailing this concern.
💡 Codex Review
Here are some automated review suggestions for this pull request.
LucasWilkinson left a comment
LGTM! thanks for doing this!
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
Tune the num_splits heuristic for CUTLASS_MLA to achieve some speedup now that #26026 has fixed the hang. Based on experiments performed using the tools introduced in #26835, this is the optimal num_splits policy:
Following the optimal policy would yield this speedup:
As a simpler alternative, we implement a heuristic yielding the following policy:
This results in the following speedup:
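For readers without the plots, here is a minimal sketch of what a ratio-based `num_splits` heuristic can look like. It is not the code merged in this PR: the function name `pick_num_splits`, the `tokens_per_split` threshold, and the `max_splits` cap are all hypothetical values chosen for illustration. It only shows the general shape described above, where the split count grows with sequence length and shrinks as batch size grows.

```python
def pick_num_splits(
    seq_len: int,
    batch_size: int,
    tokens_per_split: int = 4096,  # hypothetical threshold, not from the PR
    max_splits: int = 16,          # hypothetical cap, not from the PR
) -> int:
    """Illustrative heuristic: choose num_splits from seq_len / batch_size.

    Long sequences with small batches leave the GPU underutilized, so the
    KV dimension is split into more chunks. Large batches already supply
    enough parallel work, so fewer splits are used.
    """
    if seq_len <= 0 or batch_size <= 0:
        raise ValueError("seq_len and batch_size must be positive")

    # Integer form of (seq_len / batch_size) // tokens_per_split.
    splits = seq_len // (batch_size * tokens_per_split)

    # Clamp to the range [1, max_splits].
    return max(1, min(max_splits, splits))


if __name__ == "__main__":
    for seq_len, batch_size in [(1024, 8), (32768, 2), (131072, 1)]:
        print(seq_len, batch_size, pick_num_splits(seq_len, batch_size))
```

A closed-form rule like this is cheaper to maintain than a lookup table of the measured optimal policy, at the cost of leaving some speedup on the table where the two disagree.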