Improve matmul instruction scheduling with loop rotation #2488
assert(ind >= 0);
assert(ind <= max_ind);
Unrelated to this PR, but asserting different conditions separately provides a better error message. (The line number in the error message will tell me which is violated).
naoyam left a comment
LGTM
- apply improvement in matmul instruction scheduling with loop rotation
Introduction
Loop rotation is a lowering pass that transforms
into
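As a rough, generic illustration of the idea (plain C++, not the kernel IR this pass operates on): the leading statement of the loop body is executed once for iteration 0 before the loop, and inside the loop it is recomputed for iteration `i + 1` after the rest of the body has run. The function names `original` and `rotated`, and the statements inside them, are made up for this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical loop before rotation: every iteration runs a leading
// statement ("head") and then the rest of the body, which consumes it.
std::vector<int> original(int n) {
  std::vector<int> out;
  for (int i = 0; i < n; i++) {
    int head = i * 2;         // statement selected for rotation
    out.push_back(head + 1);  // rest of the loop body
  }
  return out;
}

// The same loop after rotation: the head of iteration 0 is hoisted in
// front of the loop, and each iteration ends by computing the head of
// the next iteration (guarded so it never runs past the last one).
std::vector<int> rotated(int n) {
  std::vector<int> out;
  int head = 0;
  if (0 < n) {
    head = 0 * 2;  // head of iteration 0, pulled out of the loop
  }
  for (int i = 0; i < n; i++) {
    out.push_back(head + 1);  // rest of the body uses the precomputed head
    if (i + 1 < n) {
      head = (i + 1) * 2;  // head of iteration i + 1
    }
  }
  return out;
}
```

Both functions produce the same sequence for any `n`; only the position of the leading statement relative to the loop boundary changes, which is what lets a load for the first iteration sit outside the main loop.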
In the matmul kernel, both the `cp.async` and the `ld.matrix` are circular/double buffered. This PR applies loop rotation to the matmul main loop to pull the first iteration's `ld.matrix` out of the main loop of `cp.async`.

That is, to change the code from
to
To do so, I need to reorder the matmul schedule from
to
This is because in the first schedule, the loop structure is
where inside the `cp.async` circular buffer loop, the entire `ld.matrix` -> `mma` sequence is contained in the `threadIdx` trivial loop, so the `ld.matrix` is not separable.

In contrast, for the second schedule, we have
The `blockIdx` and `threadIdx` loops are trivial loops, so this schedule change actually doesn't affect the generated CUDA kernel. However, it does make the kernel IR easier to deal with.

Benchmark
Using command
Before this PR:
After this PR: