[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting #5222
base: main
Conversation
Force-pushed from 9a9fcb0 to f1af158
looks good, few minor comments
```cpp
Value mask = copyOp.getMask();
Value other = copyOp.getOther();
auto srcTy = cast<RankedTensorType>(src.getType());
auto blockEnc = cast<BlockedEncodingAttr>(srcTy.getEncoding());
```
you can't assume the copy will use blocked layout
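A guarded version of the snippet above might look like the following sketch. This is illustrative only: it reuses the names from the quoted code and assumes the pattern can simply bail out (e.g. return `failure()`) when the source encoding is not blocked.

```cpp
// Sketch: use dyn_cast and bail out instead of asserting via cast<>,
// since the copy's src is not guaranteed to use a blocked layout.
auto srcTy = cast<RankedTensorType>(src.getType());
auto blockEnc = dyn_cast<BlockedEncodingAttr>(srcTy.getEncoding());
if (!blockEnc)
  return failure(); // unsupported layout; leave the op unchanged
```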
```cpp
// replace the asyncCopy
auto newCopyOp = rewriter.create<AsyncCopyGlobalToLocalOp>(
    copyOp.getLoc(), src, copyOp.getResult(), mask, other,
    copyOp.getCache(), copyOp.getEvict(), copyOp.getIsVolatile());
rewriter.replaceOp(copyOp, newCopyOp);
```
nit: you could do an in-place update instead of creating a new op
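An in-place variant might look like this sketch. It uses MLIR's standard `rewriter.modifyOpInPlace` hook; the `get...Mutable()` operand accessors are my assumption based on the usual ODS-generated naming, not verified against the op definition.

```cpp
// Sketch: mutate the existing op's operands under modifyOpInPlace,
// avoiding the create-new-op + replaceOp round trip.
rewriter.modifyOpInPlace(copyOp, [&] {
  copyOp.getSrcMutable().assign(src);     // accessor names assumed
  copyOp.getMaskMutable().assign(mask);
  copyOp.getOtherMutable().assign(other);
});
```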
```cpp
#include "mlir/Support/LLVM.h"
#include "mlir/Transforms/Passes.h"
#include "triton/Analysis/Utility.h"
#include "triton/Conversion/TritonGPUToLLVM/Utility.h"
```
nit: this is a bit of a layering violation, getRegToSharedLayout probably belongs to triton gpu dialect utils.
Addressed comments - moved the util.
This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See #5003 (comment).

In short, when upcasting operand A in registers prior to WGMMA, and when pipelining is enabled, `AsyncCopyGlobalToLocal`'s src gmem blocked encoding will have `sizePerThread` > the smem view's `vec` (along the contiguous dimension). This results in multiple `cp.async` instructions being generated for a contiguous segment of global data, i.e. uncoalesced loads. This was previously confirmed in ncu; see the comment above for an example.

I've added a generalized fix in a new pass that runs after the pipeliner. I reuse the logic in the LLVM lowering for `AsyncCopyGlobalToLocal` to calculate the max contiguous copy size, and compare it to the blocked encoding's `sizePerThread` along the inner (contiguous) dimension. If the former is less than the latter, I set the latter to the former.

When A is k-major, I can verify a small perf improvement and that ncu no longer reports uncoalesced loads. When A is m-major, this pass is a no-op because `copy size == sizePerThread == 16`.

ptal, thanks @ThomasRaoux