-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pass to transfer the strided access pattern from L3 to L2 #792
Conversation
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
return emitError(rewriter.getUnknownLoc()) | ||
<< "failed to get dim position for combination"; | ||
} | ||
size_t dimForCombine = isCombinable.value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, you're looking for a single dimension that can be combined with the innermost dimension? Ideally, this would work for multiple dimensions as well. for example:
[[0, 0, 0, 0] [3, 2, 32, 32] [64, 32, 128, 1]]
should become:
[[0, 0] [32, 192] [128, 1]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this case should be considered. However, I'd leave out such change in this revision until the logic of new sizes/strides (see the other comments below) is confirmed correct.
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Outdated
Show resolved
Hide resolved
compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/AMDAIETransferStridedAccessPattern.cpp
Show resolved
Hide resolved
auto getNewL2Strides = [&](SmallVector<int64_t> values) { | ||
SmallVector<OpFoldResult> res = {getAsIndexOpFoldResult(ctx, 1)}; | ||
int64_t initial = values.back(); | ||
// Leave out one dimension for insertion afterwards | ||
for (size_t i = values.size() - 2; i > 0; i--) { | ||
initial *= values[i]; | ||
res.push_back(getAsIndexOpFoldResult(ctx, initial)); | ||
} | ||
return llvm::to_vector(llvm::reverse(res)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how to calculate the new strides? I don't understand how it's just initial *= values[i];
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take the following dma ops for example:
%45 = amdaie.npu.circular_dma_cpy_nd %8([0] [2048] [1], [] [] [])
%46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0, %41] [4, 2, 32, 32] [4096, 32, 128, 1])
The logic is to create L2 side strides from the innermost dimension, and then reverse the vector to have the final order. The new L2 side strides always start with [1], and should have the same number of dimensions as the original L3 side source addressing. The next dimensions are calculated by the logic initial *= l3OrigSizes[i]
.
The initial
means the innermost continuous elements which is l3OrigSizes[-1]* l3OrigStrides.back[-1]
(the implementation omit l3OrigStrides[-1] because l3OrigStrides[-1] == 1). The combined elements are now continuous on L3 side, but should have a strided addressing on L2 side, the stride should be initial * l3OrigSizes[-2]
. So after this iteration, the strides are [1, 32 * 32].
Same logic for the next iteration, the strides become [1, 32 * 32, 32 * 32 * 2] = [1, 1024, 2048]. After reversion, it's [2048, 1024, 1]. At last insert the stride for the position of the combined dimension (e.g., index 1 in this example), which is l3OrigStride[dimForCombine]
, and get final strides [2048, 32, 1024, 1].
Let me know if this is the correct logic to get the L2 side strides, or if there's a better way to calculate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The combined elements are now continuous on L3 side, but should have a strided addressing on L2 side, the stride should be initial * l3OrigSizes[-2].
Yeah, I think the core idea is good: i.e. that the strides should be created as if the original L3 was contiguous and then rearranged based on which dimension(s) are combined with the innermost one.
However, I do think this needs extensive tests to ensure correctness as this will otherwise lead to hard-debug numerical errors in the future. So, it would be good to create a standalone utility function that takes in a set of static offsets/sizes/strides and produces the new static L3 and L2 offsets/sizes/strides, so that it can be tested standalone (ctest, not lit) on a lot of different cases, see for example: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/AMDAIEDmaUtilsTest.cpp
circularDma.getTargetMixedStrides(); | ||
|
||
// Change the source/target addressing of all users from a connection op. | ||
for (Operation *user : connectionOp->getUsers()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if different NPU DMA users have different strides/sizes/offsets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we would have such cases. I looked through the current IR, and only find the case that the connection op has multiple NpuDmaCpyNdOp users (just with different offsets) and one NpuCircularDmaCpyNdOp user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do see it with peeled matmul and regardless, we should check for it and return/abort.
This PR aims to improve the control code and thus matmul performance by moving some dma instructions from L3 to L2 side. The detail of the idea is in #764 (comment).
This pass has not been plugged into e2e passes. Some change of
AMDAIEDmaLoopSubsumption
is needed to make it fully work. Ideally, I'm thinking to make it as a pattern insideAMDAIEDmaComposition
and running all the npu dma optimization at once instead of running as a sequence ofDmaComposition -> TransferStridedAccessPattern -> CanonicalizatizeDoublyStridedOp -> DmaSubsumption
.