Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass to transfer the strided access pattern from L3 to L2 #792

Closed
wants to merge 4 commits into from

Conversation

yzhang93
Copy link
Contributor

@yzhang93 yzhang93 commented Sep 20, 2024

This PR aims to improve the control code and thus matmul performance by moving some dma instructions from L3 to L2 side. The detail of the idea is in #764 (comment).

This pass has not been plugged into e2e passes. Some change of AMDAIEDmaLoopSubsumption is needed to make it fully work. Ideally, I'm thinking to make it as a pattern inside AMDAIEDmaComposition and running all the npu dma optimization at once instead of running as a sequence of DmaComposition -> TransferStridedAccessPattern -> CanonicalizatizeDoublyStridedOp -> DmaSubsumption.

return emitError(rewriter.getUnknownLoc())
<< "failed to get dim position for combination";
}
size_t dimForCombine = isCombinable.value();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, you're looking for a single dimension that can be combined with the innermost dimension? Ideally, this would work for multiple dimensions as well. for example:

[[0, 0, 0, 0] [3, 2, 32, 32] [64, 32, 128, 1]]

should become:

[[0, 0] [32, 192] [128, 1]]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this case should be considered. However, I'd leave out such change in this revision until the logic of new sizes/strides (see the other comments below) is confirmed correct.

Comment on lines +218 to +227
auto getNewL2Strides = [&](SmallVector<int64_t> values) {
SmallVector<OpFoldResult> res = {getAsIndexOpFoldResult(ctx, 1)};
int64_t initial = values.back();
// Leave out one dimension for insertion afterwards
for (size_t i = values.size() - 2; i > 0; i--) {
initial *= values[i];
res.push_back(getAsIndexOpFoldResult(ctx, initial));
}
return llvm::to_vector(llvm::reverse(res));
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain how to calculate the new strides? I don't understand how it's just initial *= values[i];?

Copy link
Contributor Author

@yzhang93 yzhang93 Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take the following dma ops for example:

%45 = amdaie.npu.circular_dma_cpy_nd %8([0] [2048] [1], [] [] [])
%46 = amdaie.npu.dma_cpy_nd %8([] [] [], %31[0, 0, 0, %41] [4, 2, 32, 32] [4096, 32, 128, 1])

The logic is to create L2 side strides from the innermost dimension, and then reverse the vector to have the final order. The new L2 side strides always start with [1], and should have the same number of dimensions as the original L3 side source addressing. The next dimensions are calculated by the logic initial *= l3OrigSizes[i].

The initial means the innermost continuous elements which is l3OrigSizes[-1]* l3OrigStrides.back[-1] (the implementation omit l3OrigStrides[-1] because l3OrigStrides[-1] == 1). The combined elements are now continuous on L3 side, but should have a strided addressing on L2 side, the stride should be initial * l3OrigSizes[-2]. So after this iteration, the strides are [1, 32 * 32].

Same logic for the next iteration, the strides become [1, 32 * 32, 32 * 32 * 2] = [1, 1024, 2048]. After reversion, it's [2048, 1024, 1]. At last insert the stride for the position of the combined dimension (e.g., index 1 in this example), which is l3OrigStride[dimForCombine], and get final strides [2048, 32, 1024, 1].

Let me know if this is the correct logic to get the L2 side strides, or if there's a better way to calculate this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The combined elements are now continuous on L3 side, but should have a strided addressing on L2 side, the stride should be initial * l3OrigSizes[-2].

Yeah, I think the core idea is good: i.e. that the strides should be created as if the original L3 was contiguous and then rearranged based on which dimension(s) are combined with the innermost one.

However, I do think this needs extensive tests to ensure correctness as this will otherwise lead to hard-debug numerical errors in the future. So, it would be good to create a standalone utility function that takes in a set of static offsets/sizes/strides and produces the new static L3 and L2 offsets/sizes/strides, so that it can be tested standalone (ctest, not lit) on a lot of different cases, see for example: https://github.com/nod-ai/iree-amd-aie/blob/main/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/test/AMDAIEDmaUtilsTest.cpp

circularDma.getTargetMixedStrides();

// Change the source/target addressing of all users from a connection op.
for (Operation *user : connectionOp->getUsers()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if different NPU DMA users have different strides/sizes/offsets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we would have such cases. I looked through the current IR, and only find the case that the connection op has multiple NpuDmaCpyNdOp users (just with different offsets) and one NpuCircularDmaCpyNdOp user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do see it with peeled matmul and regardless, we should check for it and return/abort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants