-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ConvertToDma] Add options to tranpose dma dimensions on target side #812
Conversation
I would prefer a separate pass for this. The main reason is to allow for better optimization. The heurisitic "do transpose on side with fewer dims" isn't necessarily best. For example, after canonicalization, the side with fewer dims might change (contiguous dimensions can be eliminated). A secondary reason is to keep ConvertToDma as simple as possible (although this PR doesn't add much complexity). What I have in mind is a pass
which basically converts from one form to the other. I haven't thought about how hard that is. That said, this change is nice and small so I'm happy to accept this. |
@@ -387,32 +403,35 @@ void AMDAIEConvertToDmaPass::runOnOperation() { | |||
// step. This is easy to implement, but not the most direct lowering, so | |||
// we might want to revisit this. | |||
WalkResult convertCopiesWalkResult = | |||
getOperation()->walk([&rewriter](linalg::CopyOp copyOp) { | |||
getOperation()->walk([&](linalg::CopyOp copyOp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change? My preference is
[&rewriter] > [&] > [&, this]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's need to pass &rewriter
. And I'm trying to follow the coding styles in llvm-project which just pass [&] in such walking functions, e.g. https://github.com/llvm/llvm-project/blob/e1e788f423b5c780c40912ab102b0a3c4b92b9de/mlir/lib/Dialect/SCF/Transforms/ForallToFor.cpp#L63
The goal of this PR is not trying to balance the dma dimensions. I think there's no reason to keep all transposed dimensions on the source side as the original codes does, because it is just one of the four combinations for dma generation (packTransposeOnSource, packTransposeOnTarget, unpackTransposeOnSource, unpackTransposeOnTarget). I don't see any benefit to convert the dma with all transposed dimensions on source side and then use another pass to make the source side continuous and move the strided pattern on the target side (as what I did previously in #792). |
The case I have in mind is a dma copy between 2 memrefs, one which is contiguous and one which is not. It might be better then to do the transposes on the side which is contiguous, because the side that is not contiguous will already need DMAs, because the dimensions can't collapse. I'm confident I can write down examples of packs and unpacks where the transpose must be done on either source or target side to ensure you don't run out of dimensions... and probably cases with a mix where 1 dimension is transposed on source and 1 is on target. I just think that we might need a separate pass with better analysis after this PR lands. But for now, I'm ok with this if it gets us some of the way there. |
I am not convinced that we should create a separate pass for rebalancing. Ideally, we do it here in pack to DMA conversion as the logic seems the most straightforward at this point. The alternative of a separate pass gets quite complex as @yzhang93 pointed out: #792. To accommodate different 'strategies' within this pack to dma conversion pass we can work with different options, like what @yzhang93 did, and I think that will get us quite far already. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
Introduced and unaddressed in #812
) 1. This PR relaxes the condition for circular dma ops loop subsumption, so that npu.circular_dma_cpy_nd ops can be hoisted out of the loop even if there is other npu.dma_cpy_nd user of the same connection op after it. 2. With this change, we can further subsume loops and hoist npu.dma_cpy_nd ops out of the loop. This PR makes use of #812 and brings the dma optimizations in Passes.cpp.
… side (#850) Previous PR #812 added support to transpose dma dimensions on the target side for control code optimization. However, this new dma addressing wasn't supported in `SplitLogicalObjectFifosForConnectionReuse` pass. This PR keeps the original logic and most of the original codes while adding the support for the new dma format.
Pack/unpack ops change the data layout and thus after converting to dma ops, the dma addressing dimensions are expanded/collapsed and transposed. Previously, all the dimension transpositions are on the source side of dma ops. This PR extends the usage to have an option for transposition happen on the target side.
In applications, we could make choices of transposition on source or target for pack or unpack ops based on performance and hardware dma requirements, etc. The motivation comes from this discussion, and this PR moves the dma optimization logic to an early pass where the dma ops are converted.
Note the default options are not changed in this PR (will enable it in a separate PR with other changes for dma optimization), but I have tested all four combinations locally to make sure the dma generations are correct and work e2e. The change of options can be added for example as