[EXIR] Register _clone_dim_order op and map aten.clone #12971
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12971
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV. If your PR is affected, please view them below.
✅ No Failures — as of commit 6839212 with merge base 49bc664.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM! Left minor feedback about test coverage!
```diff
@@ -389,3 +390,17 @@ def test_mobilenet_v3_xnnpack(self) -> None:
                 rtol=1e-3,
             ),
         )
+
+    def test_op_clone_dim_order_registration(self):
```
Can you follow the approach of the other test cases (e.g., test_op_dim_order_propagation) and reuse the memory_format_test_runner function to test at least two different scenarios: one that maintains the dim order and one that converts it?
I can definitely add more tests; I didn't add them here because the kernel implementation wasn't in place for this PR. I'll work on adding additional coverage.
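For reference, a hedged sketch of the two scenarios being requested; the real tests should reuse memory_format_test_runner as in test_op_dim_order_propagation, whose exact signature is not shown in this thread, so these module definitions are only illustrative:

```python
import torch


class CloneKeepLayout(torch.nn.Module):
    # Scenario 1: clone without a memory_format argument, so the dim order is maintained.
    def forward(self, x):
        return x.clone()


class CloneToChannelsLast(torch.nn.Module):
    # Scenario 2: clone that converts the dim order (contiguous -> channels_last).
    def forward(self, x):
        return x.clone(memory_format=torch.channels_last)
```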
@pytorchbot label "release notes: none"
@Gasoonjia after splitting the PRs, the CI failures are mainly due to two issues. Do you have any suggestions for how we should handle the current approach of mapping aten.clone to _clone_dim_order?
Hi @keyprocedure, thanks for your update! For the places checking the op, can you share more insight here? I think you've created the clone_dim_order kernel on the AOT side by reusing the aten clone op.
@Gasoonjia during export, aten.clone gets mapped to _clone_dim_order, so the end-to-end flow needs the runtime kernel to exist. This is also why I placed the tests that use pybindings in the kernel PR. What are your thoughts?
@keyprocedure thanks for the detailed reply! Let's swap the order of the PRs: first finish the runtime op, then the AOT replacement, and finally RemoveCloneOpsTransform.
That's a great idea! Should I move the tests from the runtime/kernel PR (since the new op won't be registered yet) to the aot replacement PR?
Yes please! More specifically: the runtime PR should only contain the kernel implementation and its runtime tests; the AOT PR should contain both the graph-level checks and the end-to-end tests using pybindings.
Hi @Gasoonjia, I've updated the PRs and the kernel PR is ready for review. The CI failures from this PR that weren't due to the missing kernel seem to share the same root cause. Any insights or guidance would be helpful.
@keyprocedure for the issue in this PR, can you share a code pointer to it? Like which line is raising that?
There were a total of 49 failing tests in the initial CI run, so I listed the test and line number of each failure:
- unittest-arm-backend-with-no-fvp (test_pytest_ops) / linux-job: 19 failures
- unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job: 30 failures

The CI log also shows multiple TOSA partitioning rejections due to _clone_dim_order not being in BaseTOSASupportList.
I think we should land the kernel PR before addressing any new failures from the latest CI run in this PR, since this PR depends on the kernel existing for _clone_dim_order.
Sure, let's land the PRs one by one.
### Summary
This is PR 1 of 3 implementing a dim order aware clone op.

Currently, clone ops are removed during export as no-ops, causing memory layout (dim order) changes to be lost. This can cause backend failures, incorrect outputs when ops expect specific layouts, and performance degradation.

This set of PRs introduces a dim order aware clone op, `_clone_dim_order`, which preserves memory layout changes by explicitly storing dim order information. This is implemented by replacing standard clone ops with this variant during export and updating the clone removal transform to preserve clones that change layout.

This PR adds the portable CPU kernel for the `_clone_dim_order` op, implementing a clone variant that preserves dim order at runtime. The portable kernel validates dtype and layout compatibility, resizes the output tensor if needed, and performs an element-wise clone of the tensors.

Note: A future PR will add the ATen kernel for `_clone_dim_order`.

Related PRs:
- PR 2: [#12971](#12971) - Register `_clone_dim_order` op and map `aten.clone`
- PR 3: [#12976](#12976) - Update RemoveCloneOpsTransform to be dim_order aware

Fixes #12645

### Test plan
Added kernel runtime tests to verify:
- Tensors of all real dtypes are cloned correctly.
- Failure when input and output tensor shapes mismatch.
- Failure with unsupported memory formats.
- Failure when `non_blocking=true` since the portable kernel only supports blocking data transfer.
- Dynamic shape outputs are cloned with correct values.
- Layout conversions are cloned correctly for `contiguous` to `channels_last`, `channels_last` to `contiguous`, and `channels_last` is preserved.

All runtime tests pass via: `build-ninja/kernels/test/portable_kernels_test`

---------

Co-authored-by: Gasoonjia <gasoonjia@meta.com>
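As a quick illustration of the layout information the kernel preserves (plain PyTorch, not the kernel itself; assumes a PyTorch version that exposes Tensor.dim_order()):

```python
import torch

x = torch.randn(1, 3, 8, 8)                     # contiguous, dim order (0, 1, 2, 3)
y = x.clone(memory_format=torch.channels_last)  # NHWC layout, dim order (0, 2, 3, 1)

print(x.dim_order())  # (0, 1, 2, 3)
print(y.dim_order())  # (0, 2, 3, 1)
```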
@digantdesai it looks like most of the CI failures were due to _clone_dim_order not yet being handled by the Arm backend.
LGTM, though there are some Arm questions; maybe we need to wait for @digantdesai for the final decision.
```diff
@@ -14,7 +14,11 @@ class RemoveClonePass(ExportPass):
     """Remove all clones from graph_module"""

     def call_operator(self, op, args, kwargs, meta):
-        if op != exir_ops.edge.aten.clone.default:
+        clone_ops = (
+            exir_ops.edge.aten.clone.default,
```
we should be removing this when producing the edge dialect graph, no?

aten.clone will be replaced only when _skip_dim_order = False. If we do not force Arm to turn on dim order ops, I would love to keep @keyprocedure's work.
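For context, one plausible completion of the hunk above; the second entry of clone_ops and the pass body are assumptions inferred from the edge-dialect op name used elsewhere in this PR, not the final Arm code:

```python
from executorch.exir.dialects._ops import ops as exir_ops
from executorch.exir.pass_base import ExportPass


class RemoveClonePass(ExportPass):
    """Remove clone ops from the graph."""

    def call_operator(self, op, args, kwargs, meta):
        clone_ops = (
            exir_ops.edge.aten.clone.default,
            exir_ops.edge.dim_order_ops._clone_dim_order.default,  # assumed second entry
        )
        if op not in clone_ops:
            return super().call_operator(op, args, kwargs, meta)
        # Returning the clone's input forwards its value and drops the node.
        return args[0]
```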
```diff
@@ -38,7 +38,7 @@
 ]
 linear_residual_exir_op: list[str] = [
     "executorch_exir_dialects_edge__ops_aten_gelu_default",
-    "executorch_exir_dialects_edge__ops_aten_clone_default",
+    "executorch_exir_dialects_edge__ops_dim_order_ops__clone_dim_order_default",
```
Removing aten.clone here because the corresponding test is dim-order only?

@digantdesai will Arm use dim order mandatorily? If we want to continue supporting both aten.clone and dim_order.clone, let's keep the original test while adding a new test for the dim_order case.

Same as other tests.
Since ARM's pass manager doesn't seem to pass the _skip_dim_order flag, it defaults to False, so my strategy was to replace all instances of aten.clone with _clone_dim_order in the tests. Although it might make more sense to explicitly set _skip_dim_order=True in the EdgeCompileConfig calls in arm_tester and undo my changes to the ARM test files. Then I can either continue with adding _clone_dim_order support and tests (following Gasoonjia's suggestion) in TOSA for future use or leave it out for now.
Yeah, my point is that we should make our update consistent with Arm's target. That is, if we expect dim-order-only in the Arm backend, there should be no aten.clone in our PR and we should remove aten.clone support. However, if we need to support dim_order both on and off, we should test both cases for the clone operator, though I'm OK with splitting that into several PRs for better structure.
That makes sense, thanks for breaking down the options.
FYI see 135e875. Also cc @oscarandersson8218 on what happens if there is no aten.clone support.
@digantdesai thanks for looking into this and sharing the enable dim_order PR. It looks like the best strategy is to replace support for aten.clone with _clone_dim_order in ARM since the _skip_dim_order flag will always be False, following @Gasoonjia's suggestion.
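For reference, the flag in question lives on EdgeCompileConfig; a minimal sketch of the two settings being discussed (its default of False is what makes the dim order variant appear):

```python
from executorch.exir import EdgeCompileConfig

# Default behaviour: dim order ops are used, so aten.clone becomes _clone_dim_order.
dim_order_config = EdgeCompileConfig(_skip_dim_order=False)

# Opt-out path discussed above: keep plain aten ops (no _clone_dim_order rewrite).
aten_only_config = EdgeCompileConfig(_skip_dim_order=True)
```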
```diff
@@ -67,6 +68,28 @@ def _to_dim_order_copy(context, node):
     to(context, node)


+@register_torch_op(
```
FYI @metascroy
This change looks OK to me, but do we have a test that covers it?
I added a test for _clone_dim_order here. Is this what you had in mind?
Looks good
@keyprocedure Nice change and thanks for updating our backend!
```python
@register_node_visitor
class CloneDimOrderVisitor(NodeVisitor):
```
If RemoveClonePass works as it should, there should be no need for a NodeVisitor, as all clones should be removed by the time we end up here. Also, this node_visitor is not imported in __init__.py and is therefore not registered.
Thanks for pointing this out, I'll remove this NodeVisitor. Just to confirm, if a clone changes dim_order, should the RemoveClonePass still be expected to remove it?
I think the check in clone_dim_order_support.py will make sure that we don't partition a clone that changes dim_order, so it seems OK to me.
To expand on this, the plan for dim_order handling in the Arm backend is to ignore all delegated dim_order ops (whether or not they change the dim_order) and just make sure that the input/output of the delegated sub-graph is correct, since we do our own dim_order handling in https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/annotate_channels_last_dim_order_pass.py. The dtype casting of the to_dim_order_copy is kept though.

We will not delegate partitions containing single dim_order ops, to allow the use of dim_order ops in non-delegated sub-graphs, similar to this PR: #12995
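For illustration, a minimal sketch (not the actual clone_dim_order_support.py code) of the kind of check being described, i.e. only treating a clone as partitionable if it keeps the input's dim order:

```python
import torch
from torch.fx import Node


def keeps_dim_order(node: Node) -> bool:
    # meta["val"] holds the FakeTensor recorded during export for both the
    # clone's input and its output; compare their dim orders.
    input_val = node.all_input_nodes[0].meta["val"]
    output_val = node.meta["val"]
    return input_val.dim_order() == output_val.dim_order()
```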
I really appreciate the context and resources on how dim_order ops are handled. This also helps with diagnosing the graph delegation CI failures.
Hi @keyprocedure, it looks like there are still several issues on CI. Do you mind taking a look?
Hi, yeah, currently working on them. Appreciate the check-in :)
Hi @Gasoonjia, can we try running CI again? I was able to configure the TOSA test setup locally, and most of the test failures were due to the input dtype gating I had in CloneDimOrderSupport, which caused clone ops to not be partitioned, leaving them in the subgraph and failing the tester.check_not tests.

@keyprocedure Really appreciate your work mate! Started CI, let's see how things work!

@Gasoonjia it looks like we're good! The 3 CI failures seem unrelated.
), "Only contiguous memory format is supported in CoreML" | ||
|
||
# Since CoreML only supports contiguous format, no dim_order preservation is needed. Treat this as a no-op clone. | ||
noop(context, node) |
@keyprocedure Why not just call clone(context, node) here? It will be equivalent to noop now, but let's use clone if that's what the op is.
Just in case coremltools updates clone to be something other than a noop in future?
That makes a lot of sense! But I don't see clone exposed in coremltools for import, it looks like it’s only handled as an alias to noop.
I guess this is OK then
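Putting the fragments in this thread together, the Core ML registration might look roughly like the sketch below; the import paths and the torch_alias string are assumptions, and noop is the existing coremltools handler referenced above:

```python
from coremltools.converters.mil.frontend.torch.ops import noop
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op


@register_torch_op(torch_alias=["dim_order_ops::_clone_dim_order"])
def _clone_dim_order(context, node):
    # CoreML only supports contiguous tensors, so there is no dim_order to
    # preserve here; treat the op as a plain no-op clone.
    noop(context, node)
```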
ci gogogo!

CI passed. PR LGTM. Thanks for your contribution @keyprocedure! @digantdesai @oscarandersson8218 @metascroy please let me know if you have any concerns, or I will stamp it.
Merged! @keyprocedure

Appreciate all the feedback and reviews, glad we got this landed!
This reverts commit 130cafc.
Sorry to be the bearer of bad news, but this broke some trunk tests. Let's revert, and @keyprocedure can you try to get this in again with the additional label so the trunk tests run?

@jackzhxng stamped the revert PR.

Thanks for the update @jackzhxng, I'll resubmit the PR with the additional label.

money 🤑
### Summary
This is PR 2 of 3 implementing a dim order aware clone op.

This PR registers the new `_clone_dim_order` op and maps `aten.clone` ops to `dim_order_ops._clone_dim_order` in EXIR during export to preserve memory layout changes (contiguous/channels_last). It also updates the Core ML and ARM backends to handle the new clone op.

Related PRs:
- PR 1: `_clone_dim_order` portable kernel

Fixes #12645

### Test plan
All tests pass via:
- `python -m unittest exir.tests.test_memory_format_ops_pass`
- `python -m unittest backends.apple.coreml.test.test_torch_ops`
- `pytest backends/arm/test/ops/test_clone.py`
- `pytest backends/arm/test/passes/test_remove_clone_pass.py`
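For completeness, a hedged sketch of the kind of graph-level check described in the summary; the module and input shape are illustrative, while torch.export.export and to_edge are the standard export entry points:

```python
import torch
from executorch.exir import to_edge


class CloneModule(torch.nn.Module):
    def forward(self, x):
        return x.clone(memory_format=torch.channels_last)


ep = torch.export.export(CloneModule(), (torch.randn(1, 3, 8, 8),))
edge = to_edge(ep)

# After the mapping in this PR, the edge graph should contain the dim order
# variant instead of a plain aten.clone node.
graph_code = edge.exported_program().graph_module.code
assert "_clone_dim_order" in graph_code
```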