Dim order in delegates - dim order tagging and partition dim order in/out #8333
Replies: 1 comment
We have discussed this internally, and in my opinion we currently have sufficient dim_order tools and APIs to address this issue both locally (within a single delegate graph) and globally (across multiple delegate subgraphs). For context, the Edge IR requires every node in the graph (including delegate nodes, portable/optimized kernels, etc.) to support the dim_orders corresponding to NCHW and NHWC for 4D tensors, and likewise for 3D and 5D. ET core already supports this, most of the portable ops support it, and almost all delegates are compatible with it (i.e., they will fail gracefully if they can't handle it). The dialect verifier today should flag nodes with unsupported dim_orders. That said, in terms of the optimal number of permutes, here's how I envision this playing out:
Note: (1) is not in scope for the dim_order workstream; Arm and XNNPACK both have plans to implement this in H1. (2) should be trivial. Once this is done, we can look at (3). For (3), we can start with some simple heuristics, i.e. for a graph with all-XNNPACK delegates and NCHW input/output, we just need to make the first XNNPACK delegate's output NHWC and the last XNNPACK delegate's output NCHW, and so on. The same can be done with portable ops as well.
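The heuristic above can be sketched in a few lines. This is a hypothetical helper, not an existing ET API: run each delegate partition in its preferred dim order and insert a permute only at boundaries where adjacent orders disagree.

```python
# Hypothetical sketch of the boundary heuristic (not an existing ET API).
NCHW = (0, 1, 2, 3)
NHWC = (0, 2, 3, 1)

def conversion_boundaries(partition_prefs, graph_in=NCHW, graph_out=NCHW):
    """Indices of boundaries (0 = the graph-input edge) that need a permute
    when each partition runs in its preferred dim order."""
    orders = [graph_in, *partition_prefs, graph_out]
    return [i for i in range(len(orders) - 1) if orders[i] != orders[i + 1]]

# Three consecutive XNNPACK partitions that all prefer NHWC: only the first
# and last boundaries need a conversion, instead of a pair per partition.
print(conversion_boundaries([NHWC, NHWC, NHWC]))  # -> [0, 3]
```

The same scan works for mixed chains (e.g. a portable-op island between two delegates): any boundary where both sides agree costs nothing.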
Problem Statement
I'm looking at two dim-order-related use cases for the XNNPACK delegate, and am wondering if/how the recent dim order work done in core fits into this. The first use case is passing graph inputs in as channels last. This is often the format that image data is naturally in, and it saves the user from having to manually convert before passing it into ET. It also allows skipping a dim order conversion before resize when doing bilinear resize in XNNPACK, which is key to keeping memory down when taking full-size images, as ideally the "raw" input tensor should go directly into the resize. This prevents any huge internal activation tensors from being allocated.
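To make the layout point concrete, here is a minimal sketch (the helper name is hypothetical) of what a dim order means physically: it is the permutation of logical dims from outermost to innermost in memory, so an interleaved RGB image buffer is already in the (0, 2, 3, 1) / NHWC order and would need no permute on the way in.

```python
def strides_for_dim_order(sizes, dim_order):
    """Contiguous strides for logical `sizes` stored physically in `dim_order`."""
    strides = [0] * len(sizes)
    acc = 1
    for d in reversed(dim_order):  # innermost physical dim gets stride 1
        strides[d] = acc
        acc *= sizes[d]
    return strides

sizes = [1, 3, 224, 224]  # logical NCHW sizes of an RGB image batch
print(strides_for_dim_order(sizes, (0, 1, 2, 3)))  # channels-first layout
print(strides_for_dim_order(sizes, (0, 2, 3, 1)))  # channels-last layout
```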
The second use case is supporting per-op mode partitioning (one partition per operator) in XNNPACK for channels-last ops. Per-op mode already exists for NCHW ops and allows all tensors to be owned and memory-planned by ET. The XNNPACK delegate requires many vision ops, including convolutions, to run in channels-last for efficient kernels. Currently, the XNNPACK delegate asserts that all partition inputs and outputs are standard dim order / channels-first, which means dim order conversions are inserted around every conv op in per-op mode.
In per_op=False mode, there might be a single partition that looks like this:
However, with per_op=True, it looks like this. Note the excess to_copy nodes surrounding each convolution.
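The cost difference between the two modes is easy to state. In this illustration (the node names loosely follow ET's dim-order copy op; the builder itself is hypothetical), per-op mode wraps every delegated conv in its own conversion pair, while a fused partition shares one pair across all convs:

```python
def lowered_nodes(num_convs, per_op):
    """Node sequence for a chain of NHWC convs (illustration only)."""
    if per_op:
        nodes = []
        for i in range(num_convs):
            # Each single-op partition converts on the way in and out.
            nodes += ["_to_dim_order_copy(NHWC)", f"conv_{i}",
                      "_to_dim_order_copy(NCHW)"]
        return nodes
    # One fused partition: convert once at each boundary.
    return (["_to_dim_order_copy(NHWC)"]
            + [f"conv_{i}" for i in range(num_convs)]
            + ["_to_dim_order_copy(NCHW)"])

def copies(nodes):
    return sum(n.startswith("_to_dim_order_copy") for n in nodes)

print(copies(lowered_nodes(3, per_op=True)),
      copies(lowered_nodes(3, per_op=False)))  # -> 6 2
```

So conversions grow linearly with the number of convs in per-op mode (2n vs. 2), which is the overhead the tagging proposal below aims to remove.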
Potential Solutions
Because these issues involve dim order across partition boundaries, they are difficult to solve solely in the delegate. That's where I hope the framework's dim order support can help, or be extended to help. As a backend author, I'd ideally like to be able to tag each partition with an input (and output) dim order, and then have a post-delegation pass insert the appropriate dim order conversions.
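A sketch of that proposed mechanism, with all names hypothetical (this is not an existing ET API): each partition carries the dim order it expects on its inputs and produces on its outputs, and a post-delegation pass walks a linear chain of partitions, inserting a conversion node only where the producer's order disagrees with the consumer's tag.

```python
from dataclasses import dataclass

NCHW, NHWC = (0, 1, 2, 3), (0, 2, 3, 1)

@dataclass
class Partition:
    name: str
    in_order: tuple = NCHW
    out_order: tuple = NCHW

def insert_dim_order_conversions(chain, graph_in=NCHW, graph_out=NCHW):
    """Lower a linear chain of tagged partitions to a flat node sequence."""
    nodes, cur = [], graph_in
    for p in chain:
        if cur != p.in_order:  # mismatch at this boundary: convert
            nodes.append(f"to_dim_order_copy{p.in_order}")
        nodes.append(p.name)
        cur = p.out_order
    if cur != graph_out:  # restore the graph's output order if needed
        nodes.append(f"to_dim_order_copy{graph_out}")
    return nodes

# Three NHWC-tagged XNNPACK partitions need only two conversions in total.
chain = [Partition(f"xnnpack_conv_{i}", NHWC, NHWC) for i in range(3)]
print(insert_dim_order_conversions(chain))
```

A real pass would operate on the FX graph with fan-out rather than a linear chain, but the per-edge rule is the same: compare the producer's tagged output order with the consumer's tagged input order.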
On a second note, for better dim order support, it would be nice to be able to get the dim order of a tensor from metadata, using the new dim order facilities, rather than just looking at the tensor's memory_format. Ideally it would be automatically populated during graph retracing so that it stays up to date after each pass. Perhaps some sort of dim order spec in the node meta.
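As an illustration of what such a spec could hold (this helper is not an existing ET API), a dim order can be recovered from a tensor's sizes and strides and recomputed after every retrace. One caveat: size-1 dims make the order ambiguous in general, and this sketch just sorts by stride.

```python
def dim_order_from_strides(sizes, strides):
    """Dims sorted outermost-first (largest stride first)."""
    return tuple(sorted(range(len(sizes)),
                        key=lambda d: strides[d], reverse=True))

# A channels-last 1x3x224x224 tensor reports the NHWC dim order.
print(dim_order_from_strides([1, 3, 224, 224], [150528, 1, 672, 3]))
```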
Are either or both of these currently possible with the ET dim order work? If not, is there any objection to adding the partition dim order tagging and post-delegation dim order conversion pass proposed above? I don't have a full technical proposal yet - I'm mostly interested in understanding whether there is any existing machinery to meet this need, and if not, whether my proposed approach is reasonable.
CC @Gasoonjia @mcr229 @digantdesai