Simplify the operand layout support of conv2d and pooling 2d operations #324
This CL implements the DefineXnnNodeForConv2d() method for the WebNN conv2d MLOperator that defines an XNNPACK conv2d or depthwise conv2d Node according to MLConv2dOptions. This CL only supports the input layout “nhwc”, filter layout “ohwi” for regular conv2d, and filter layout “ihwo” for depthwise conv2d. Other input and filter layouts could be supported by inserting transpose operators later. There is also another proposal [1] that suggests simplifying the layout support. The implementation will be updated according to the WG’s consensus. For unit tests, this CL implements the Conv2dTest of MLGraphXnnpackTest that tests conv2d and depthwise conv2d, with and without a fused bias operand and relu activation. [1]: webmachinelearning/webnn#324 Bug: 1273291 Change-Id: I8a70ee7bb053b386e12ff46e67d139683b044383 Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4031876 Reviewed-by: Jiewei Qian <qjw@chromium.org> Commit-Queue: ningxin hu <ningxin.hu@intel.com> Cr-Commit-Position: refs/heads/main@{#1093328}
This CL implements the DefineXnnNodeForPool2d() method that defines XNNPACK pooling Nodes for the averagePool2d MLOperator by xnn_define_average_pooling_2d() or xnn_define_global_average_pooling_2d(), and for the maxPool2d MLOperator by xnn_define_max_pooling_2d(). Similar to conv2d, this CL only supports the “nhwc” input layout. The “nchw” input layout could be supported by inserting transpose operators later. There is a proposal [1] that suggests simplifying the layout support. The implementation will be updated according to the WG’s consensus. To consolidate the setting and calculation of padding sizes, this CL also introduces the GetXnnPadding2D() helper that is shared by both XNNPACK convolution 2d and pooling 2d Nodes. For unit tests, this CL implements the Pool2dTest of MLGraphXnnpackTest that tests both kinds of pooling operators, including the global average pooling variant. [1]: webmachinelearning/webnn#324 Bug: 1273291 Change-Id: I16ac7260b96762078f6fb00997504e7ef32067da Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4031307 Commit-Queue: ningxin hu <ningxin.hu@intel.com> Reviewed-by: Jiewei Qian <qjw@chromium.org> Cr-Commit-Position: refs/heads/main@{#1095962}
This issue was discussed at the WebML WG Teleconference – 16 March 2023. Summary: Awaits further implementation feedback.
Picking just one preferred layout in WebNN could make life easier for the calling framework and the underlying backend implementation, or it could make it harder for both:
I prefer accepting both (keeping the current spec), but it would be informative to see a holistic table of each major framework's preferred layout and each backend's preferred layout. [updated...] Table added (✅ == default):
@fdwr thanks for sharing your preference and the supporting details. As an aside, I encourage incorporating considerations such as this into the specification informatively, alongside the normative prose. It helps explain the specification to people who look at it without the full context active WG participants have.
Layout support comes up in the MLOperand implementation that allows data shape broadcasting. https://chromium-review.googlesource.com/c/chromium/src/+/4396686/comment/f02acaeb_3c2795f2/ Supporting both channel-first and channel-last layouts will complicate the spec steps and the implementation, because the current NumPy broadcast rule implies right-most-first broadcasting. Example: the caller wants to apply a per-channel multiplication.
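To make the per-channel example concrete, here is a minimal sketch of the two layouts under right-most-first broadcasting. The shapes, variable names, and descriptor fields (`dataType`/`dimensions`) are illustrative assumptions, not code from the CL under review:

```ts
// Sketch: per-channel scale of a 3-channel image with NumPy-style
// right-aligned broadcasting, assuming WebNN type declarations are available.
async function perChannelScaleExample(scaleData: Float32Array) {
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  // Per-channel scale, shape [3].
  const scale = builder.constant({ dataType: 'float32', dimensions: [3] }, scaleData);

  // NHWC input [1, H, W, C]: the [3] scale lines up with the right-most
  // (channel) dimension, so mul() broadcasts directly.
  const nhwc = builder.input('nhwcInput', { dataType: 'float32', dimensions: [1, 224, 224, 3] });
  const scaledNhwc = builder.mul(nhwc, scale);

  // NCHW input [1, C, H, W]: the channel dimension is no longer right-most,
  // so the caller (or extra spec steps) must first reshape the scale to [3, 1, 1].
  const nchw = builder.input('nchwInput', { dataType: 'float32', dimensions: [1, 3, 224, 224] });
  const scaledNchw = builder.mul(nchw, builder.reshape(scale, [3, 1, 1]));

  return { scaledNhwc, scaledNchw };
}
```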
How to support case 1 isn't clear. Some questions might help the decision:
I have a slight preference for supporting only one layout (NHWC to be precise).
**Description**: This PR intends to enable the WebNN EP in ONNX Runtime Web. It translates ONNX nodes via the [WebNN API](https://webmachinelearning.github.io/webnn/), which is implemented in C++ and uses the Emscripten [Embind API](https://emscripten.org/docs/porting/connecting_cpp_and_javascript/embind.html#). It temporarily uses the preferred layout **NHWC** for WebNN graph partitions, due to the restriction in the WebNN XNNPack backend implementation and the ongoing [discussion](webmachinelearning/webnn#324) in the WebNN spec about whether WebNN should support both 'NHWC' and 'NCHW' layouts. There is no WebNN native EP; this is only for Web.

**Motivation and Context**: Allow ONNX Runtime Web developers to access the WebNN API and benefit from hardware acceleration.

**WebNN API Implementation Status in Chromium**:
- Tracked in Chromium issue: [#1273291](https://bugs.chromium.org/p/chromium/issues/detail?id=1273291)
- **CPU device**: based on the XNNPack backend; available on Chrome Canary M112 behind the "#enable-experimental-web-platform-features" flag for Windows and Linux platforms. Further implementation for more ops is ongoing.
- **GPU device**: based on DML; implementation is ongoing.

**Open**:
- GitHub CI: WebNN is currently only available on Chrome Canary/Dev with the XNNPack backend for Linux and Windows. This is an open question for reviewers: please help identify which GitHub CI should involve the WebNN EP and guide me to enable it. Thanks!
I want to share a data point: I was playing with Real-ESRGAN today. I'm not sure how well this transfers to other models (ESRGAN is heavily based on CNN + residual connections), though. I wonder if we should benchmark channel ordering on different hardware (i.e. vendors other than NVIDIA could optimize for channel_first). Or maybe this won't matter if the graph builder (or rather the optimizer) is "clever" enough.
There is a security perspective from @quidity (thanks Alex!) in the review of Chromium CL-4653303, "WebNN: Define conv2d operator in mojo". Alex mentioned:
FWIW, another way to tackle layout is to tell the implementation which layout should be used, like: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html This could be a hint to the implementation. Taking a step back, I still strongly prefer a single unified layout (i.e. NHWC) that's applied throughout the MLGraphBuilder methods, and letting the backend (e.g. DMLImpl) change the layout (if necessary) before sending work to the hardware.
From the end users' perspective, I sympathize with the single default layout idea: select the most supported one in the industry and let the backend make any changes needed, making layout an implementation detail. End users might need to convert layouts in a few corner cases. However, from an API design standpoint, there is also the question of what the clients of this API will want to control, i.e. what the API should expose. A user-facing API is more free to make simplifications by convention or by re-framing the user interaction. In the comment above there are arguments that a single default layout might also simplify usage of the API. When unsure, it is usually good practice in Web APIs to start with the simpler API and then extend it as needed, making sure extensibility is possible by design, i.e. no API breaks.
The design intent of WebNN as a backend API prioritizes completeness, efficiency, and expressiveness over ease of use. For instance, automatic shape inference is not supported, as it was assumed to be the responsibility of the calling framework or the app calling into WebNN directly. This limitation, while not as easy to use, allows the API to be more flexible and adaptable to different framework policies.

I agree with the premise that having excessive layout options makes the API harder to implement, and I think reducing the filter layout options is reasonable. However, a trickier conversation is about the input layout. Interestingly, the first option "nchw" is also the default in torch and ONNX, while the second option "nhwc" is supported natively in TensorFlow, historically influenced by the design of NVIDIA Tensor Cores' FP16 native layout in 2017, starting with the Volta generation (the Titan V). It isn't a mainstream layout supported by all other vendors, just a very common one on NVIDIA GPUs with the FP16 tensor data type. There are TensorFlow models nowadays that still rely on the NHWC input layout. These models, once converted to the other format, often result in each conv2d layer being bracketed by a pair of transposes, a superfluous outcome at first glance, but one easily collapsible by an intelligent backend later on. On the other hand, allowing the NHWC layout to propagate down through the graph could push the layout mismatches further down the stack and make it harder for the implementer to detect and optimize away the unneeded double transposes.

I support the removal of the input layout enum.
This change introduces a new section for Algorithms, following APIs, to collect algorithms referenced throughout the specification. A section for Broadcasting is introduced, which defines broadcasting shapes and gives an explicit algorithm matching WebNN implementations of NumPy's General Broadcasting Rules. Definitions for "broadcastable" and "unidirectionally broadcastable" are introduced. The previous definition of "broadcast-shapes" is removed in favor of these new algorithms. For webmachinelearning#324, webmachinelearning#462, and potentially webmachinelearning#523.
New content: Add definition for shape broadcasting. This change introduces a new section for Algorithms, following APIs, to collect algorithms referenced throughout the specification. A section for Broadcasting is introduced, which defines broadcasting shapes and gives an explicit algorithm matching WebNN implementations of NumPy's General Broadcasting Rules. Definitions for "broadcastable" and "unidirectionally broadcastable" are introduced. The previous definition of "broadcast-shapes" is removed in favor of these new algorithms. The broadcasting definition is used in expand(), rather than bespoke steps, and the prelu parameter order is fixed. For #324, #378, #462, and potentially #523. Co-authored-by: Dwayne Robinson <dwayner@microsoft.com>
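For reference, the core of NumPy's General Broadcasting Rules that the new section formalizes can be sketched compactly; this is an illustrative implementation, not the spec's algorithm text:

```ts
// Bidirectional broadcast of two shapes, right-aligned (NumPy-style).
// Returns the broadcast shape, or null if the shapes are not broadcastable.
function broadcastShapes(a: number[], b: number[]): number[] | null {
  const rank = Math.max(a.length, b.length);
  const out: number[] = new Array(rank);
  for (let i = 0; i < rank; i++) {
    // Missing leading dimensions are treated as 1.
    const dimA = a[a.length - 1 - i] ?? 1;
    const dimB = b[b.length - 1 - i] ?? 1;
    if (dimA !== dimB && dimA !== 1 && dimB !== 1) return null;
    out[rank - 1 - i] = Math.max(dimA, dimB);
  }
  return out;
}

// "Unidirectionally broadcastable" (a to b): only a may stretch its 1-sized
// (or missing) dimensions to match b; b itself never changes.
function isUnidirectionallyBroadcastable(a: number[], b: number[]): boolean {
  if (a.length > b.length) return false;
  for (let i = 0; i < a.length; i++) {
    const dimA = a[a.length - 1 - i];
    const dimB = b[b.length - 1 - i];
    if (dimA !== dimB && dimA !== 1) return false;
  }
  return true;
}

// The per-channel example from the layout discussion above:
broadcastShapes([1, 224, 224, 3], [3]); // [1, 224, 224, 3] -- NHWC broadcasts directly
broadcastShapes([1, 3, 224, 224], [3]); // null -- NCHW needs the scale reshaped to [3, 1, 1]
```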
A summary of Chromium prototyping:
Discussion in the WebML WG meeting at TPAC, resolved to close. https://www.w3.org/2024/09/23-webmachinelearning-minutes.html#t05
I don't believe any spec updates are needed here, so closing, but please re-open if I missed something, @huningxin!
We need to remove pooling's rounding direction for the output dimensions and just use:

```webidl
dictionary MLPool2dOptions : MLOperatorOptions {
  sequence<[EnforceRange] unsigned long> windowDimensions;
  sequence<[EnforceRange] unsigned long> padding;
  sequence<[EnforceRange] unsigned long> strides;
  sequence<[EnforceRange] unsigned long> dilations;
  MLInputOperandLayout layout = "nchw";
- MLRoundingType roundingType = "floor";
  sequence<[EnforceRange] unsigned long> outputSizes;
};
```

https://www.w3.org/TR/webnn/#api-mlgraphbuilder-pool2d-average
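If roundingType goes away, the caller computes the output sizes itself. Below is a minimal sketch of that computation under the usual pooling formula; the function name and the dilation/padding handling are illustrative assumptions, not spec text:

```ts
// Output size for one spatial dimension of pool2d:
//   effectiveWindow = (window - 1) * dilation + 1
//   output = round((input + padBegin + padEnd - effectiveWindow) / stride) + 1
function pool2dOutputSize(
  input: number,
  window: number,
  stride: number,
  dilation = 1,
  padBegin = 0,
  padEnd = 0,
  round: (x: number) => number = Math.floor, // caller picks Math.floor or Math.ceil
): number {
  const effectiveWindow = (window - 1) * dilation + 1;
  return round((input + padBegin + padEnd - effectiveWindow) / stride) + 1;
}

// Example: a 7-wide dimension, 2-wide window, stride 2, no padding or dilation.
const floorSize = pool2dOutputSize(7, 2, 2);                      // 3
const ceilSize = pool2dOutputSize(7, 2, 2, 1, 0, 0, Math.ceil);   // 4
// The caller then passes the chosen values via the outputSizes member.
```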
In the existing WebNN spec, conv2d supports two input operand layouts defined by MLInputOperandLayout and four filter operand layouts defined by MLConv2dFilterOperandLayout.
This may make the implementation more complicated, especially if a native ML framework or OS API doesn't support some of these layouts. If one layout is unsupported, the implementation may need to insert `transpose` operations into the graph around the `conv2d` operation to convert the unsupported layout to a supported one. This would easily lead to an inefficient graph representation that may have redundant `transpose` operations. Alternatively, the implementation may need to optimize the graph with techniques such as "transpose sink", which would require a more complex implementation. This issue was raised in a Chromium CL review.

To simplify the implementation, the proposal is to reduce the supported operand layouts, for example to just keep the default one. Because WebNN supports the `transpose` operation, layout adaptation and graph-level optimization can be handled by ML frameworks, which usually already support such functionalities.

Thanks @wacky6 for this idea.
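As an illustration of the transpose insertion described above, here is a minimal sketch of how a framework (or an implementation) could adapt an "nchw" graph to a backend that only handles "nhwc"; the function name and operand shapes are assumptions, not part of the proposal:

```ts
// Wrap a conv2d so the surrounding graph keeps its NCHW layout while the
// backend only ever sees NHWC input and OHWI filters.
function conv2dOnNhwcOnlyBackend(
  builder: MLGraphBuilder,
  nchwInput: MLOperand,  // [N, C, H, W]
  ohwiFilter: MLOperand, // [O, H, W, I]
): MLOperand {
  // NCHW -> NHWC before the convolution.
  const nhwcInput = builder.transpose(nchwInput, { permutation: [0, 2, 3, 1] });
  const nhwcOutput = builder.conv2d(nhwcInput, ohwiFilter, {
    inputLayout: 'nhwc',
    filterLayout: 'ohwi',
  });
  // NHWC -> NCHW after, restoring the caller's layout. This back-to-back pair
  // around every conv2d is exactly the redundancy that a "transpose sink"
  // optimization would try to collapse.
  return builder.transpose(nhwcOutput, { permutation: [0, 3, 1, 2] });
}
```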