
Add Arm DSP implementation of Depthwise Conv2D #12448

Merged · 6 commits · Sep 8, 2022

Conversation

@guberti (Member) commented Aug 16, 2022

Currently, our microTVM implementation of depthwise_conv2d uses the fallback schedule, and performance is consequently terrible. This change adds a schedule for certain cases of depthwise_conv2d when it is run on a Cortex-M4 or M7 based chip (though I mainly thought about the M4). Almost all of the "big" performance speedups have been implemented, which should make our implementation faster than TFLite Micro and comparable to CMSIS-NN:

  • Performs 4x fewer memory loads than the fallback implementation by loading four int8 values from the kernel and input tensor at a time. This is the main source of our speedup.
  • Uses a hand-written assembly micro-kernel utilizing the __SMLAD instruction to compute convolutions for four channels at once.
  • Uses a specialized kernel packing to remove four assembly instructions from the micro kernel.
  • When stride>1, pads the kernel asymmetrically to slightly reduce the size of the padded tensor.
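For readers unfamiliar with the instruction mentioned above: __SMLAD multiplies the two signed 16-bit halfwords of its operands pairwise and adds both products to an accumulator in a single cycle. A minimal Python sketch of its semantics (the function name and packing here are illustrative; the PR itself emits hand-written Arm assembly):

```python
def smlad(x: int, y: int, acc: int) -> int:
    """Emulate Arm's __SMLAD: multiply the signed 16-bit halfwords
    of x and y pairwise and add both products to acc."""
    def lo16(v):
        # Interpret the low 16 bits of v as a signed halfword.
        v &= 0xFFFF
        return v - 0x10000 if v >= 0x8000 else v

    def hi16(v):
        # Interpret the high 16 bits of v as a signed halfword.
        return lo16(v >> 16)

    return acc + lo16(x) * lo16(y) + hi16(x) * hi16(y)

# Pack two signed 16-bit values per word: x holds (3, -2), y holds (5, 7).
x = (3 << 16) | (-2 & 0xFFFF)
y = (5 << 16) | (7 & 0xFFFF)
print(smlad(x, y, 10))  # 10 + 3*5 + (-2)*7 = 11
```

Two int8 multiply-accumulates per instruction (after widening pairs of int8 values into packed halfwords) is what makes processing four channels at a time profitable.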

However, in the interest of merging a PR I did not implement a few other optimizations. The most important one is that this schedule is not autotunable in any meaningful way (besides reordering a few loops). In an ideal world, we would use custom knobs to allow reordering of the instructions inside QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP (e.g. do we load the kernel from memory first, or perform halfword packs on our input tensor first?). This would improve performance on the M4 by a little bit, but I suspect would improve M7 performance a lot.

Additionally, I would have liked to handle the edges of the convolution with strip mining, instead of by padding the input tensor. This padding requires copying the entire tensor, and is therefore slow, but support for strip mining in TVM is pretty bad. A few other desired improvements:

  • Custom knobs for reordering instructions in micro kernel
  • Replace tensor padding with strip mining or something else
  • Use a specialized version of QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP for the leftover entry in kernels with an odd number of entries (e.g. 3x3 kernels)
  • Generalize the micro kernel to support kernel sizes beyond 3x3
  • Similar to the above, remove other restrictions on the use of this micro kernel (e.g. support kernel dilation)
  • Allow requantization and ReLU instructions to be fused in a way that's not slow
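To illustrate the padding arithmetic the schedule has to deal with: for a strided convolution, the total padding needed for SAME-style output is often odd, in which case it can be split asymmetrically so the padded tensor is one row or column smaller than naive symmetric padding would make it. A hedged sketch (helper names hypothetical, not from the PR) for one spatial dimension:

```python
def conv_out_size(in_size, kernel, stride, pad_before, pad_after):
    # Standard convolution output size for one spatial dimension.
    return (in_size + pad_before + pad_after - kernel) // stride + 1

def same_padding(in_size, kernel, stride):
    # SAME-style padding for one dimension. When the total padding is
    # odd, put the extra element after rather than before, so the
    # leading edge gets less padding.
    out = -(-in_size // stride)  # ceil division
    total = max((out - 1) * stride + kernel - in_size, 0)
    return total // 2, total - total // 2

before, after = same_padding(48, 3, 2)
print(before, after)          # 0 1 — asymmetric when stride > 1
print(conv_out_size(48, 3, 2, before, after))  # 24
```

Padding (0, 1) instead of (1, 1) here saves a full row of the padded tensor, which matters when the whole tensor must be copied to apply the padding.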

I'm marking this PR as a draft for now, since there currently aren't any tests. Thanks to Andrew for his help with explaining how TVM does scheduling!

@areusch (Contributor) left a comment

great work @guberti ! here are a couple early suggestions. @Mousius @leandron would you guys be up for reviewing this one?

Review threads (resolved): python/tvm/relay/op/strategy/arm_cpu.py, python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py
def depthwise_conv2d_nhwc_dsp_compute(cfg, data, kernel, strides, padding, dilation, out_dtype):
"""Compute function for v7e-m DSP instructions of DepthwiseConv2D. Has a lot of requirements
for use - not not all apply, the fallback implementation will be used instead."""
assert isinstance(strides, int) or len(strides) == 2
A Contributor asked: should we just use type annotations for this, or use them in addition?

@guberti (Member, Author) replied:
I'd be open to switching over to type annotations, but this is the style followed by all other schedules in topi/arm_cpu/mprofile/dsp. IMO we should make a new PR to do this for all dsp schedules, but I'm open to suggestions.
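For reference, a type-annotated version of the signature under discussion might look like the sketch below. The parameter types are assumptions based on common TVM conventions, not the merged code:

```python
from typing import Tuple, Union

# An integer, or a pair of integers for the (height, width) dimensions.
IntOrPair = Union[int, Tuple[int, int]]

def depthwise_conv2d_nhwc_dsp_compute(
    cfg,                                # autotuning config space
    data,                               # te.Tensor in NHWC layout
    kernel,                             # te.Tensor in HWOI layout
    strides: IntOrPair,
    padding: Union[IntOrPair, str],
    dilation: IntOrPair,
    out_dtype: str,
):
    # Body elided - only the annotated signature is sketched here.
    ...
```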

Review threads (resolved): python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py (4 threads)
@guberti changed the title from "Add Arm DSP implentation of Depthwise Conv2D" to "Add Arm DSP implementation of Depthwise Conv2D" Aug 25, 2022
@guberti guberti force-pushed the micro/depthwise_conv2d_dsp branch 2 times, most recently from 2533396 to 27e0663 Compare August 31, 2022 16:09
@guberti (Member, Author) commented Aug 31, 2022

Unit tests have been added and verified to work - the PR is now ready for review. Thanks @areusch for your preliminary look!

@guberti guberti marked this pull request as ready for review August 31, 2022 18:33
@areusch (Contributor) commented Aug 31, 2022

cc @leandron @ekalda @ashutosh-arm could you have a look?

@ekalda (Contributor) left a comment

Great work @guberti, it's great to see specialised schedules for M class cores! Also, bonus points for abundant documentation 🏅

I am not an M class expert, but I took a look and it looks very good to me in general.

I spotted some criticism around CMSIS-NN not doing depthwise convolutions in CHW - I did some digging and learned that it has been attempted for non-DSP cores and resulted in a performance degradation in networks, because rescaling from int32 to int8 became a bottleneck. There was an asterisk, though, that performance should be better for DSP-enabled cores, which is what this patch supports (but in defense of the CMSIS-NN folk, CMSIS-NN is an embedded library with size constraints, so the kernels have to work for a wide variety of cores). So out of interest, did you look at how this schedule performs in multi-operator networks? But anyway, I think this is a very welcome addition to TVM :)

@guberti (Member, Author) commented Sep 1, 2022

Thanks for the review @ekalda - I've addressed your comments. As for performance in multi-operator networks, I haven't done a ton of looking, as there are still performance improvements I'll make to this schedule in a follow-up PR (scroll up for more details on what these are).

That said, I have verified that this change improves performance. When combined with the bugfix in #12671, this PR decreases the total runtime of MobileNet V1 0.25 on a Cortex-M4 Nucleo board by almost 10%.

@areusch (Contributor) left a comment

thanks @guberti for all the hard work on this!

@ashutosh-arm (Contributor) left a comment

Overall looks good to me. Left a few comments for my understanding.

Review threads: python/tvm/relay/op/strategy/arm_cpu.py (one resolved, one open)
quad_channel_convolve_impl,
)

# For depthwise_conv2d, kernels are normally given in HWOI format,
A Contributor commented: Awesome comment 💯
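To make the HWOI layout discussion concrete: for depthwise conv2d, an HWOI kernel is commonly shaped (kernel_h, kernel_w, channels, channel_multiplier) - that interpretation, and the packing order below, are assumptions for illustration, not the PR's actual rearrangement. A small NumPy sketch of grouping four channels contiguously, mirroring the micro-kernel's four-channels-at-a-time word loads:

```python
import numpy as np

# Hypothetical HWOI depthwise kernel, assumed here to be
# (kernel_h, kernel_w, channels, channel_multiplier).
kh, kw, channels, mult = 3, 3, 8, 1
kernel_hwoi = np.arange(kh * kw * channels * mult, dtype=np.int8).reshape(
    kh, kw, channels, mult
)

# Pack groups of four channels contiguously so four int8 kernel values
# can be fetched with a single word load (sketch only; the PR's real
# packing order differs).
packed = (
    kernel_hwoi[:, :, :, 0]   # drop the multiplier axis (== 1 here)
    .transpose(2, 0, 1)       # -> (channels, kh, kw)
    .reshape(channels // 4, 4, kh, kw)
)
print(packed.shape)  # (2, 4, 3, 3)
```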

Review thread (resolved): python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py
data_layout = tvm.testing.parameter("NHWC")
dtype = tvm.testing.parameter("int8")
kernel_layout = tvm.testing.parameter("HWOI")
schedule_name = tvm.testing.parameter("depthwise_conv2d_nhwc_dsp.arm_cpu")
A Contributor asked: I am not sure what the basic infra supports in this test suite, but does it offer infra to test negative / invalid cases where the schedule of this PR is not invoked?

@guberti (Member, Author) replied: Unfortunately, we don't have infrastructure to test invalid cases :(. It's definitely worth addressing in a future PR, though.

@guberti (Member, Author) commented Sep 2, 2022

Thanks for the comments @ashutosh-arm! They should be addressed by 617e0b1.

Commits:

  • depthwise_conv2d kernel re-arranging
  • fast bytecode for dsp
  • copy/modify helper code
  • Bugfixes from code testing
  • Much of the depthwise conv2d schedule
  • V1 DSP DWC2D
  • black formatting
  • Minor work to address comments
  • Functional DWC2D schedule with test
  • Code cleanup and linting
  • Fix padding to match Relay and add tests
  • Fix test cases
@areusch areusch merged commit 62bdc91 into apache:main Sep 8, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022

4 participants