Add Arm DSP implementation of Depthwise Conv2D #12448
Conversation
def depthwise_conv2d_nhwc_dsp_compute(cfg, data, kernel, strides, padding, dilation, out_dtype):
    """Compute function for v7e-m DSP instructions of DepthwiseConv2D. Has a lot of requirements
    for use - if not all apply, the fallback implementation will be used instead."""
    assert isinstance(strides, int) or len(strides) == 2
should we just use type annotations for this, or use them in addition?
I'd be open to switching over to type annotations, but this is the style followed by all other schedules in topi/arm_cpu/mprofile/dsp. IMO we should make a new PR to do this for all dsp schedules, but I'm open to suggestions.
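For reference, a sketch of what the annotation-based style might look like (the exact annotation types here are an assumption, not the PR's code; note that annotations are documentation only and are not enforced at runtime, so the defensive assert would still be needed):

```python
from typing import Tuple, Union

def depthwise_conv2d_nhwc_dsp_compute(
    cfg, data, kernel,
    strides: Union[int, Tuple[int, int]],  # annotation documents intent only
    padding, dilation,
    out_dtype: str,
):
    # Python does not check annotations at runtime, so keep the assert.
    assert isinstance(strides, int) or len(strides) == 2
    ...
```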
Unit tests have been added and verified to work - the PR is now ready for review. Thanks @areusch for your preliminary look!
cc @leandron @ekalda @ashutosh-arm could you have a look?
Great work @guberti, it's great to see specialised schedules for M class cores! Also, bonus points for abundant documentation 🏅
I am not an M class expert, but I took a look and it looks very good to me in general.
I spotted some criticism around CMSIS-NN not doing the depthwise convolutions in CHW - I did some digging and learned that it had been attempted for non-DSP cores, and it resulted in a performance degradation in networks because rescaling from int32 to int8 became a bottleneck. There was an asterisk, though, that performance should be better for DSP-enabled cores, which is what is supported in this patch (but in defense of the CMSIS-NN folks, CMSIS-NN is an embedded library with size constraints, so the kernels have to work for a wide variety of cores). So out of interest, did you look at how this schedule performs in multi-operator networks? But anyway, I think this is a very welcome addition to TVM :)
Thanks for the review @ekalda - I've addressed your comments. As for performance in multi-operator networks, I haven't done a ton of looking, as there are still performance improvements I'll make to this schedule in a follow-up PR (scroll up for more details on what these are). That said, I have verified that this change improves performance. When combined with the bugfix in #12671, this PR decreases the total runtime of MobileNet V1 0.25 on a Cortex-M4 Nucleo board by almost 10%.
thanks @guberti for all the hard work on this!
Overall looks good to me. Left a few comments for my understanding.
    quad_channel_convolve_impl,
)

# For depthwise_conv2d, kernels are normally given in HWOI format,
Awesome comment 💯
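To illustrate the idea behind grouping channels into quads, here is a pure-Python sketch of the concept only - the shapes are hypothetical and the exact rearrangement performed by the micro kernel may differ:

```python
# Hypothetical 3x3 depthwise kernel over 8 channels, channel-last layout,
# filled with distinct values so the rearrangement is easy to check.
KH, KW, C = 3, 3, 8
kernel = [[[h * KW * C + w * C + c for c in range(C)]
           for w in range(KW)] for h in range(KH)]

# For each group of four channels, lay out all KH*KW taps contiguously, so a
# DSP inner loop can load one word and process four channels per iteration.
rearranged = [
    [kernel[h][w][4 * q + i] for h in range(KH) for w in range(KW) for i in range(4)]
    for q in range(C // 4)
]
```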
data_layout = tvm.testing.parameter("NHWC")
dtype = tvm.testing.parameter("int8")
kernel_layout = tvm.testing.parameter("HWOI")
schedule_name = tvm.testing.parameter("depthwise_conv2d_nhwc_dsp.arm_cpu")
I am not sure what the basic infra supports in this test suite, but does it offer infra to test negative / invalid cases where the schedule of this PR is not invoked?
Unfortunately, we don't have infrastructure to test invalid cases :(. It's definitely worth addressing in a future PR, though.
Thanks for the comments @ashutosh-arm! They should be addressed by 617e0b1.
- depthwise_conv2d kernel re-arranging
- fast bytecode for dsp
- copy/modify helper code
- Bugfixes from code testing
- Much of the depthwise conv2d schedule
- V1 DSP DWC2D
- black formatting
- Minor work to address comments
- Functional DWC2D schedule with test
- Code cleanup and linting
- Fix padding to match Relay and add tests
- Fix test cases
Currently, our microTVM implementation of `depthwise_conv2d` uses the fallback schedule, and performance is subsequently terrible. This change adds a schedule for certain cases of `depthwise_conv2d` when it is run on a Cortex-M4 or M7 based chip (though I mainly thought about the M4). Almost all of the "big" performance speedups have been implemented, which should make our implementation faster than TFLite Micro and comparable to CMSIS-NN:

- Loads four `int8` values from the kernel and input tensor at a time. This is the main source of our speedup.
- Uses the `__SMLAD` instruction to compute convolutions for four channels at once.
- When `stride > 1`, pads the kernel asymmetrically to slightly reduce the size of the padded tensor.

However, in the interest of merging a PR, I did not implement a few other optimizations. The most important one is that this schedule is not autotunable in any meaningful way (besides reordering a few loops). In an ideal world, we would use custom knobs to allow reordering of the instructions inside `QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` (e.g. do we load the kernel from memory first, or perform halfword packs on our input tensor first?). This would improve performance on the M4 by a little bit, but I suspect it would improve M7 performance a lot.

Additionally, I would have liked to handle the edges of the convolution with strip mining, instead of by padding the input tensor. This padding requires copying the entire tensor, and is therefore slow, but support for strip mining in TVM is pretty bad. A few other desired improvements:

- Using `QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP` for the entry in kernels with an odd number of entries (e.g. 3x3 kernels)

I'm marking this PR as a draft for now, since there currently aren't any tests. Thanks to Andrew for his help with explaining how TVM does scheduling!