
Add Arm DSP implementation of Depthwise Conv2D #12448

Merged · 6 commits · Sep 8, 2022

Conversation

@guberti (Member) commented Aug 16, 2022

Currently, our microTVM implementation of depthwise_conv2d uses the fallback schedule, and performance is consequently terrible. This change adds a schedule for certain cases of depthwise_conv2d when it is run on a Cortex-M4 or M7 based chip (though I mainly thought about the M4). Almost all of the "big" performance speedups have been implemented, which should make our implementation faster than TFLite Micro and comparable to CMSIS-NN:

  • Performs 4x fewer memory loads than the fallback implementation by loading four int8 values from the kernel and input tensor at a time. This is the main source of our speedup.
  • Uses a hand-written assembly micro-kernel utilizing the __SMLAD instruction to compute convolutions for four channels at once.
  • Uses a specialized kernel packing to remove four assembly instructions from the micro kernel.
  • When stride>1, pads the kernel asymmetrically to slightly reduce the size of the padded tensor.
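For readers unfamiliar with the instruction mentioned above: __SMLAD multiplies the two signed 16-bit halfwords of its operands pairwise and adds both products to an accumulator in a single cycle. A minimal Python sketch of its semantics (the function name and packing here are illustrative; the PR itself emits hand-written Arm assembly):

```python
def smlad(x: int, y: int, acc: int) -> int:
    """Emulate Arm's __SMLAD: multiply the signed 16-bit halfwords
    of x and y pairwise and add both products to acc."""
    def lo16(v):
        # Interpret the low 16 bits of v as a signed halfword.
        v &= 0xFFFF
        return v - 0x10000 if v >= 0x8000 else v

    def hi16(v):
        # Interpret the high 16 bits of v as a signed halfword.
        return lo16(v >> 16)

    return acc + lo16(x) * lo16(y) + hi16(x) * hi16(y)

# Pack two signed 16-bit values per word: x holds (3, -2), y holds (5, 7).
x = (3 << 16) | (-2 & 0xFFFF)
y = (5 << 16) | (7 & 0xFFFF)
print(smlad(x, y, 10))  # 10 + 3*5 + (-2)*7 = 11
```

Two int8 multiply-accumulates per instruction (after widening pairs of int8 values into packed halfwords) is what makes processing four channels at a time profitable.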

However, in the interest of merging a PR I did not implement a few other optimizations. The most important one is that this schedule is not autotunable in any meaningful way (besides reordering a few loops). In an ideal world, we would use custom knobs to allow reordering of the instructions inside QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP (e.g. do we load the kernel from memory first, or perform halfword packs on our input tensor first?). This would improve performance on the M4 by a little bit, but I suspect would improve M7 performance a lot.

Additionally, I would have liked to handle the edges of the convolution with strip mining, instead of by padding the input tensor. This padding requires copying the entire tensor, and is therefore slow, but support for strip mining in TVM is pretty bad. A few other desired improvements:

  • Custom knobs for reordering instructions in micro kernel
  • Replace tensor padding with strip mining or something else
  • Use a specialized version of QUAD_CHANNEL_TENSOR_REARRANGE_SUM_DSP for the leftover entry in kernels with an odd number of entries (e.g. 3x3 kernels)
  • Generalize the micro kernel to support kernel sizes beyond 3x3
  • Similar to the above, remove other restrictions on the use of this micro kernel (e.g. support kernel dilation)
  • Allow requantization and ReLU instructions to be fused in a way that's not slow
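To illustrate the padding arithmetic the schedule has to deal with: for a strided convolution, the total padding needed for SAME-style output is often odd, in which case it can be split asymmetrically so the padded tensor is one row or column smaller than naive symmetric padding would make it. A hedged sketch (helper names hypothetical, not from the PR) for one spatial dimension:

```python
def conv_out_size(in_size, kernel, stride, pad_before, pad_after):
    # Standard convolution output size for one spatial dimension.
    return (in_size + pad_before + pad_after - kernel) // stride + 1

def same_padding(in_size, kernel, stride):
    # SAME-style padding for one dimension. When the total padding is
    # odd, put the extra element after rather than before, so the
    # leading edge gets less padding.
    out = -(-in_size // stride)  # ceil division
    total = max((out - 1) * stride + kernel - in_size, 0)
    return total // 2, total - total // 2

before, after = same_padding(48, 3, 2)
print(before, after)          # 0 1 — asymmetric when stride > 1
print(conv_out_size(48, 3, 2, before, after))  # 24
```

Padding (0, 1) instead of (1, 1) here saves a full row of the padded tensor, which matters when the whole tensor must be copied to apply the padding.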

I'm marking this PR as a draft for now, since there currently aren't any tests. Thanks to Andrew for his help with explaining how TVM does scheduling!

@areusch (Contributor) left a comment

great work @guberti ! here are a couple early suggestions. @Mousius @leandron would you guys be up for reviewing this one?

Review threads (resolved): python/tvm/relay/op/strategy/arm_cpu.py, python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py
def depthwise_conv2d_nhwc_dsp_compute(cfg, data, kernel, strides, padding, dilation, out_dtype):
"""Compute function for v7e-m DSP instructions of DepthwiseConv2D. Has a lot of requirements
for use - not not all apply, the fallback implementation will be used instead."""
assert isinstance(strides, int) or len(strides) == 2
A Contributor asked: should we just use type annotations for this, or use them in addition?

@guberti (Member, Author) replied:
I'd be open to switching over to type annotations, but this is the style followed by all other schedules in topi/arm_cpu/mprofile/dsp. IMO we should make a new PR to do this for all dsp schedules, but I'm open to suggestions.
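For reference, a type-annotated version of the signature under discussion might look like the sketch below. The parameter types are assumptions based on common TVM conventions, not the merged code:

```python
from typing import Tuple, Union

# An integer, or a pair of integers for the (height, width) dimensions.
IntOrPair = Union[int, Tuple[int, int]]

def depthwise_conv2d_nhwc_dsp_compute(
    cfg,                                # autotuning config space
    data,                               # te.Tensor in NHWC layout
    kernel,                             # te.Tensor in HWOI layout
    strides: IntOrPair,
    padding: Union[IntOrPair, str],
    dilation: IntOrPair,
    out_dtype: str,
):
    # Body elided - only the annotated signature is sketched here.
    ...
```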

Review threads (resolved): python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py (4 threads)
@guberti changed the title from "Add Arm DSP implentation of Depthwise Conv2D" to "Add Arm DSP implementation of Depthwise Conv2D" Aug 25, 2022
@guberti guberti force-pushed the micro/depthwise_conv2d_dsp branch 2 times, most recently from 2533396 to 27e0663 Compare August 31, 2022 16:09
@guberti (Member, Author) commented Aug 31, 2022

Unit tests have been added and verified to work - the PR is now ready for review. Thanks @areusch for your preliminary look!

@guberti guberti marked this pull request as ready for review August 31, 2022 18:33
@areusch (Contributor) commented Aug 31, 2022

cc @leandron @ekalda @ashutosh-arm could you have a look?

@ekalda (Contributor) left a comment

Great work @guberti, it's great to see specialised schedules for M class cores! Also, bonus points for abundant documentation 🏅

I am not an M class expert, but I took a look and it looks very good to me in general.

I spotted some criticism around CMSIS-NN not doing depthwise convolutions in CHW - I did some digging and learned that it has been attempted for non-DSP cores and resulted in a performance degradation in networks, because rescaling from int32 to int8 became a bottleneck. There was an asterisk, though, that performance should be better for DSP-enabled cores, which is what this patch supports (but in defense of the CMSIS-NN folk, CMSIS-NN is an embedded library with size constraints, so the kernels have to work for a wide variety of cores). So out of interest, did you look at how this schedule performs in multi-operator networks? But anyway, I think this is a very welcome addition to TVM :)

@guberti (Member, Author) commented Sep 1, 2022

Thanks for the review @ekalda - I've addressed your comments. As for performance in multi-operator networks, I haven't done a ton of looking, as there are still performance improvements I'll make to this schedule in a follow-up PR (scroll up for more details on what these are).

That said, I have verified that this change improves performance. When combined with the bugfix in #12671, this PR decreases the total runtime of MobileNet V1 0.25 on a Cortex-M4 Nucleo board by almost 10%.

@areusch (Contributor) left a comment

thanks @guberti for all the hard work on this!

@ashutosh-arm (Contributor) left a comment

Overall looks good to me. Left a few comments for my understanding.

Review threads: python/tvm/relay/op/strategy/arm_cpu.py (one resolved, one open)
quad_channel_convolve_impl,
)

# For depthwise_conv2d, kernels are normally given in HWOI format,
A Contributor commented: Awesome comment 💯
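To make the HWOI layout discussion concrete: for depthwise conv2d, an HWOI kernel is commonly shaped (kernel_h, kernel_w, channels, channel_multiplier) - that interpretation, and the packing order below, are assumptions for illustration, not the PR's actual rearrangement. A small NumPy sketch of grouping four channels contiguously, mirroring the micro-kernel's four-channels-at-a-time word loads:

```python
import numpy as np

# Hypothetical HWOI depthwise kernel, assumed here to be
# (kernel_h, kernel_w, channels, channel_multiplier).
kh, kw, channels, mult = 3, 3, 8, 1
kernel_hwoi = np.arange(kh * kw * channels * mult, dtype=np.int8).reshape(
    kh, kw, channels, mult
)

# Pack groups of four channels contiguously so four int8 kernel values
# can be fetched with a single word load (sketch only; the PR's real
# packing order differs).
packed = (
    kernel_hwoi[:, :, :, 0]   # drop the multiplier axis (== 1 here)
    .transpose(2, 0, 1)       # -> (channels, kh, kw)
    .reshape(channels // 4, 4, kh, kw)
)
print(packed.shape)  # (2, 4, 3, 3)
```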

Review thread (resolved): python/tvm/topi/arm_cpu/mprofile/dsp/depthwise_conv2d.py
data_layout = tvm.testing.parameter("NHWC")
dtype = tvm.testing.parameter("int8")
kernel_layout = tvm.testing.parameter("HWOI")
schedule_name = tvm.testing.parameter("depthwise_conv2d_nhwc_dsp.arm_cpu")
A Contributor asked: I am not sure what the basic infra supports in this test suite, but does it offer infra to test negative / invalid cases where the schedule of this PR is not invoked?

@guberti (Member, Author) replied: Unfortunately, we don't have infrastructure to test invalid cases :(. It's definitely worth addressing in a future PR, though.

@guberti (Member, Author) commented Sep 2, 2022

Thanks for the comments @ashutosh-arm! They should be addressed by 617e0b1.

Commits:

  • depthwise_conv2d kernel re-arranging
  • fast bytecode for dsp
  • copy/modify helper code
  • Bugfixes from code testing
  • Much of the depthwise conv2d schedule
  • V1 DSP DWC2D
  • black formatting
  • Minor work to address comments
  • Functional DWC2D schedule with test
  • Code cleanup and linting
  • Fix padding to match Relay and add tests
  • Fix test cases
@areusch areusch merged commit 62bdc91 into apache:main Sep 8, 2022
xinetzone pushed a commit to daobook/tvm that referenced this pull request Nov 25, 2022

4 participants