[TOPI] Vectorize depthwise conv2d output operator #14519
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Thank you for your PR. Should we add a new test case?
```
@@ -394,7 +394,8 @@ def schedule_conv_out(out):
        ci_outer, ci_inner = s[out].split(ci, 4)
        s[out].vectorize(ci_inner)
        s[out].unroll(ci_outer)
    else:
        s[out].vectorize(ci)
```
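For context, here is a minimal, self-contained sketch of the pattern this hunk applies. Everything except the `ci`/`ci_outer`/`ci_inner` names is hypothetical (dummy shapes and a stub "conv" stage rather than the real depthwise convolution):

```python
import tvm
from tvm import te, topi

# Hypothetical NHWC tensor; the real schedule works on the depthwise
# conv2d output stage, stubbed here with a dummy elementwise "conv".
n, h, w, c = 1, 16, 16, 32
data = te.placeholder((n, h, w, c), name="data", dtype="float32")
conv = te.compute((n, h, w, c),
                  lambda nn, hh, ww, cc: data[nn, hh, ww, cc] * 2.0,
                  name="conv")
out = topi.nn.relu(conv)  # the fused output operator, e.g. ReLU

s = te.create_schedule(out.op)
_, _, _, ci = s[out].op.axis
# Same pattern as the hunk: split the channel axis by the vector width
# (4 x float32 = one 128-bit Neon vector), vectorize the inner part and
# unroll the outer part.
ci_outer, ci_inner = s[out].split(ci, 4)
s[out].vectorize(ci_inner)
s[out].unroll(ci_outer)
print(tvm.lower(s, [data, out], simple_mode=True))
```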
Shouldn't the value of ci be limited somehow? What if the value of ci is huge? You could use the filter argument in define_split to prevent this situation.
This is a good point. I'm not sure what would happen if ci was huge; I guess LLVM would legalize the vectors and loop over the size of the hardware vector registers when storing 32-bit floats. I sort of just copied this from above, though, using the same scheduling that we use for the actual convolution. Since the fallback scheduling uses a loop split where ci = 8, and for the hardware in question this is the number of 32-bit floats in a vector, I think we should be okay. In the case where we aren't using the fallback, i.e. auto-scheduling, I would hope the auto-scheduler is able to find the optimal choice for ci, which should be at least as good as 8, but perhaps there are scenarios where it could choose a bad value?
We could do something similar to how we handle regular convolutions, and define a tunable parameter for vectorization, but that would potentially change the scheduling for the depthwise convolution here, rather than its output, which is sort of beyond the initial scope of this PR.
What are your thoughts?
Thank you for your answer. I think that to avoid guesses about the size of ci (are you sure that it will be equal to 8?) you can just add a filter parameter to define_split, e.g.:

cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 32)

In this case I believe we can guarantee that ci won't be bigger than 32. What do you think?
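To make the effect concrete, here is a small stand-alone sketch of such a filter (hypothetical, not the real depthwise conv2d template; it uses a bound of 8 rather than 32 and a channel length of 32 purely for illustration):

```python
from tvm.autotvm.task.space import ConfigSpace

# The filter prunes split candidates whose inner factor, entry.size[1],
# exceeds the desired vector width.
cfg = ConfigSpace()
c = ConfigSpace.axis(32)  # a channel axis of length 32
cfg.define_split("tile_c", c, num_outputs=2,
                 filter=lambda entry: entry.size[1] <= 8)
# Of the factor pairs of 32, only those with inner factor 1, 2, 4 or 8
# survive: (32, 1), (16, 2), (8, 4) and (4, 8).
print(cfg.space_map["tile_c"])
```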
I think we are already defining a split here, though; this is where I got the value of 8 from.
That split will only be used for runs without tuning statistics (i.e. in the default fallback mode). I suggest restricting the search space in tuning by adding a filter value in define_split; it will help to improve tuning time and drop useless configurations.

By the way, I reread your message and noticed that I was probably wrong when I suggested restricting the vectorization size to 32. You wrote that it is storing 32-bit floats, which means that the maximum useful vectorization width is 8, am I right?
Cool, thanks for the suggestion, I've added a filter on the split.

Yeah, I think you are almost correct: for Neon, where vectors are 128 bits and a float is 32 bits, there will be 128 / 32 = 4 elements processed in each vector operation. I think the backend will split the 8-element LLVM vectors into two 4-element vectors during legalization, so this isn't a problem.
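A quick way to see what LLVM is handed is to lower an 8-wide vectorized loop and look at the TIR; this sketch uses hypothetical shapes and names:

```python
import tvm
from tvm import te

# 1-D example: split by 8 and vectorize the inner loop.
a = te.placeholder((64,), dtype="float32", name="a")
b = te.compute((64,), lambda i: a[i] + 1.0, name="b")
s = te.create_schedule(b.op)
io, ii = s[b].split(b.op.axis[0], factor=8)
s[b].vectorize(ii)
# The printed TIR contains float32x8 loads/stores; when targeting
# 128-bit Neon, the LLVM backend legalizes each of them into two
# 128 / 32 = 4-wide operations.
print(tvm.lower(s, [a, b], simple_mode=True))
```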
Thanks for the speedy review :D Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on TIR?
I believe that a test in test_conv2d.py might be enough. You could probably also add a TIR or codegen test to check that vectorization was applied, but in that case the codegen test would need to call an existing schedule instead of writing a new one.
Sorry, I've been on annual leave and just got back to reviewing this. I took a look at test_conv2d.py, although wouldn't a more appropriate place for this be test_depthwise_conv2d.py? I think at present we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?
I thought that this schedule is a common depthwise conv2d schedule for Arm, and it doesn't matter whether the target is Cortex-A or anything else, am I right? But anyway, I'm OK if we merge this PR without a test, because it is pretty simple, and writing a good test that checks the generated TIR would probably require too much effort.
Sorry, yes, you are correct, other Arm CPU targets will also go through this path. For the tests, though, we currently only test the Cortex-M target with the AOT Corstone-300 runner, which I think means we will always go down the CMSIS-NN path. I think you are probably right and we can merge this without a test, although you've identified that we are missing test coverage for these target schedules, so after this is merged I'll follow up with another PR that fixes that.
Depthwise Conv2D operations may consist of a convolution + an output operator, e.g. ReLU. This commit will:
* Apply vectorization across the inner channel loop when there is an output operator.
* Remove some unused variables in schedule_depthwise_conv2d_nhwc.
* Limit the loop splitting to 8 elements in the inner loop.
Force-pushed from 7278fa9 to 3946cf5.
Thanks for the fix @FranklandJack! I agree with the comments about testing; perhaps we can create a separate issue to look at improving the test coverage?
Cool, I've created #14759 to track this.
I'm sorry, I forgot about this PR. I'm OK with a separate issue. Let's merge this PR.