[TOPI] Vectorize depthwise conv2d output operator #14519
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
Thank you for your PR. Should we add a new test case?
```
@@ -394,7 +394,8 @@ def schedule_conv_out(out):
        ci_outer, ci_inner = s[out].split(ci, 4)
        s[out].vectorize(ci_inner)
        s[out].unroll(ci_outer)
    else:
        s[out].vectorize(ci)
```
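For context, here is a minimal, self-contained sketch of the pattern this hunk applies. Everything except the `ci`/`ci_outer`/`ci_inner` names is hypothetical (dummy shapes and a stub "conv" stage rather than the real depthwise convolution):

```python
import tvm
from tvm import te, topi

# Hypothetical NHWC tensor; the real schedule works on the depthwise
# conv2d output stage, stubbed here with a dummy elementwise "conv".
n, h, w, c = 1, 16, 16, 32
data = te.placeholder((n, h, w, c), name="data", dtype="float32")
conv = te.compute((n, h, w, c),
                  lambda nn, hh, ww, cc: data[nn, hh, ww, cc] * 2.0,
                  name="conv")
out = topi.nn.relu(conv)  # the fused output operator, e.g. ReLU

s = te.create_schedule(out.op)
_, _, _, ci = s[out].op.axis
# Same pattern as the hunk: split the channel axis by the vector width
# (4 x float32 = one 128-bit Neon vector), vectorize the inner part and
# unroll the outer part.
ci_outer, ci_inner = s[out].split(ci, 4)
s[out].vectorize(ci_inner)
s[out].unroll(ci_outer)
print(tvm.lower(s, [data, out], simple_mode=True))
```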
Shouldn't the value of ci be limited somehow? What if the value of ci is huge? You could use the filter argument in define_split to prevent this situation.
This is a good point. I'm not sure what would happen if ci was huge; I guess LLVM would legalize the vectors and loop over the size of the hardware vector registers when storing 32-bit floats. I sort of just copied this from above, though, using the same scheduling that we use for the actual convolution. Since the fallback scheduling uses a loop split where ci = 8, and for the hardware in question this is the number of 32-bit floats in a vector, I think we should be okay. In the case where we aren't using the fallback, i.e. auto-scheduling, I would hope the auto-scheduler is able to find the optimal choice for ci, which should be at least as good as 8, but perhaps there are scenarios where it could choose a bad value?
We could do something similar to how we handle regular convolutions, and define a tunable parameter for vectorization, but that would potentially change the scheduling for the depthwise convolution here, rather than its output, which is sort of beyond the initial scope of this PR.
What are your thoughts?
Thank you for your answer. I think that to avoid guesses about the size of ci (are you sure that it will be equal to 8?) you can just add a filter parameter to define_split, e.g.:

cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 32)

In this case I believe we can guarantee that ci won't be bigger than 32. What do you think?
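To make the effect concrete, here is a small stand-alone sketch of such a filter (hypothetical, not the real depthwise conv2d template; it uses a bound of 8 rather than 32 and a channel length of 32 purely for illustration):

```python
from tvm.autotvm.task.space import ConfigSpace

# The filter prunes split candidates whose inner factor, entry.size[1],
# exceeds the desired vector width.
cfg = ConfigSpace()
c = ConfigSpace.axis(32)  # a channel axis of length 32
cfg.define_split("tile_c", c, num_outputs=2,
                 filter=lambda entry: entry.size[1] <= 8)
# Of the factor pairs of 32, only those with inner factor 1, 2, 4 or 8
# survive: (32, 1), (16, 2), (8, 4) and (4, 8).
print(cfg.space_map["tile_c"])
```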
I think we are already defining a split here, though; this is where I got the value of 8 from.
That split will only be used for runs without tuning statistics (i.e. in the default fallback mode). I suggest restricting the search space in tuning by adding a filter value in define_split; it will help to improve tuning time and drop useless configurations.

By the way, I reread your message and noticed that I was probably wrong when I suggested restricting the vectorization size to 32. You wrote that it is storing 32-bit floats, which means that the maximum useful vectorization width is 8, am I right?
Cool, thanks for the suggestion, I've added a filter on the split.

Yeah, I think you are almost correct: for Neon, where vectors are 128 bits and a float is 32 bits, there will be 128 / 32 = 4 elements processed in each vector operation. I think the backend will split the 8-element LLVM vectors into two 4-element vectors during legalization, so this isn't a problem.
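A quick way to see what LLVM is handed is to lower an 8-wide vectorized loop and look at the TIR; this sketch uses hypothetical shapes and names:

```python
import tvm
from tvm import te

# 1-D example: split by 8 and vectorize the inner loop.
a = te.placeholder((64,), dtype="float32", name="a")
b = te.compute((64,), lambda i: a[i] + 1.0, name="b")
s = te.create_schedule(b.op)
io, ii = s[b].split(b.op.axis[0], factor=8)
s[b].vectorize(ii)
# The printed TIR contains float32x8 loads/stores; when targeting
# 128-bit Neon, the LLVM backend legalizes each of them into two
# 128 / 32 = 4-wide operations.
print(tvm.lower(s, [a, b], simple_mode=True))
```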
Thanks for the speedy review :D Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on TIR?
I believe that a test in test_conv2d.py might be enough. You could probably also add a TIR or codegen test to check that vectorization was applied, but in that case the codegen test would need to call an existing schedule instead of writing a new one.
Sorry, I've been on annual leave and just got back to reviewing this. I took a look at test_conv2d.py, although wouldn't a more appropriate place for this be test_depthwise_conv2d.py? I think at present we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?
I thought that this schedule is a common depthwise conv2d schedule for Arm, and it doesn't matter whether the target is Cortex-A or anything else, am I right? But anyway, I'm OK if we merge this PR without a test, because it is pretty simple, and writing a good test that checks the generated TIR would probably require too much effort.
Sorry, yes, you are correct, other Arm CPU targets will also go through this path. For the tests, though, we currently only test the Cortex-M target with the AOT Corstone-300 runner, which I think means we will always go down the CMSIS-NN path. I think you are probably right and we can merge this without a test, although you've identified that we are missing test coverage for these target schedules, so after this is merged I'll follow up with another PR that fixes that.
Depthwise Conv2D operations may consist of a convolution + an output operator, e.g. ReLU. This commit will:
* Apply vectorization across the inner channel loop when there is an output operator.
* Remove some unused variables in schedule_depthwise_conv2d_nhwc.
* Limit the loop splitting to 8 elements in the inner loop.
Force-pushed from 7278fa9 to 3946cf5.
Thanks for the fix @FranklandJack! I agree with the comments about testing; perhaps we can create a separate issue to look at improving the test coverage?
Cool, I've created #14759 to track this.
I'm sorry, I forgot about this PR. I'm OK with a separate issue. Let's merge this PR.