
[TOPI] Vectorize depthwise conv2d output operator #14519

Merged 1 commit on May 3, 2023

Conversation

FranklandJack
Contributor

Depthwise Conv2D operations may consist of a convolution plus an output operator, e.g. ReLU. This commit will:

  • Apply vectorization across the inner channel loop when there is an output operator.
  • Remove some unused variables in schedule_depthwise_conv2d_nhwc.
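For context, the fused computation being scheduled can be sketched in NumPy (illustrative only, not TVM code; the function name and shapes here are made up for the example):

```python
import numpy as np

def depthwise_conv2d_nhwc_relu(data, kernel):
    """Depthwise conv2d (NHWC layout, stride 1, no padding) followed by a
    ReLU output operator. Illustrative NumPy sketch, not the TVM schedule."""
    n, h, w, c = data.shape
    kh, kw, kc = kernel.shape  # one 2D filter per channel
    assert kc == c
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((n, oh, ow, c), dtype=data.dtype)
    for i in range(kh):
        for j in range(kw):
            # each channel is convolved only with its own filter taps
            out += data[:, i:i + oh, j:j + ow, :] * kernel[i, j, :]
    return np.maximum(out, 0.0)  # the fused ReLU output operator

x = np.random.randn(1, 8, 8, 16).astype("float32")
k = np.random.randn(3, 3, 16).astype("float32")
y = depthwise_conv2d_nhwc_relu(x, k)
print(y.shape)  # (1, 6, 6, 16)
```

The schedule change in this PR vectorizes the inner channel loop of the final `np.maximum`-style output stage.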

@tvm-bot
Collaborator

tvm-bot commented Apr 6, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

  • No users to tag found in teams: topi See #10317 for details

Generated by tvm-bot

Contributor

@echuraev echuraev left a comment


Thank you for your PR. Should we add a new test case?

@@ -394,7 +394,8 @@ def schedule_conv_out(out):
        ci_outer, ci_inner = s[out].split(ci, 4)
        s[out].vectorize(ci_inner)
        s[out].unroll(ci_outer)
    else:
        s[out].vectorize(ci)
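In loop-transform terms, the split/vectorize/unroll in this hunk is roughly equivalent to the following pure-Python sketch (hypothetical helper for illustration, not generated code; assumes the channel extent is divisible by 4):

```python
import numpy as np

def relu_channels_split(out, C):
    """Sketch of split(ci, 4) + vectorize(ci_inner) + unroll(ci_outer)
    applied to a ReLU output stage over a channel loop of extent C."""
    for ci_outer in range(C // 4):       # this outer loop gets unrolled
        lo = ci_outer * 4
        # the inner loop of extent 4 becomes a single vector operation
        out[lo:lo + 4] = np.maximum(out[lo:lo + 4], 0.0)

x = np.array([-1.0, 2.0, -3.0, 4.0, 5.0, -6.0, 7.0, -8.0])
relu_channels_split(x, 8)
print(x)  # [0. 2. 0. 4. 5. 0. 7. 0.]
```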
Contributor


Shouldn't the value of ci be limited somehow? What if ci is huge? You can use the filter argument in define_split to prevent this situation.

Contributor Author


This is a good point. I'm not sure what would happen if ci were huge; I guess LLVM would legalize the vectors and loop over chunks the size of the hardware vector registers when storing 32-bit floats.

I mostly just copied this from above, though, using the same scheduling that we use for the actual convolution.

Since the fallback schedule uses a loop split where ci = 8, and for the hardware in question that is the number of 32-bit floats in a vector, I think we should be okay. In the case where we aren't using the fallback, i.e. auto-scheduling, I would hope the auto-scheduler can find an optimal choice for ci that is at least as good as 8, but perhaps there are scenarios where it could choose a bad value?

We could do something similar to how we handle regular convolutions and define a tunable parameter for vectorization, but that would potentially change the scheduling for the depthwise convolution itself, rather than its output, which is somewhat beyond the initial scope of this PR.

What are your thoughts?

Contributor


Thank you for your answer. I think that to avoid guesses about the size of ci (are you sure that it will be equal to 8?) you can just add a filter parameter to define_split, e.g.:

cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 32)

In this case I believe we can guarantee that ci won't be bigger than 32. What do you think?
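The effect of such a filter can be sketched in plain Python (this is an illustration of the pruning idea, not TVM's actual `ConfigSpace` internals):

```python
def split_candidates(extent, inner_limit):
    """All (outer, inner) factor pairs of `extent` whose inner factor,
    i.e. the vectorization width, does not exceed `inner_limit` --
    analogous to `filter=lambda entry: entry.size[1] <= inner_limit`."""
    pairs = [(extent // f, f) for f in range(1, extent + 1) if extent % f == 0]
    return [p for p in pairs if p[1] <= inner_limit]

# A channel extent of 512 would otherwise allow inner widths up to 512;
# the filter keeps only splits with a usable vectorization width.
print(split_candidates(512, 32))
# [(512, 1), (256, 2), (128, 4), (64, 8), (32, 16), (16, 32)]
```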

Contributor Author


I think we are already defining a split here though, this is where I got the value of 8 from.

Contributor


This split is only used for runs without tuning statistics (i.e. in the default fallback mode). I suggest restricting the search space in tuning by adding a filter value in define_split; it will help improve tuning time and drop useless configurations.
By the way, I reread your message and noticed that I was probably wrong to suggest restricting the vectorization size to 32. You wrote that it is storing 32-bit floats, which means that the maximum useful vectorization width is 8, am I right?

Contributor Author


Cool, thanks for the suggestion, I've added a filter on the split.

Yeah, I think you are almost correct: for NEON, where vectors are 128 bits and a float is 32 bits, there will be 128 / 32 = 4 elements processed in each vector operation. I think the backend will split the 8-element LLVM vectors into two 4-element vectors during legalization, so this isn't a problem.
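The arithmetic behind this legalization argument, as a back-of-envelope sketch (assuming 128-bit NEON vectors holding 32-bit floats):

```python
# Assumptions: 128-bit NEON vector registers, float32 elements.
VECTOR_BITS = 128
FLOAT_BITS = 32
lanes = VECTOR_BITS // FLOAT_BITS    # 4 float32 lanes per NEON vector
requested = 8                        # the inner split factor ci = 8
hw_vectors = -(-requested // lanes)  # ceil(8 / 4): one 8-wide LLVM vector
                                     # legalizes to 2 hardware vector ops
print(lanes, hw_vectors)  # 4 2
```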

@FranklandJack
Contributor Author

Thank you for your PR. Should we add a new test case?

Thanks for the speedy review :D

Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on tir?

@echuraev
Contributor

echuraev commented Apr 7, 2023

Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on tir?

I believe that a test in test_conv2d.py might be enough. You could probably also add a TIR or codegen test to check that vectorization was applied, but in that case the codegen test would need to call an existing schedule instead of writing a new one.

@FranklandJack
Contributor Author

Yeah, although I'm not sure where a test for something like this would live. We would need to test that vectorization happens as expected, so some kind of lit test based on tir?

I believe that a test in test_conv2d.py might be enough. You could probably also add a TIR or codegen test to check that vectorization was applied, but in that case the codegen test would need to call an existing schedule instead of writing a new one.

Sorry, I've been on annual leave and just got back to reviewing this. I took a look at test_conv2d.py, although wouldn't a more appropriate place for this be test_depthwise_conv2d.py? I think at present, though, we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?

@echuraev
Contributor

I think at present, though, we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?

I thought that this schedule is a common depthwise conv2d schedule for Arm, and it doesn't matter whether the target is Cortex-A or anything else, am I right? But anyway, I'm OK with merging this PR without a test, because it is pretty simple, and writing a good test that checks the generated TIR would probably require too much effort.

@FranklandJack
Contributor Author

I think at present, though, we don't do any testing for Arm Cortex-A targets (see here), so perhaps this is outside the scope of this PR?

I thought that this schedule is a common depthwise conv2d schedule for Arm, and it doesn't matter whether the target is Cortex-A or anything else, am I right? But anyway, I'm OK with merging this PR without a test, because it is pretty simple, and writing a good test that checks the generated TIR would probably require too much effort.

Sorry, yes, you are correct: other Arm CPU targets will also go through this path. However, in the tests we currently only cover the Cortex-M target with the AOT Corstone-300 runner, which I think means we will always take the CMSIS-NN path.

I think you are probably right and we can merge this without a test, although as you've identified, we are missing test coverage for these target schedules, so after this is merged I'll follow up with another PR that fixes that.

Depthwise Conv2D operations may consist of a convolution + an output
operator, e.g. Relu. This commit will:
* Apply vectorization across the inner channel loop when there is an
  output operator.
* Remove some unused variables in `schedule_depthwise_conv2d_nhwc`.
* Limit the loop splitting to 8 elements in the inner loop.
@FranklandJack FranklandJack force-pushed the depthwise-conv2d-out-vec branch from 7278fa9 to 3946cf5 on April 19, 2023
Contributor

@lhutton1 lhutton1 left a comment


Thanks for the fix @FranklandJack! I agree with the comments about testing, perhaps we can create a separate issue to look at improving the testing coverage?

@FranklandJack
Contributor Author

Thanks for the fix @FranklandJack! I agree with the comments about testing, perhaps we can create a separate issue to look at improving the testing coverage?

Cool, I've created #14759 to track this.

Contributor

@echuraev echuraev left a comment


I'm sorry, I forgot about this PR. I'm OK with a separate issue. Let's merge this PR.

@echuraev echuraev merged commit cc8cce5 into apache:main May 3, 2023