[TOPI] Vectorize depthwise conv2d output operator #14519

Merged: 1 commit, May 3, 2023
11 changes: 6 additions & 5 deletions python/tvm/topi/arm_cpu/depthwise_conv2d.py
@@ -292,13 +292,13 @@ def schedule_depthwise_conv2d_nhwc(cfg, outs):
     out = outs[0]

     ##### space definition begin #####
-    n, h, w, c = s[out].op.axis
+    _, h, w, c = s[out].op.axis
     # Split the number of input/output channels
-    cfg.define_split("tile_c", c, num_outputs=2)
+    cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 8)
     # Split the height of the convolution
-    _, hi = cfg.define_split("tile_h", h, num_outputs=2)
+    cfg.define_split("tile_h", h, num_outputs=2)
     # Split the width of the convolution
-    _, wi = cfg.define_split("tile_w", w, num_outputs=2)
+    cfg.define_split("tile_w", w, num_outputs=2)
     # Additional out (e.g., requantization, bias addition, etc..)
     # 0: locate the output on the second last axis of the main compuation
     # 1: locate the output closest to the main computation
@@ -394,7 +394,8 @@ def schedule_conv_out(out):
             ci_outer, ci_inner = s[out].split(ci, 4)
             s[out].vectorize(ci_inner)
             s[out].unroll(ci_outer)
+        else:
+            s[out].vectorize(ci)
Contributor:
Shouldn't the value of ci be limited somehow? What if ci is huge? You can use the filter argument in define_split to prevent this situation.

Contributor Author:
This is a good point. I'm not sure what would happen if ci were huge; I guess LLVM would legalize the vectors and loop over the hardware vector registers when storing 32-bit floats.

I sort of just copied this from above, though, using the same scheduling that we use for the actual convolution.

Since the fallback scheduling uses a loop split where ci = 8, and for the hardware in question this is the number of 32-bit floats in a vector, I think we should be okay. In the case where we aren't using the fallback, i.e. auto-scheduling, I would hope the auto-scheduler is able to find the optimal choice for ci, which should be at least as good as 8, but perhaps there are scenarios where it could choose a bad value?

We could do something similar to how we handle regular convolutions and define a tunable parameter for vectorization, but that would potentially change the scheduling for the depthwise convolution itself, rather than its output, which is sort of beyond the initial scope of this PR.

What are your thoughts?

Contributor:
Thank you for your answer. I think that to avoid guesses about the size of ci (are you sure that it will be equal to 8?) you can just add a filter parameter to define_split, e.g.:

cfg.define_split("tile_c", c, num_outputs=2, filter=lambda entry: entry.size[1] <= 32)

In this case I believe we can guarantee that ci won't be bigger than 32. What do you think?
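The pruning effect of such a filter can be illustrated with a plain-Python sketch (this is not the TVM API itself; `split_candidates` is a hypothetical helper that mimics how a `define_split` filter keeps only factorizations whose inner factor fits a bound):

```python
# Plain-Python sketch (not the TVM API) of how a define_split filter
# prunes candidate (outer, inner) factorizations of an axis extent.

def split_candidates(extent, max_inner):
    """Enumerate (outer, inner) pairs with outer * inner == extent,
    keeping only splits whose inner factor is at most max_inner."""
    pairs = [(extent // i, i) for i in range(1, extent + 1) if extent % i == 0]
    return [(o, i) for (o, i) in pairs if i <= max_inner]

# For a 64-channel axis, a filter like `entry.size[1] <= 8` keeps
# only the inner factors 1, 2, 4, and 8:
print(split_candidates(64, 8))  # [(64, 1), (32, 2), (16, 4), (8, 8)]
```

Every candidate the tuner would otherwise try with an oversized inner factor is dropped before tuning starts, which is why the filter both bounds the vector width and shrinks the search space.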

Contributor Author:
I think we are already defining a split here, though; this is where I got the value of 8 from.

Contributor:
This split will only apply to runs without tuning statistics (in default mode). I suggest restricting the search space in tuning by adding a filter value in define_split; it will help improve tuning time and drop useless configurations.
By the way, I reread your message and noticed that I was probably wrong to suggest restricting the vectorization size to 32. You wrote that it is storing 32-bit floats, which means the maximum possible capacity for vectorization is 8, am I right?

Contributor Author:
Cool, thanks for the suggestion; I've added a filter on the split.

Yeah, I think you are almost correct: for NEON, where vectors are 128 bits and floats are 32 bits, there will be 128 / 32 = 4 elements processed in each vector operation. I think the backend will split the 8-element LLVM vectors into two 4-element vectors during legalization, so this isn't a problem.

fused_n_ho = s[out].fuse(n, ho)
return hi, wi, fused_n_ho
