
Add grouped binary convolution support (3/3): indirect BGEMM kernel. #551

Merged
merged 6 commits into main on Jun 7, 2021

Conversation

AdamHillier (Contributor)

What do these changes do?

This is the third of a group of PRs to add support for grouped binary convolutions. I've split the work into three PRs to make review easier.

This PR adds support for grouped convolutions to the optimised indirect BGEMM kernel.
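For readers unfamiliar with grouped convolutions, here is a minimal sketch of the channel partitioning involved (illustrative only; the struct and function names below are not from this PR): with `groups` groups, each output channel only reads the input channels of its own group, so the binary MAC count drops by a factor of `groups`.

#include <cstdint>

// Illustrative helper: channel partitioning for a grouped convolution.
// Assumes `groups` divides both `input_depth` and `output_channels`.
struct GroupedConvShape {
  std::int32_t input_depth;      // total input channels
  std::int32_t output_channels;  // total output channels
  std::int32_t filter_size;      // filter_height * filter_width
  std::int32_t groups;

  std::int32_t InputDepthPerGroup() const { return input_depth / groups; }
  std::int32_t OutputChannelsPerGroup() const {
    return output_channels / groups;
  }

  // First input channel read by output channel `out_c`.
  std::int32_t FirstInputChannel(std::int32_t out_c) const {
    return (out_c / OutputChannelsPerGroup()) * InputDepthPerGroup();
  }

  // Binary MACs per output pixel: 1/groups of the non-grouped count.
  std::int64_t MacsPerOutputPixel() const {
    return static_cast<std::int64_t>(filter_size) * InputDepthPerGroup() *
           output_channels;
  }
};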

How Has This Been Tested?

This functionality is tested by the kernel tests added in #550. End-to-end tests are not currently possible because the kernel enabled by default is the im2col + BGEMM kernel, which doesn't support grouped convolutions.

Benchmark Results

I'm going to leave this temporarily as a draft because I still need to run some benchmarks to a) work out how much of a speedup grouped convolutions give, and b) check whether there's any impact on the performance of normal, non-grouped convolutions.

Related issue number

#549, #550.

@AdamHillier added the "feature" (New feature or request) label on Oct 23, 2020
@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 950b8ac to c1676a2 on November 5, 2020 00:58
Base automatically changed from grouped-convolutions-reference to master on November 6, 2020 14:34
@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch 3 times, most recently from 4b19e82 to 3846e33 on November 6, 2020 15:47
@lgeiger (Member) left a comment

Awesome work! I am curious to see the benchmarks 🚀

@AdamHillier (Contributor, Author)

Awesome work! I am curious to see the benchmarks 🚀

So the gist of the benchmarks is that grouped convolutions give lovely speedups, not much worse than the reduction in MACs would suggest.

However, for non-grouped convolutions I'm recording a small but consistent slowdown, so I need to work out how to fix that before proceeding with this PR. I think it might require some templating to ensure that certain instructions are only included for actual grouped convolutions.
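To illustrate the idea, a rough sketch of that kind of templating (not the actual kernel code; names are placeholders):

#include <cstdint>

// Rough sketch: with IsGrouped = false the group offset is a compile-time
// zero, so the compiler can remove the extra arithmetic entirely and the
// non-grouped kernels should be unaffected by grouping support.
template <bool IsGrouped>
inline std::int32_t InputChannelOffset(std::int32_t out_channel,
                                       std::int32_t output_channels_per_group,
                                       std::int32_t input_depth_per_group) {
  if (!IsGrouped) return 0;  // Single group: offset is always zero.
  const std::int32_t group = out_channel / output_channels_per_group;
  return group * input_depth_per_group;
}

// The kernel would then be instantiated for both cases and selected at
// runtime, e.g.:
//   groups > 1 ? RunKernel<true>(args...) : RunKernel<false>(args...);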

I'll keep this a draft until I can resolve those issues.

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 3846e33 to 910183a on November 10, 2020 17:44
@AdamHillier (Contributor, Author)

I've just pushed an update to this branch with the work I did to template the kernel functions, which I hoped would eliminate the performance regression for non-grouped convolutions. Unfortunately, it doesn't seem to have worked.

Below is a table with all of the ops of the (non-grouped) QuickNetLarge model, running on my Raspberry Pi 4B using the indirect BGEMM kernels, comparing this PR against the baseline. Also included are the per-op absolute and relative differences in latency. The table is sorted by the absolute latency difference, in decreasing order. Note that the overall model latency of QuickNetLarge with this PR is 1-2 ms slower than with the baseline.

The weird thing is that the biggest per-op differences come from six ADD ops, and those ADD ops together account for the entire 1-2 ms slowdown. I really don't know what to make of that: the model flatbuffer file I'm using is identical for the two runs, so the tensors are certainly the same shape, and there's absolutely no reason for this PR to affect the latency of the ADD ops. Conversely, looking at the rest of the ops it's clear that the LceBconv2d latency is approximately unchanged overall. It's very weird.

Op breakdown (per-op latencies in ms, sorted by absolute difference, decreasing)
Op  Baseline (ms)  PR (ms)  Difference (absolute, ms)  Difference (relative)
ADD 0.598 0.96 0.362 0.605
ADD 0.602 0.959 0.357 0.593
ADD 0.606 0.958 0.352 0.581
ADD 0.604 0.938 0.334 0.553
ADD 0.607 0.936 0.329 0.542
ADD 0.688 0.958 0.270 0.392
CONV_2D 2.023 2.079 0.056 0.028
LceBconv2d 1.66 1.702 0.042 0.025
DEPTHWISE_CONV_2D 0.893 0.931 0.038 0.043
LceQuantize 0.154 0.189 0.035 0.227
LceBconv2d 1.75 1.783 0.033 0.019
LceBconv2d 1.881 1.9 0.019 0.010
LceBconv2d 1.888 1.906 0.018 0.010
LceQuantize 0.137 0.154 0.017 0.124
LceQuantize 0.143 0.159 0.016 0.112
LceQuantize 0.133 0.148 0.015 0.113
LceBconv2d 1.658 1.672 0.014 0.008
LceQuantize 0.134 0.148 0.014 0.104
LceQuantize 0.141 0.152 0.011 0.078
LceBconv2d 1.938 1.948 0.010 0.005
LceQuantize 0.14 0.149 0.009 0.064
LceQuantize 0.081 0.089 0.008 0.099
LceQuantize 0.216 0.223 0.007 0.032
LceQuantize 0.044 0.051 0.007 0.159
LceBconv2d 1.659 1.666 0.007 0.004
LceBconv2d 1.64 1.646 0.006 0.004
LceBconv2d 1.641 1.647 0.006 0.004
LceBconv2d 1.644 1.649 0.005 0.003
LceQuantize 0.219 0.224 0.005 0.023
LceQuantize 0.217 0.221 0.004 0.018
LceBconv2d 1.64 1.644 0.004 0.002
LceBconv2d 1.648 1.652 0.004 0.002
LceBconv2d 1.654 1.658 0.004 0.002
LceBconv2d 1.668 1.671 0.003 0.002
LceBconv2d 1.69 1.693 0.003 0.002
LceBconv2d 1.649 1.652 0.003 0.002
FULLY_CONNECTED 0.672 0.674 0.002 0.003
DEPTHWISE_CONV_2D 0.105 0.106 0.001 0.010
LceQuantize 0.219 0.22 0.001 0.005
LceBconv2d 1.917 1.917 0.000 0.000
LceQuantize 0.222 0.222 0.000 0.000
LceBconv2d 1.644 1.644 0.000 0.000
RESHAPE 0.001 0.001 0.000 0.000
SOFTMAX 0.02 0.02 0.000 0.000
LceBconv2d 1.906 1.905 -0.001 -0.001
CONV_2D 1.666 1.665 -0.001 -0.001
LceBconv2d 1.654 1.653 -0.001 -0.001
LceBconv2d 1.654 1.653 -0.001 -0.001
LceQuantize 0.035 0.034 -0.001 -0.029
AVERAGE_POOL_2D 0.031 0.029 -0.002 -0.065
LceQuantize 0.019 0.017 -0.002 -0.105
LceQuantize 0.019 0.017 -0.002 -0.105
LceBconv2d 1.646 1.644 -0.002 -0.001
LceBconv2d 1.654 1.651 -0.003 -0.002
LceQuantize 0.02 0.017 -0.003 -0.150
LceQuantize 0.039 0.036 -0.003 -0.077
LceQuantize 0.021 0.018 -0.003 -0.143
LceBconv2d 1.693 1.69 -0.003 -0.002
MAX_POOL_2D 0.565 0.561 -0.004 -0.007
LceQuantize 0.043 0.039 -0.004 -0.093
LceQuantize 0.039 0.035 -0.004 -0.103
LceQuantize 0.039 0.035 -0.004 -0.103
LceQuantize 0.039 0.035 -0.004 -0.103
MAX_POOL_2D 1.451 1.447 -0.004 -0.003
LceQuantize 0.04 0.036 -0.004 -0.100
LceBconv2d 1.896 1.891 -0.005 -0.003
LceQuantize 0.044 0.039 -0.005 -0.114
LceQuantize 0.04 0.035 -0.005 -0.125
LceQuantize 0.04 0.035 -0.005 -0.125
LceQuantize 0.04 0.035 -0.005 -0.125
MAX_POOL_2D 0.244 0.239 -0.005 -0.020
LceQuantize 0.041 0.035 -0.006 -0.146
LceBconv2d 1.715 1.709 -0.006 -0.003
LceBconv2d 1.665 1.659 -0.006 -0.004
LceQuantize 0.456 0.45 -0.006 -0.013
ADD 0.084 0.076 -0.008 -0.095
DEPTHWISE_CONV_2D 0.237 0.227 -0.010 -0.042
ADD 0.075 0.064 -0.011 -0.147
ADD 0.053 0.042 -0.011 -0.208
ADD 0.056 0.044 -0.012 -0.214
ADD 0.078 0.065 -0.013 -0.167
ADD 0.077 0.064 -0.013 -0.169
ADD 0.054 0.041 -0.013 -0.241
LceQuantize 0.131 0.117 -0.014 -0.107
ADD 0.081 0.067 -0.014 -0.173
ADD 0.08 0.066 -0.014 -0.175
ADD 0.327 0.313 -0.014 -0.043
ADD 0.056 0.041 -0.015 -0.268
LceBconv2d 1.697 1.682 -0.015 -0.009
ADD 0.082 0.066 -0.016 -0.195
ADD 0.084 0.067 -0.017 -0.202
ADD 0.082 0.065 -0.017 -0.207
ADD 0.081 0.064 -0.017 -0.210
ADD 0.058 0.04 -0.018 -0.310
LceBconv2d 1.698 1.68 -0.018 -0.011
CONV_2D 1.226 1.208 -0.018 -0.015
LceBconv2d 1.67 1.651 -0.019 -0.011
ADD 0.086 0.067 -0.019 -0.221
ADD 0.084 0.065 -0.019 -0.226
ADD 0.24 0.218 -0.022 -0.092
CONV_2D 1.522 1.499 -0.023 -0.015
ADD 0.118 0.094 -0.024 -0.203
ADD 0.249 0.224 -0.025 -0.100
LceBconv2d 1.693 1.666 -0.027 -0.016
ADD 0.246 0.217 -0.029 -0.118
ADD 0.247 0.215 -0.032 -0.130
LceBconv2d 1.686 1.654 -0.032 -0.019
ADD 0.242 0.209 -0.033 -0.136
DEPTHWISE_CONV_2D 0.685 0.652 -0.033 -0.048
ADD 0.255 0.213 -0.042 -0.165
ADD 0.249 0.202 -0.047 -0.189
CONV_2D 4.243 4.078 -0.165 -0.039

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 910183a to f2f379f on November 11, 2020 16:13
@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from f2f379f to 471b3cb on June 4, 2021 09:37
Commit: Add support for grouped binary convolutions to the optimised indirect BGEMM kernel.
@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 471b3cb to 705dfbf on June 4, 2021 09:44
@AdamHillier (Contributor, Author)

I've rebased this PR on top of main, incorporating the TF 2.5 changes.

Assessing non-grouped performance

I'm not sure how, but now the performance regression I previously noticed on the Raspberry Pi 4B has disappeared!

These are measurements taken on my Raspberry Pi 4B (using non-grouped QuickNet models), showing the average latency over 100 runs with the standard deviation. It looks like there might still be a slight slowdown for the larger models, but it's very small (previously I observed a 1-2 ms slowdown).

Model Latency, indirect BGEMM baseline (ms) Latency, indirect BGEMM this PR (ms)
QuickNet Small 30.02 +- 0.19 29.75 +- 0.13
QuickNet 48.06 +- 0.06 48.37 +- 0.06
QuickNet Large 78.92 +- 0.07 79.36 +- 0.09

In addition, per-op profiling now shows no significant difference between the latencies of the ADD ops, which was the concerning thing I previously saw.

Assessing grouped performance

I generated a version of the QuickNet model used in the table above, but replaced every binary convolution with a grouped binary convolution with two groups. This means that the number of binary MACs is exactly half, giving a theoretical 2x speedup for the binary convolutions.

Here are benchmark results for both the whole model and just the binary convolutions (per-op benchmarks don't report std deviations), running on the same Raspberry Pi 4B device:

Model Latency, whole model (ms) Latency, binary convolutions (ms)
QuickNet (no groups) 48.37 +- 0.06 27.86
QuickNet (two groups) 36.51 +- 0.05 16.56

There is a ~24% whole-model latency decrease, and ~41% binary convolution latency decrease.

Note that it wouldn't be possible to actually speed up the binary convolutions by 2x, because we still have to do the same number of output transformations whether we use groups or not.
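As a back-of-the-envelope check on the numbers above (just arithmetic on the measured latencies, nothing from the kernel code):

#include <cstdio>

int main() {
  // Measured latencies from the table above, in milliseconds.
  const double binconv_baseline = 27.86, binconv_grouped = 16.56;
  const double model_baseline = 48.37, model_grouped = 36.51;

  // Binary convolutions: ~1.68x faster, i.e. below the theoretical 2x,
  // consistent with the unchanged output-transformation cost.
  std::printf("binary conv speedup: %.2fx\n",
              binconv_baseline / binconv_grouped);

  // Whole model: ~1.32x faster (~24% latency decrease), since the
  // non-binary ops are untouched.
  std::printf("whole model speedup: %.2fx\n", model_baseline / model_grouped);
  return 0;
}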

@AdamHillier (Contributor, Author)

From my end this is now good to go. For review, I'd recommend using the 'Hide whitespace changes' option (as often suggested by @CNugteren).

@AdamHillier marked this pull request as ready for review June 4, 2021 10:58
@AdamHillier requested a review from a team June 4, 2021 10:59
@lgeiger requested review from CNugteren and Tombana June 4, 2021 11:18
@lgeiger (Member) left a comment

This is great 🚀
I am very happy to see that the performance regression resolved itself with the upgrade to TF 2.5.

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from daae283 to 01c3286 on June 4, 2021 11:26
Comment on lines +21 to +29
const std::int32_t input_depth;
const std::int32_t output_channels;
const std::int32_t filter_size;
const std::int32_t groups;
const std::int32_t num_output_pixels;

std::vector<TBitpacked> packed_weights;
std::vector<const TBitpacked*> indirection_buffer;
std::vector<TBitpacked> zero_buffer;
Contributor Author

I've bundled everything related to the runtime kernel into a struct called Kernel, which stores all the information it needs (instead of these various vectors needing to be stored directly in the op_data, which was a bit messy).

Comment on lines 48 to 59
/**
* Pack the weights in the correct order. This procedure is (heavily)
* adapted from the following XNNPack function:
* https://github.com/google/XNNPACK/blob/80a8ac59849bfdae8d2e1409f5642baa502c0b9e/src/packing.c#L429-L484
*/
void PackWeights(const TBitpacked* weights_ptr) {
const std::int32_t input_depth_per_group = input_depth / groups;
const std::int32_t output_channels_per_group = output_channels / groups;
const std::int32_t rounded_up_output_channels_per_group =
block_size_output_channels *
((output_channels_per_group + block_size_output_channels - 1) /
block_size_output_channels);
Contributor Author

The functions PackWeights and FillIndirectionBuffer are now member functions, whereas previously they were defined in prepare.h.
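For concreteness, the rounding expression in the excerpt above is just a ceiling to the next multiple of the output-channel block size; a quick worked example (a block size of 8 is assumed purely for illustration):

#include <cstdint>

// Rounds `value` up to the next multiple of `block`, matching the
// (value + block - 1) / block pattern in the excerpt above.
constexpr std::int32_t RoundUp(std::int32_t value, std::int32_t block) {
  return block * ((value + block - 1) / block);
}

// With an illustrative block size of 8: a group with 30 output channels is
// padded up to 32, while exact multiples are left unchanged.
static_assert(RoundUp(30, 8) == 32, "30 channels pad up to 32");
static_assert(RoundUp(32, 8) == 32, "exact multiples are unchanged");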

Comment on lines +182 to +189
// To be implemented by concrete subclasses.
virtual void Run(const std::int32_t pixel_start, const std::int32_t pixel_end,
void* output_ptr) const = 0;

void Dispatch(void* output_ptr) const {
// TODO: implement multithreading here.
Run(0, num_output_pixels, output_ptr);
};
Contributor Author

Run is a virtual function that is overridden in derived classes; it performs the computation on some subset of the input (a range of output pixels).

Dispatch is what the op calls to run the kernel. For now it just calls Run, but in the future it could split the work across multiple threads, et cetera.
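Purely as an illustration of that future multithreading hook (an assumption of how it might look, not part of this PR), Dispatch could split the output-pixel range into chunks and call Run on each chunk from its own thread:

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch: invoke `run(pixel_start, pixel_end)` on contiguous
// chunks of [0, num_output_pixels), one chunk per thread. This assumes the
// kernel's Run writes disjoint output slices for disjoint pixel ranges and
// is therefore safe to call concurrently.
void ParallelDispatch(
    std::int32_t num_output_pixels, int num_threads,
    const std::function<void(std::int32_t, std::int32_t)>& run) {
  const std::int32_t chunk =
      (num_output_pixels + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::int32_t start = 0; start < num_output_pixels; start += chunk) {
    const std::int32_t end = std::min(start + chunk, num_output_pixels);
    workers.emplace_back([&run, start, end] { run(start, end); });
  }
  for (auto& worker : workers) worker.join();
}

// Inside a hypothetical Dispatch this might be called as:
//   ParallelDispatch(num_output_pixels, /*num_threads=*/4,
//                    [&](std::int32_t s, std::int32_t e) { Run(s, e, output_ptr); });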

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 01c3286 to e83bbe9 on June 4, 2021 12:13
@CNugteren (Contributor) left a comment

I've left a few comments, but I haven't gone through everything in detail as it is a lot of code. Also, I'm not really familiar with these kernels myself, so I think it would be good if someone else also has a look at it.

filter_size * input_depth_per_group +
/* padding */ block_size_output_channels *
block_size_depth);
std::int32_t packed_weights_index = 0;
Contributor

Here and elsewhere in this struct: is it really needed to specify std::int32_t, i.e. does it need to be 32-bits specifically? Typically we would use int if we just want a system-default integer. Also, if we do care so much, doesn't it make sense to use uint32_t for some of the counters (this would normally be a size_t)?

Another alternative is to use auto in more places to avoid (unnecessary) conversion or re-specifications of types.

Contributor Author

Yeah that's a fair point. I instinctively use int32_t by default for almost all C code I write because I think it's dumb that the default integer type could randomly be 16 bits on some targets. E.g. with Rust you almost always use explicit integer types i8, i16, i32, i64, et cetera.

Contributor

Agreed. So please do use int32_t if it is needed, but for most places where it doesn't matter that much (e.g. a counter from 0 till the filter width) I guess a regular int is better suited?

Collaborator

For what it's worth, I think the common thing in C++ for lengths of arrays and array indices is to use size_t, which is unsigned and pointer-sized (i.e. 64-bit on 64-bit systems), to ensure you can index everything you could potentially point to. An int is often 32-bit on 64-bit systems, so then you can't index anything outside of a 4 GB array.
In this particular case I don't think it matters.
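A tiny example of the distinction being discussed (an illustration only, not code from this PR):

#include <cstddef>
#include <vector>

void DoubleAll(std::vector<float>& values) {
  // std::size_t matches values.size() and can index any allocatable array.
  for (std::size_t i = 0; i < values.size(); ++i) {
    values[i] *= 2.0f;
  }
  // A plain `int` counter would be fine for small, bounded loops (e.g. over
  // a filter width), but could overflow for containers with more than
  // 2^31 - 1 elements on typical 64-bit targets.
}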

@lgeiger (Member) commented Jun 4, 2021

Looks like this currently breaks our make build using the manylinux2010 compiler, probably because it still uses --std=c++11 instead of --std=c++14 somewhere.

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from b4d3580 to c9c1d15 on June 4, 2021 14:44
@AdamHillier (Contributor, Author)

As a sanity check I also ran this code on my Android phone, comparing this PR against the existing indirect BGEMM kernels on main: there was no latency difference at all for non-grouped models. The direct BGEMM implementation was slightly slower, as expected.

@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 1b0b375 to cf25496 on June 5, 2021 22:08
@AdamHillier force-pushed the grouped-convolutions-indirect-bgemm branch from 2d0efe0 to 329ab02 on June 7, 2021 11:15
@AdamHillier merged commit 6e0f432 into main on Jun 7, 2021
@AdamHillier deleted the grouped-convolutions-indirect-bgemm branch on June 7, 2021 16:35