Add grouped binary convolution support (3/3): indirect BGEMM kernel. #551
Conversation
Awesome work! I am curious to see the benchmarks 🚀
So the gist of the benchmarks is that there are lovely speedups when using grouped convolutions, not much worse than the reduction in MACs. However, for non-grouped convolutions I'm recording a small but consistent slowdown, so I need to work out how to fix that before proceeding with this PR. I think it might require a bunch of templating to ensure that certain instructions are only included for actual grouped convolutions. I'll keep this as a draft until I can resolve those issues.
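For what it's worth, here is a minimal sketch (not the PR's actual code) of the kind of templating described above: turning grouping into a compile-time flag so that the extra per-group index arithmetic is only emitted for genuinely grouped convolutions. All names and the toy computation are assumptions for illustration only.

```cpp
#include <cstdint>

// Toy per-output-channel accumulation. When IsGrouped is false, the group
// bookkeeping below is compiled out entirely, so non-grouped convolutions
// pay no extra cost for the grouping support.
template <bool IsGrouped>
void AccumulateChannel(const float* weights, const float* input, float* output,
                       std::int32_t out_channel, std::int32_t output_channels,
                       std::int32_t input_depth, std::int32_t groups) {
  std::int32_t depth_per_group = input_depth;
  const float* group_input = input;
  if constexpr (IsGrouped) {
    depth_per_group = input_depth / groups;
    // Each output channel only reads its own group's slice of the input.
    const std::int32_t group = out_channel / (output_channels / groups);
    group_input += group * depth_per_group;
  }
  float acc = 0.0f;
  for (std::int32_t d = 0; d < depth_per_group; ++d) {
    acc += weights[out_channel * depth_per_group + d] * group_input[d];
  }
  output[out_channel] = acc;
}
```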
I've just pushed an update to this branch with the work I did to template the kernel functions, which I hoped would eliminate the performance regression for non-grouped convolutions. Unfortunately, it doesn't seem to have worked. Below is a table with all of the ops of the (non-grouped) QuickNetLarge model, running on my Raspberry Pi 4B using the indirect BGEMM kernels, comparing this PR against the baseline. Also included are the per-op absolute and relative differences in latency. The table is sorted by the absolute latency difference, in decreasing order. Note that the overall model latency of QuickNetLarge with this PR is 1-2 ms slower than with the baseline. The weird thing is that the biggest per-op differences came from six
Op breakdown
Add support for grouped binary convolutions to the optimised indirect BGEMM kernel.
I've rebased this PR on top of main, incorporating the TF 2.5 changes.

Assessing non-grouped performance

I'm not sure how, but the performance regression I previously noticed on the Raspberry Pi 4B has now disappeared! These are measurements taken on my Raspberry Pi 4B (using non-grouped QuickNet models), showing the average latency from 100 runs and the standard deviation. It looks like there might still be a slight slowdown for the larger models, but it's very small (previously I observed a 1-2 ms slowdown).
In addition, per-op profiling now shows no significant difference between the latencies of the

Assessing grouped performance

I generated a version of the

Here are benchmark results for both the whole model and just the binary convolutions (per-op benchmarks don't report standard deviations), running on the same Raspberry Pi 4B device:
There is a ~24% whole-model latency decrease and a ~41% binary convolution latency decrease. Note that it wouldn't be possible to actually speed up the binary convolutions by 2x, because we still have to do the same number of output transformations whether we use groups or not.
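As a rough illustration of that last point (with made-up layer sizes, not taken from QuickNet), the MAC count scales with 1/groups while the number of output transformations does not:

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>

int main() {
  // Hypothetical layer: 3x3 filter, 256 input channels, 256 output channels.
  const std::int64_t filter_size = 3 * 3;
  const std::int64_t input_depth = 256;
  const std::int64_t output_channels = 256;
  for (std::int64_t groups : {1, 2}) {
    // MACs per output pixel shrink by a factor of `groups`...
    const std::int64_t macs_per_pixel =
        filter_size * (input_depth / groups) * output_channels;
    // ...but one output transformation per output channel is needed
    // regardless of grouping.
    const std::int64_t output_transforms_per_pixel = output_channels;
    std::cout << "groups=" << groups << ": MACs/pixel=" << macs_per_pixel
              << ", output transforms/pixel=" << output_transforms_per_pixel
              << "\n";
  }
}
```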
From my end this is now good to go. For review, I'd recommend using the 'Hide whitespace changes' option (as often suggested by @CNugteren).
This is great 🚀
I am very happy to see that the performance regression resolved itself with the upgrade to TF 2.5.
const std::int32_t input_depth;
const std::int32_t output_channels;
const std::int32_t filter_size;
const std::int32_t groups;
const std::int32_t num_output_pixels;

std::vector<TBitpacked> packed_weights;
std::vector<const TBitpacked*> indirection_buffer;
std::vector<TBitpacked> zero_buffer;
I've bundled everything related to the runtime kernel into a struct called Kernel, which stores all the information it needs (instead of needing these various vectors to be stored directly in the op_data, which was a bit messy).
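For readers skimming the diff, a simplified sketch of what such a Kernel struct might look like is shown below. The members are taken from the excerpt above, but the template parameter and the constructor are assumptions, not the PR's actual definition.

```cpp
#include <cstdint>
#include <vector>

// `TBitpacked` stands in for the bitpacked storage type used by the real
// kernels; here it is just a template parameter for illustration.
template <typename TBitpacked>
struct Kernel {
  // Static shape information for the convolution.
  const std::int32_t input_depth;
  const std::int32_t output_channels;
  const std::int32_t filter_size;
  const std::int32_t groups;
  const std::int32_t num_output_pixels;

  // Buffers owned by the kernel itself rather than by the op_data.
  std::vector<TBitpacked> packed_weights;
  std::vector<const TBitpacked*> indirection_buffer;
  std::vector<TBitpacked> zero_buffer;

  Kernel(std::int32_t input_depth, std::int32_t output_channels,
         std::int32_t filter_size, std::int32_t groups,
         std::int32_t num_output_pixels)
      : input_depth(input_depth),
        output_channels(output_channels),
        filter_size(filter_size),
        groups(groups),
        num_output_pixels(num_output_pixels) {}
};
```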
/**
 * Pack the weights in the correct order. This procedure is (heavily)
 * adapted from the following XNNPack function:
 * https://github.com/google/XNNPACK/blob/80a8ac59849bfdae8d2e1409f5642baa502c0b9e/src/packing.c#L429-L484
 */
void PackWeights(const TBitpacked* weights_ptr) {
  const std::int32_t input_depth_per_group = input_depth / groups;
  const std::int32_t output_channels_per_group = output_channels / groups;
  const std::int32_t rounded_up_output_channels_per_group =
      block_size_output_channels *
      ((output_channels_per_group + block_size_output_channels - 1) /
       block_size_output_channels);
The functions PackWeights and FillIndirectionBuffer are now member functions, whereas previously they were defined in prepare.h.
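As a small worked example of the round-up in PackWeights above (the numbers are illustrative, not from any real model):

```cpp
#include <cstdint>

// Round x up to the next multiple of `block` using integer arithmetic, as in
// the rounded_up_output_channels_per_group computation above.
constexpr std::int32_t RoundUpToMultiple(std::int32_t x, std::int32_t block) {
  return block * ((x + block - 1) / block);
}

// E.g. 100 output channels per group with an output-channel block size of 8
// get padded up to 104: 8 * ((100 + 8 - 1) / 8) = 8 * 13 = 104.
static_assert(RoundUpToMultiple(100, 8) == 104, "");
```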
// To be implemented by concrete subclasses.
virtual void Run(const std::int32_t pixel_start, const std::int32_t pixel_end,
                 void* output_ptr) const = 0;

void Dispatch(void* output_ptr) const {
  // TODO: implement multithreading here.
  Run(0, num_output_pixels, output_ptr);
};
Run is a virtual function that is overridden in derived classes, and performs the computation on some subset of the input (a range of output pixels). Dispatch is what the op calls to run the kernel; it just calls Run for now, but in the future could use multiple threads, et cetera.
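A minimal sketch of how Dispatch could later split the output-pixel range across threads is shown below; the struct name, thread count parameter, and chunking strategy are hypothetical and not part of this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

struct KernelBase {
  std::int32_t num_output_pixels = 0;

  virtual ~KernelBase() = default;

  // Computes a contiguous range of output pixels; overridden per kernel.
  virtual void Run(std::int32_t pixel_start, std::int32_t pixel_end,
                   void* output_ptr) const = 0;

  // Splits the output pixels into roughly equal chunks, one per thread.
  void Dispatch(void* output_ptr, int num_threads) const {
    const std::int32_t chunk =
        (num_output_pixels + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      const std::int32_t start = t * chunk;
      const std::int32_t end = std::min(start + chunk, num_output_pixels);
      if (start >= end) break;
      workers.emplace_back(
          [this, start, end, output_ptr] { Run(start, end, output_ptr); });
    }
    for (auto& worker : workers) worker.join();
  }
};
```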
I've left a few comments; I haven't gone through everything in detail, as it is a lot of code. Also, I'm not really familiar with these kernels myself, so I think it would be good if someone else also has a look at it.
    filter_size * input_depth_per_group +
    /* padding */ block_size_output_channels * block_size_depth);
std::int32_t packed_weights_index = 0;
Here and elsewhere in this struct: is it really needed to specify std::int32_t, i.e. does it need to be 32 bits specifically? Typically we would use int if we just want a system-default integer. Also, if we do care so much, doesn't it make sense to use uint32_t for some of the counters (this would normally be a size_t)?
Another alternative is to use auto in more places to avoid (unnecessary) conversions or re-specifications of types.
Yeah, that's a fair point. I instinctively use int32_t by default for almost all C code I write because I think it's dumb that the default integer type could randomly be 16 bits on some targets. E.g. with Rust you almost always use explicit integer types: i8, i16, i32, i64, et cetera.
Agreed. So please do use int32_t if it is needed, but for most places where it doesn't matter that much (e.g. a counter from 0 to the filter width) I guess a regular int is better suited?
For what it's worth, I think the common thing to use in C++ for lengths of arrays and array indices is size_t, which is unsigned and pointer-sized, i.e. 64-bit on 64-bit systems, to ensure you can index exactly everything that you could potentially point to. An int is often 32-bit on 64-bit systems, so then you can't index anything outside of a 4 GB array.
In this particular case I don't think it matters.
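A tiny self-contained illustration of that point (not related to the PR code itself):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  // On a typical 64-bit platform: sizeof(int) == 4, sizeof(std::size_t) == 8.
  std::cout << "sizeof(int) = " << sizeof(int) << "\n";
  std::cout << "sizeof(std::size_t) = " << sizeof(std::size_t) << "\n";

  std::vector<char> v(16);
  // Using the container's size type (size_t) for indices avoids
  // signed/unsigned comparison warnings and can address arrays with more
  // elements than INT_MAX.
  for (std::size_t i = 0; i < v.size(); ++i) v[i] = 0;
}
```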
Looks like this currently breaks our make build using the manylinux2010 compiler, probably because it still uses
As a sanity check I also ran this code on my Android phone, comparing this PR against the existing indirect BGEMM on
What do these changes do?
This is the third of a group of PRs to add support for grouped binary convolutions. I've split the work into three PRs to make review easier.
This PR adds support for grouped convolutions to the optimised indirect BGEMM kernel.
How Has This Been Tested?
This functionality is tested by the kernel tests added in #550. End2End tests are not currently possible because the kernel enabled by default is the im2col + BGEMM kernel, which doesn't support grouped convolutions.
Benchmark Results
I'm going to leave this temporarily as a draft because I still need to run some benchmarks to a) work out how much of a speedup using grouped convolutions gives, and b) determine whether there's any impact on the performance of normal, non-grouped convolutions.
Related issue number
#549, #550.