Add optimised 'Indirect BGEMM' binary convolution kernels. #516
What do these changes do?
This PR adds a new type of binary convolution kernel that uses an 'indirect' BGEMM algorithm, which doesn't require im2col. This is an adaptation of the algorithm introduced in the paper The Indirect Convolution Algorithm and used extensively in the XNNPACK library.
Only one BGEMM micro-kernel is included: a portable 4x2 kernel written in C++. However, this PR lays the groundwork for adding additional micro-kernels -- including hand-optimised and architecture-specific variations -- in the future. As such, the focus of this PR is not performance; the new kernel will be substantially slower than our existing highly-optimised im2col + BGEMM kernel, which will remain the default.
How Has This Been Tested?
CI. The non-CI 'big' kernel tests pass locally for AArch64 and Arm32.
Benchmark Results
Benchmarks aren't directly relevant because this PR does not change the default optimised kernel. Still, to give a rough idea, I ran QuickNet on my Raspberry Pi 4B with our three different kernel types:
Related issue number
N/A.