Add optimised 'Indirect BGEMM' binary convolution kernels. #516
What do these changes do?
This PR adds a new type of binary convolution kernel that uses an 'indirect' BGEMM algorithm, which doesn't require im2col. This is an adaptation of the algorithm introduced in the paper The Indirect Convolution Algorithm and used extensively in the XNNPACK library.
Only one BGEMM micro-kernel is included: a portable 4x2 kernel written in C++. However, this PR lays the groundwork for adding additional micro-kernels -- including hand-optimised and architecture-specific variations -- in the future. As such, the focus of this PR is not performance; the new kernel will be substantially slower than our existing highly-optimised im2col + BGEMM kernel, which will remain the default.
How Has This Been Tested?
CI. The non-CI 'big' kernel tests pass locally for AArch64 and Arm32.
Benchmark Results
Benchmarks aren't directly relevant because this PR does not change the default optimised kernel. Still, to give a rough idea, I ran QuickNet on my Raspberry Pi 4B with our three different kernel types:
Related issue number
N/A.