Add grouped binary convolution support (3/3): indirect BGEMM kernel. #551
Conversation
Awesome work! I am curious to see the benchmarks 🚀
So the gist of the benchmarks is that there are lovely speedups when using grouped convolutions, not much worse than the reduction in MACs. However, for non-grouped convolutions I'm recording a small but consistent slowdown, so I need to work out how to fix that before proceeding with this PR. I think it might require a bunch of templating to ensure that certain instructions are only included for actual grouped convolutions. I'll keep this as a draft until I can resolve those issues.
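For what it's worth, here is a minimal sketch (not the PR's actual code) of the kind of templating described above: turning grouping into a compile-time flag so that the extra per-group index arithmetic is only emitted for genuinely grouped convolutions. All names and the toy computation are assumptions for illustration only.

```cpp
#include <cstdint>

// Toy per-output-channel accumulation. When IsGrouped is false, the group
// bookkeeping below is compiled out entirely, so non-grouped convolutions
// pay no extra cost for the grouping support.
template <bool IsGrouped>
void AccumulateChannel(const float* weights, const float* input, float* output,
                       std::int32_t out_channel, std::int32_t output_channels,
                       std::int32_t input_depth, std::int32_t groups) {
  std::int32_t depth_per_group = input_depth;
  const float* group_input = input;
  if constexpr (IsGrouped) {
    depth_per_group = input_depth / groups;
    // Each output channel only reads its own group's slice of the input.
    const std::int32_t group = out_channel / (output_channels / groups);
    group_input += group * depth_per_group;
  }
  float acc = 0.0f;
  for (std::int32_t d = 0; d < depth_per_group; ++d) {
    acc += weights[out_channel * depth_per_group + d] * group_input[d];
  }
  output[out_channel] = acc;
}
```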
I've just pushed an update to this branch with the work I did to template the kernel functions, which I hoped would eliminate the performance regression for non-grouped convolutions. Unfortunately, it doesn't seem to have worked. Below is a table with all of the ops of the (non-grouped) QuickNetLarge model, running on my Raspberry Pi 4B using the indirect BGEMM kernels, comparing this PR against the baseline. Also included are the per-op absolute and relative differences in latency. The table is sorted by the absolute latency difference, in decreasing order. Note that the overall model latency of QuickNetLarge with this PR is 1-2 ms slower than with the baseline. The weird thing is that the biggest per-op differences came from six
Op breakdown
Add support for grouped binary convolutions to the optimised indirect BGEMM kernel.
I've rebased this PR on top of main, incorporating the TF 2.5 changes.

Assessing non-grouped performance

I'm not sure how, but the performance regression I previously noticed on the Raspberry Pi 4B has now disappeared! These are measurements taken on my Raspberry Pi 4B (using non-grouped QuickNet models), showing the average latency from 100 runs and the standard deviation. It looks like there might still be a slight slowdown for the larger models, but it's very small (previously I observed a 1-2 ms slowdown).
In addition, per-op profiling now shows no significant difference between the latencies of the

Assessing grouped performance

I generated a version of the

Here are benchmark results for both the whole model and just the binary convolutions (per-op benchmarks don't report standard deviations), running on the same Raspberry Pi 4B device:
There is a ~24% whole-model latency decrease and a ~41% binary convolution latency decrease. Note that it wouldn't be possible to actually speed up the binary convolutions by 2x, because we still have to do the same number of output transformations whether we use groups or not.
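As a rough illustration of that last point (with made-up layer sizes, not taken from QuickNet), the MAC count scales with 1/groups while the number of output transformations does not:

```cpp
#include <cstdint>
#include <initializer_list>
#include <iostream>

int main() {
  // Hypothetical layer: 3x3 filter, 256 input channels, 256 output channels.
  const std::int64_t filter_size = 3 * 3;
  const std::int64_t input_depth = 256;
  const std::int64_t output_channels = 256;
  for (std::int64_t groups : {1, 2}) {
    // MACs per output pixel shrink by a factor of `groups`...
    const std::int64_t macs_per_pixel =
        filter_size * (input_depth / groups) * output_channels;
    // ...but one output transformation per output channel is needed
    // regardless of grouping.
    const std::int64_t output_transforms_per_pixel = output_channels;
    std::cout << "groups=" << groups << ": MACs/pixel=" << macs_per_pixel
              << ", output transforms/pixel=" << output_transforms_per_pixel
              << "\n";
  }
}
```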
From my end this is now good to go. For review, I'd recommend using the 'Hide whitespace changes' option (as often suggested by @CNugteren).
This is great 🚀
I am very happy to see that the performance regression resolved itself with the upgrade to TF 2.5.
const std::int32_t input_depth;
const std::int32_t output_channels;
const std::int32_t filter_size;
const std::int32_t groups;
const std::int32_t num_output_pixels;

std::vector<TBitpacked> packed_weights;
std::vector<const TBitpacked*> indirection_buffer;
std::vector<TBitpacked> zero_buffer;
I've bundled everything related to the runtime kernel into a struct called Kernel, which stores all the information it needs (instead of needing these various vectors to be stored directly in the op_data, which was a bit messy).
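For readers skimming the diff, a simplified sketch of what such a Kernel struct might look like is shown below. The members are taken from the excerpt above, but the template parameter and the constructor are assumptions, not the PR's actual definition.

```cpp
#include <cstdint>
#include <vector>

// `TBitpacked` stands in for the bitpacked storage type used by the real
// kernels; here it is just a template parameter for illustration.
template <typename TBitpacked>
struct Kernel {
  // Static shape information for the convolution.
  const std::int32_t input_depth;
  const std::int32_t output_channels;
  const std::int32_t filter_size;
  const std::int32_t groups;
  const std::int32_t num_output_pixels;

  // Buffers owned by the kernel itself rather than by the op_data.
  std::vector<TBitpacked> packed_weights;
  std::vector<const TBitpacked*> indirection_buffer;
  std::vector<TBitpacked> zero_buffer;

  Kernel(std::int32_t input_depth, std::int32_t output_channels,
         std::int32_t filter_size, std::int32_t groups,
         std::int32_t num_output_pixels)
      : input_depth(input_depth),
        output_channels(output_channels),
        filter_size(filter_size),
        groups(groups),
        num_output_pixels(num_output_pixels) {}
};
```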
/**
 * Pack the weights in the correct order. This procedure is (heavily)
 * adapted from the following XNNPack function:
 * https://github.com/google/XNNPACK/blob/80a8ac59849bfdae8d2e1409f5642baa502c0b9e/src/packing.c#L429-L484
 */
void PackWeights(const TBitpacked* weights_ptr) {
  const std::int32_t input_depth_per_group = input_depth / groups;
  const std::int32_t output_channels_per_group = output_channels / groups;
  const std::int32_t rounded_up_output_channels_per_group =
      block_size_output_channels *
      ((output_channels_per_group + block_size_output_channels - 1) /
       block_size_output_channels);
The functions PackWeights and FillIndirectionBuffer are now member functions, whereas previously they were defined in prepare.h.
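As a small worked example of the round-up in PackWeights above (the numbers are illustrative, not from any real model):

```cpp
#include <cstdint>

// Round x up to the next multiple of `block` using integer arithmetic, as in
// the rounded_up_output_channels_per_group computation above.
constexpr std::int32_t RoundUpToMultiple(std::int32_t x, std::int32_t block) {
  return block * ((x + block - 1) / block);
}

// E.g. 100 output channels per group with an output-channel block size of 8
// get padded up to 104: 8 * ((100 + 8 - 1) / 8) = 8 * 13 = 104.
static_assert(RoundUpToMultiple(100, 8) == 104, "");
```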
// To be implemented by concrete subclasses.
virtual void Run(const std::int32_t pixel_start, const std::int32_t pixel_end,
                 void* output_ptr) const = 0;

void Dispatch(void* output_ptr) const {
  // TODO: implement multithreading here.
  Run(0, num_output_pixels, output_ptr);
};
Run is a virtual function that is overridden in derived classes, and performs the computation on some subset of the input (a range of output pixels). Dispatch is what the op calls to run the kernel; it just calls Run for now, but in the future could use multiple threads, et cetera.
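A minimal sketch of how Dispatch could later split the output-pixel range across threads is shown below; the struct name, thread count parameter, and chunking strategy are hypothetical and not part of this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

struct KernelBase {
  std::int32_t num_output_pixels = 0;

  virtual ~KernelBase() = default;

  // Computes a contiguous range of output pixels; overridden per kernel.
  virtual void Run(std::int32_t pixel_start, std::int32_t pixel_end,
                   void* output_ptr) const = 0;

  // Splits the output pixels into roughly equal chunks, one per thread.
  void Dispatch(void* output_ptr, int num_threads) const {
    const std::int32_t chunk =
        (num_output_pixels + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
      const std::int32_t start = t * chunk;
      const std::int32_t end = std::min(start + chunk, num_output_pixels);
      if (start >= end) break;
      workers.emplace_back(
          [this, start, end, output_ptr] { Run(start, end, output_ptr); });
    }
    for (auto& worker : workers) worker.join();
  }
};
```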
I've left a few comments; I haven't gone through everything in detail, as it is a lot of code. Also, I'm not really familiar with these kernels myself, so I think it would be good if someone else also has a look at it.
    filter_size * input_depth_per_group +
    /* padding */ block_size_output_channels * block_size_depth);
std::int32_t packed_weights_index = 0;
Here and elsewhere in this struct: is it really needed to specify std::int32_t, i.e. does it need to be 32 bits specifically? Typically we would use int if we just want a system-default integer. Also, if we do care so much, doesn't it make sense to use uint32_t for some of the counters (this would normally be a size_t)?
Another alternative is to use auto in more places to avoid (unnecessary) conversions or re-specifications of types.
Yeah, that's a fair point. I instinctively use int32_t by default for almost all C code I write because I think it's dumb that the default integer type could randomly be 16 bits on some targets. E.g. with Rust you almost always use explicit integer types: i8, i16, i32, i64, et cetera.
Agreed. So please do use int32_t if it is needed, but for most places where it doesn't matter that much (e.g. a counter from 0 to the filter width) I guess a regular int is better suited?
For what it's worth, I think the common thing to use in C++ for lengths of arrays and array indices is size_t, which is unsigned and pointer-sized, i.e. 64-bit on 64-bit systems, to ensure you can index exactly everything that you could potentially point to. An int is often 32-bit on 64-bit systems, so then you can't index anything outside of a 4 GB array.
In this particular case I don't think it matters.
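A tiny self-contained illustration of that point (not related to the PR code itself):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  // On a typical 64-bit platform: sizeof(int) == 4, sizeof(std::size_t) == 8.
  std::cout << "sizeof(int) = " << sizeof(int) << "\n";
  std::cout << "sizeof(std::size_t) = " << sizeof(std::size_t) << "\n";

  std::vector<char> v(16);
  // Using the container's size type (size_t) for indices avoids
  // signed/unsigned comparison warnings and can address arrays with more
  // elements than INT_MAX.
  for (std::size_t i = 0; i < v.size(); ++i) v[i] = 0;
}
```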
Looks like this currently breaks our make build using the manylinux2010 compiler, probably because it still uses
As a sanity check I also ran this code on my Android phone, comparing this PR against the existing indirect BGEMM on
What do these changes do?
This is the third of a group of PRs to add support for grouped binary convolutions. I've split the work into three PRs to make review easier.
This PR adds support for grouped convolutions to the optimised indirect BGEMM kernel.
How Has This Been Tested?
This functionality is tested by the kernel tests added in #550. End2End tests are not currently possible because the kernel enabled by default is the im2col + BGEMM kernel, which doesn't support grouped convolutions.
Benchmark Results
I'm going to leave this temporarily as a draft because I still need to run some benchmarks to a) work out how much of a speedup using grouped convolutions gives, and b) determine whether there's any impact on the performance of normal, non-grouped convolutions.
Related issue number
#549, #550.