Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite integer lerp using intrinsics #6426

Merged
merged 2 commits into from
Nov 19, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 5 additions & 11 deletions src/Lerp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -134,17 +134,11 @@ Expr lower_lerp(Expr zero_val, Expr one_val, const Expr &weight) {
case 8:
case 16:
case 32: {
Expr zero_expand = Cast::make(UInt(2 * bits, computation_type.lanes()),
zero_val);
Expr one_expand = Cast::make(UInt(2 * bits, one_val.type().lanes()),
one_val);

Expr rounding = Cast::make(UInt(2 * bits), 1) << Cast::make(UInt(2 * bits), (bits - 1));
Expr divisor = Cast::make(UInt(2 * bits), 1) << Cast::make(UInt(2 * bits), bits);

Expr prod_sum = zero_expand * inverse_typed_weight +
one_expand * typed_weight + rounding;
Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
Expr shift = Cast::make(UInt(2 * bits), bits);
Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
// Computes x / (2 ** N - 1) as (x / 2 ** N + x) / 2 ** N.
// TODO: on x86 it's actually one instruction cheaper to do the division directly.
Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to the original code, but I'm not convinced it's a good idea on x86. I bet you can do better with an average-round-up instruction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a trick to do x/255 as (x/256 + x)/256

We should document that trick in a comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks odd. I agree it should be the same as the old code, but I think the inner rounding_shift_right expression will be negligible. @zvookin @abadams any idea what this was about? Is it maybe attempting to ensure we actually hit one_val when typed_weight is the max?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a trick to do x/255 as (x/256 + x)/256

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK yeah, division by 2^n-1...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is a special case of one of our unsigned division lowering methods. I was about to suggest just dividing and relying on that, but that only covers up to 255

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I could easily extend that table to include 65535 and 2^32 - 1 ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw the nasty thing about this trick is that it fails for the largest uint16_t, so you have to have some way to know that's not possible (as we do here).

Copy link
Member

@abadams abadams Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, x/255 == ((x + 1)/256 + x + 1)/256 seems to hold for all uint16 x.

So maybe the resolution here is teaching the arm backend this trick for division by 255

Edit: Except that it overflows for the largest uint16. Sigh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

Copy link
Member

@abadams abadams Nov 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a useful identity for uint16s:

cast<uint8_t>(x / 255) === let y = saturating_sub(x, 127) in cast<uint8_t>(rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7))

On ARM we currently lower x/255 as:

	umull	v3.4s, v2.4h, v0.4h
	umull2	v2.4s, v2.8h, v1.8h
	ushr	v3.4s, v3.4s, #23
	ushr	v2.4s, v2.4s, #23
	xtn	v3.4h, v3.4s
	xtn2	v3.8h, v2.4s
	xtn	v2.8b, v3.8h

but using this identity we get:

	uqsub	v1.8h, v1.8h, v0.8h
	urshr	v2.8h, v1.8h, #8
	uhadd	v1.8h, v2.8h, v1.8h
	rshrn	v1.8b, v1.8h, #7

Nice.

Edit: nevermind, this also overflows in the addition. Dang.

Second edit: Fixed, by using an averaging op at the cost of an additional instruction. Now it looks a lot like method 2 of our division methods...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

@steven-johnson - very late, I know, but I keep a copy on my desk. If it ever comes up again that we want to consult this book, let me know.


result = Cast::make(UInt(bits, computation_type.lanes()), divided);
break;
Expand Down