
Rewrite integer lerp using intrinsics #6426

Merged
merged 2 commits into master on Nov 19, 2021
Conversation

@vksnk (Member) commented Nov 17, 2021

Should be pretty much the same thing as before, but easier to read, and it may produce better code.

-Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+Expr shift = Cast::make(UInt(2 * bits), bits);
+Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
Member:
This is equivalent to the original code, but I'm not convinced it's a good idea on x86. I bet you can do better with an average-round-up instruction.

Contributor:
It's a trick to do x/255 as (x/256 + x)/256

We should document that trick in a comment.

@abadams (Member) commented Nov 17, 2021

I might spend some time trying to figure out how to do that division by 2^n - 1 more cheaply, but that would be a follow-up PR.

-Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+Expr shift = Cast::make(UInt(2 * bits), bits);
+Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
Contributor:
This looks odd. I agree it should be the same as the old code, but I think the inner rounding_shift_right expression will be negligible. @zvookin @abadams any idea what this was about? Is it maybe attempting to ensure we actually hit one_val when typed_weight is the max?

Member:
It's a trick to do x/255 as (x/256 + x)/256

Contributor:
OK yeah, division by 2^n-1...

Member:
Note that this is a special case of one of our unsigned division lowering methods. I was about to suggest just dividing and relying on that, but that only covers up to 255.

Member:
Actually I could easily extend that table to include 65535 and 2^32 - 1 ...

Member:
btw the nasty thing about this trick is that it fails for the largest uint16_t, so you have to have some way to know that's not possible (as we do here).

@abadams (Member) Nov 18, 2021:

On the other hand, x/255 == ((x + 1)/256 + x + 1)/256 seems to hold for all uint16 x.

So maybe the resolution here is teaching the ARM backend this trick for division by 255.

Edit: Except that it overflows for the largest uint16. Sigh.

Contributor:

Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

@abadams (Member) Nov 18, 2021:

Here's a useful identity for uint16s:

cast<uint8_t>(x / 255) === let y = saturating_sub(x, 127) in cast<uint8_t>(rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7))

On ARM we currently lower x/255 as:

	umull	v3.4s, v2.4h, v0.4h
	umull2	v2.4s, v2.8h, v1.8h
	ushr	v3.4s, v3.4s, #23
	ushr	v2.4s, v2.4s, #23
	xtn	v3.4h, v3.4s
	xtn2	v3.8h, v2.4s
	xtn	v2.8b, v3.8h

but using this identity we get:

	uqsub	v1.8h, v1.8h, v0.8h
	urshr	v2.8h, v1.8h, #8
	uhadd	v1.8h, v2.8h, v1.8h
	rshrn	v1.8b, v1.8h, #7

Nice.

Edit: nevermind, this also overflows in the addition. Dang.

Second edit: Fixed, by using an averaging op at the cost of an additional instruction. Now it looks a lot like method 2 of our division methods...

Member:
> Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

@steven-johnson - very late, I know, but I keep a copy on my desk. If it ever comes up again that we want to consult this book, let me know.

@abadams (Member) commented Nov 18, 2021

My overall conclusion is that we should merge this PR with a comment explaining the trick and a TODO that notes it's actually one instruction cheaper to do the division directly on x86.

@steven-johnson (Contributor):

Unrelated, but correctness_custom_cuda_context has been failing on Windows for a while now; is anyone looking into it?

@vksnk (Member, Author) commented Nov 19, 2021

I am going to submit this (the test failure is unrelated) and create an issue to track whether there are better ways to do x / 255, so the discussion in these comments is not lost.

@vksnk vksnk merged commit c3040cb into master Nov 19, 2021
@abadams (Member) commented Nov 19, 2021

This actually hurt lerp on x86, because rounding shift right is more expensive than (x + 128) / 256.

Before:

	vpaddw	%ymm0, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm2
	vpaddw	%ymm2, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm1
	vextracti128	$1, %ymm1, %xmm2
	vpackuswb	%xmm2, %xmm1, %xmm1

After:

	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm4
	vpaddw	%ymm2, %ymm4, %ymm2
	vpaddw	%ymm3, %ymm2, %ymm2
	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm2
	vpaddw	%ymm2, %ymm3, %ymm2
	vpand	%ymm1, %ymm2, %ymm2
	vextracti128	$1, %ymm2, %xmm3
	vpackuswb	%xmm3, %xmm2, %xmm2

I'm changing the lowering on x86 anyway, but are there other targets that don't have a rounding_shift_right instruction?

abadams added a commit that referenced this pull request Nov 19, 2021
Saves instructions on x86.

Before #6426

	vpaddw	%ymm0, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm2
	vpaddw	%ymm2, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm1

After #6426

	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm4
	vpaddw	%ymm2, %ymm4, %ymm2
	vpaddw	%ymm3, %ymm2, %ymm2
	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm2
	vpaddw	%ymm2, %ymm3, %ymm2
	vpand	%ymm1, %ymm2, %ymm2

This PR:
        vpaddw  %ymm0, %ymm3, %ymm3
        vpmulhuw        %ymm1, %ymm3, %ymm3
        vpsrlw  $7, %ymm3, %ymm3
abadams added a commit that referenced this pull request Nov 19, 2021
* Do target-specific lowering of lerp

Saves instructions on x86 (same message as the commit above).

* Target is a struct