
Rewrite integer lerp using intrinsics #6426

Merged
merged 2 commits into master on Nov 19, 2021
Conversation

@vksnk (Member) commented Nov 17, 2021

Should be pretty much the same thing as before, but easier to read, and it may produce better code.

-Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+Expr shift = Cast::make(UInt(2 * bits), bits);
+Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
Member:
This is equivalent to the original code, but I'm not convinced it's a good idea on x86. I bet you can do better with an average-round-up instruction.

Contributor:
It's a trick to do x/255 as (x/256 + x)/256

We should document that trick in a comment.

@abadams (Member) commented Nov 17, 2021

I might spend some time trying to figure out how to do that division by 2^n - 1 more cheaply, but that would be a follow-up PR.

-Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+Expr shift = Cast::make(UInt(2 * bits), bits);
+Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
Contributor:
This looks odd. I agree it should be the same as the old code, but I think the inner rounding_shift_right expression will be negligible. @zvookin @abadams any idea what this was about? Is it maybe attempting to ensure we actually hit one_val when typed_weight is the max?

Member:
It's a trick to do x/255 as (x/256 + x)/256

Contributor:
OK yeah, division by 2^n-1...

Member:
Note that this is a special case of one of our unsigned division lowering methods. I was about to suggest just dividing and relying on that, but that only covers up to 255.

Member:
Actually I could easily extend that table to include 65535 and 2^32 - 1 ...

Member:
btw the nasty thing about this trick is that it fails for the largest uint16_t, so you have to have some way to know that's not possible (as we do here).

@abadams (Member) Nov 18, 2021:

On the other hand, x/255 == ((x + 1)/256 + x + 1)/256 seems to hold for all uint16 x.

So maybe the resolution here is teaching the ARM backend this trick for division by 255.

Edit: Except that it overflows for the largest uint16. Sigh.

Contributor:

Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

@abadams (Member) Nov 18, 2021:

Here's a useful identity for uint16s:

cast<uint8_t>(x / 255) === let y = saturating_sub(x, 127) in cast<uint8_t>(rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7))

On ARM we currently lower x/255 as:

	umull	v3.4s, v2.4h, v0.4h
	umull2	v2.4s, v2.8h, v1.8h
	ushr	v3.4s, v3.4s, #23
	ushr	v2.4s, v2.4s, #23
	xtn	v3.4h, v3.4s
	xtn2	v3.8h, v2.4s
	xtn	v2.8b, v3.8h

but using this identity we get:

	uqsub	v1.8h, v1.8h, v0.8h
	urshr	v2.8h, v1.8h, #8
	uhadd	v1.8h, v2.8h, v1.8h
	rshrn	v1.8b, v1.8h, #7

Nice.

Edit: nevermind, this also overflows in the addition. Dang.

Second edit: Fixed, by using an averaging op at the cost of an additional instruction. Now it looks a lot like method 2 of our division methods...

Member:
> Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...

@steven-johnson - very late, I know, but I keep a copy on my desk. If it ever comes up again that we want to consult this book, let me know.

@abadams (Member) commented Nov 18, 2021

My overall conclusion is that we should merge this PR with a comment explaining the trick and a TODO that notes it's actually one instruction cheaper to do the division directly on x86.

@steven-johnson (Contributor):

Unrelated, but correctness_custom_cuda_context has been failing on Windows for a while now; is anyone looking into it?

@vksnk (Member, Author) commented Nov 19, 2021

I am going to submit this (the test failure is unrelated) and create an issue to track whether there are better ways to do x / 255, so the discussion in these comments is not lost.

@vksnk vksnk merged commit c3040cb into master Nov 19, 2021
@abadams (Member) commented Nov 19, 2021

This actually hurt lerp on x86, because rounding shift right is more expensive than (x + 128) / 256.

Before:

	vpaddw	%ymm0, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm2
	vpaddw	%ymm2, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm1
	vextracti128	$1, %ymm1, %xmm2
	vpackuswb	%xmm2, %xmm1, %xmm1

After:

	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm4
	vpaddw	%ymm2, %ymm4, %ymm2
	vpaddw	%ymm3, %ymm2, %ymm2
	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm2
	vpaddw	%ymm2, %ymm3, %ymm2
	vpand	%ymm1, %ymm2, %ymm2
	vextracti128	$1, %ymm2, %xmm3
	vpackuswb	%xmm3, %xmm2, %xmm2

I'm changing the lowering on x86 anyway, but are there other targets that don't have a rounding_shift_right instruction?

abadams added a commit that referenced this pull request Nov 19, 2021
Saves instructions on x86.

Before #6426

	vpaddw	%ymm0, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm2
	vpaddw	%ymm2, %ymm1, %ymm1
	vpsrlw	$8, %ymm1, %ymm1

After #6426

	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm4
	vpaddw	%ymm2, %ymm4, %ymm2
	vpaddw	%ymm3, %ymm2, %ymm2
	vpsrlw	$7, %ymm2, %ymm3
	vpand	%ymm0, %ymm3, %ymm3
	vpsrlw	$8, %ymm2, %ymm2
	vpaddw	%ymm2, %ymm3, %ymm2
	vpand	%ymm1, %ymm2, %ymm2

This PR:
        vpaddw  %ymm0, %ymm3, %ymm3
        vpmulhuw        %ymm1, %ymm3, %ymm3
        vpsrlw  $7, %ymm3, %ymm3
abadams added a commit that referenced this pull request Nov 19, 2021
* Do target-specific lowering of lerp

Saves instructions on x86 (same message as the commit above).

* Target is a struct