Rewrite integer lerp using intrinsics #6426
Conversation
```diff
- Expr divided = ((prod_sum / divisor) + prod_sum) / divisor;
+ Expr shift = Cast::make(UInt(2 * bits), bits);
+ Expr prod_sum = widening_mul(zero_val, inverse_typed_weight) + widening_mul(one_val, typed_weight);
+ Expr divided = rounding_shift_right(rounding_shift_right(prod_sum, shift) + prod_sum, shift);
```
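For readers following along, the two formulations in this diff can be modeled in plain Python (a sketch; the helper mirrors Halide's `rounding_shift_right`, and it assumes `prod_sum` is the usual 8-bit blend `zero_val*(255 - w) + one_val*w`, which never exceeds 255*255):

```python
def rounding_shift_right(x, n):
    # Model of Halide's rounding_shift_right: shift right, rounding to nearest.
    return (x + (1 << (n - 1))) >> n

def lerp8(zero_val, one_val, weight):
    # Widened blend: the weight and its complement sum to 255, so
    # prod_sum fits in 16 bits (max 255 * 255 = 65025).
    prod_sum = zero_val * (255 - weight) + one_val * weight
    # Rounding division by 255 via two rounding shifts by 8,
    # using x/255 == (x/256 + x)/256.
    return rounding_shift_right(rounding_shift_right(prod_sum, 8) + prod_sum, 8)
```

Because 255 is odd, `prod_sum / 255` never lands exactly on a half, so round-to-nearest coincides with `(prod_sum + 127) // 255`.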
This is equivalent to the original code, but I'm not convinced it's a good idea on x86. I bet you can do better with an average-round-up instruction.
It's a trick to do x/255 as (x/256 + x)/256
We should document that trick in a comment.
I might spend some time trying to figure out how to do that division by 2^n - 1 more cheaply, but that would be a follow-up PR.
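A quick way to sanity-check the trick (a sketch; `u16` models uint16_t wraparound, and each divide-by-256 is the rounding form `(x + 128) >> 8` that the lerp code uses):

```python
def u16(x):
    # Model uint16_t wraparound.
    return x & 0xFFFF

def div255_rounding(x):
    # x / 255 with rounding, via x/255 == (x/256 + x)/256,
    # computed in 16-bit arithmetic with rounding shifts by 8.
    t = u16(x + 128) >> 8
    return u16(t + x + 128) >> 8
```

This matches `(x + 127) // 255` for every x up to 255 * 255 = 65025 (the largest product sum the lerp can produce), but the intermediate wraps around for the largest uint16 values, so the code has to know those can't occur.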
OK yeah, division by 2^n-1...
Note that this is a special case of one of our unsigned division lowering methods. I was about to suggest just dividing and relying on that, but that only covers divisors up to 255.
Actually I could easily extend that table to include 65535 and 2^32 - 1 ...
btw the nasty thing about this trick is that it fails for the largest uint16_t, so you have to have some way to know that's not possible (as we do here).
On the other hand, x/255 == ((x + 1)/256 + x + 1)/256 seems to hold for all uint16 x.
So maybe the resolution here is teaching the arm backend this trick for division by 255
Edit: Except that it overflows for the largest uint16. Sigh.
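That variant is easy to brute-force over the whole uint16 range in exact arithmetic (a sketch), which confirms the identity itself is fine and only the 16-bit intermediates overflow:

```python
def div255_floor(x):
    # x / 255 == ((x + 1) / 256 + x + 1) / 256, in exact integer arithmetic.
    return ((x + 1) // 256 + x + 1) // 256
```

In exact arithmetic this equals `x // 255` for every 16-bit x; the trouble is that for x near 65535 the intermediate `(x + 1) // 256 + x + 1` no longer fits in 16 bits, so it can't be evaluated in uint16 registers as written.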
Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...
Here's a useful identity for uint16s:

```
cast<uint8_t>(x / 255) === let y = saturating_sub(x, 127) in cast<uint8_t>(rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7))
```

On ARM we currently lower x/255 as:

```asm
umull   v3.4s, v2.4h, v0.4h
umull2  v2.4s, v2.8h, v1.8h
ushr    v3.4s, v3.4s, #23
ushr    v2.4s, v2.4s, #23
xtn     v3.4h, v3.4s
xtn2    v3.8h, v2.4s
xtn     v2.8b, v3.8h
```

but using this identity we get:

```asm
uqsub   v1.8h, v1.8h, v0.8h
urshr   v2.8h, v1.8h, #8
uhadd   v1.8h, v2.8h, v1.8h
rshrn   v1.8b, v1.8h, #7
```

Nice.

Edit: nevermind, this also overflows in the addition. Dang.

Second edit: Fixed, by using an averaging op at the cost of an additional instruction. Now it looks a lot like method 2 of our division methods...
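The fixed identity can be checked exhaustively in a few lines (a sketch modeling the ARM semantics: `uqsub` saturates at 0, and `urshr`/`uhadd` compute their results without intermediate overflow):

```python
def saturating_sub(a, b):
    # uqsub: unsigned saturating subtract (clamps at 0).
    return max(a - b, 0)

def rounding_shift_right(x, n):
    # urshr: round-to-nearest right shift, no intermediate overflow.
    return (x + (1 << (n - 1))) >> n

def halving_add(a, b):
    # uhadd: (a + b) / 2 rounded down, no intermediate overflow.
    return (a + b) >> 1

def div255_u16(x):
    # cast<uint8_t>(x / 255) for uint16 x, per the identity above;
    # the final & 0xFF models the narrowing cast (rshrn).
    y = saturating_sub(x, 127)
    return rounding_shift_right(halving_add(rounding_shift_right(y, 8), y), 7) & 0xFF
```

Note that `halving_add` must round down here; the rounding variant (`urhadd`) would overshoot on odd sums.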
> Do any of us have a copy of the book "Hacker's Delight"? IIRC it has quite a long section on this sort of technique, but I last read it at a different employer ~20 years ago...
@steven-johnson - very late, I know, but I keep a copy on my desk. If it ever comes up again that we want to consult this book, let me know.
My overall conclusion is that we should merge this PR with a comment explaining the trick and a TODO that notes it's actually one instruction cheaper to do the division directly on x86.
Unrelated, but
I am going to submit this (tests failure is unrelated) and create an issue to track if there are any better ways to do x / 255 (so discussion in the comments is not lost).
This actually hurt lerp on x86, because rounding shift right is more expensive than (x + 128) / 256. Before:
After:
I'm changing the lowering on x86 anyway, but are there other targets that don't have a rounding_shift_right instruction?
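For context, `rounding_shift_right(x, 8)` computes exactly `(x + 128) >> 8`; the catch on x86, which has no vector rounding-shift instruction, is that the `+ 128` can overflow the lane, so the lowering has to be overflow-safe. One overflow-safe formulation (a sketch; it appears to match the `vpsrlw $7` / `vpand` pairs in the "After #6426" assembly) adds back the bit just below the cut instead of adding 2^(n-1) up front:

```python
def rounding_shift_right_safe(x, n):
    # Overflow-safe round-to-nearest shift: truncating shift, plus
    # the bit that sat just below the shifted-out boundary.
    return (x >> n) + ((x >> (n - 1)) & 1)
```

That costs an extra shift/mask/add compared with a plain add-then-shift when you already know `x + 128` can't overflow, which is the instruction-count difference described here.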
Saves instructions on x86.

Before #6426:

```asm
vpaddw  %ymm0, %ymm1, %ymm1
vpsrlw  $8, %ymm1, %ymm2
vpaddw  %ymm2, %ymm1, %ymm1
vpsrlw  $8, %ymm1, %ymm1
```

After #6426:

```asm
vpsrlw  $7, %ymm2, %ymm3
vpand   %ymm0, %ymm3, %ymm3
vpsrlw  $8, %ymm2, %ymm4
vpaddw  %ymm2, %ymm4, %ymm2
vpaddw  %ymm3, %ymm2, %ymm2
vpsrlw  $7, %ymm2, %ymm3
vpand   %ymm0, %ymm3, %ymm3
vpsrlw  $8, %ymm2, %ymm2
vpaddw  %ymm2, %ymm3, %ymm2
vpand   %ymm1, %ymm2, %ymm2
```

This PR:

```asm
vpaddw   %ymm0, %ymm3, %ymm3
vpmulhuw %ymm1, %ymm3, %ymm3
vpsrlw   $7, %ymm3, %ymm3
```
* Do target-specific lowering of lerp
* Target is a struct
Should be pretty much the same thing as before, but easier to read, and it may produce better code.