
Let lerp lowering incorporate a final cast. #6480

Merged: 4 commits, Dec 10, 2021

Conversation

@abadams (Member) commented Dec 7, 2021

This lets lerp lowering save a few instructions on x86 and ARM, and probably on other CPU targets.

`cast(UInt(16), lerp(some_u8s))` produces the following code before and after this PR:

Before:

x86:

vmovdqu	(%r15,%r13), %xmm4
vpmovzxbw	-2(%r15,%r13), %ymm5
vpxor	%xmm0, %xmm4, %xmm6
vpmovzxbw	%xmm6, %ymm6
vpmovzxbw	-1(%r15,%r13), %ymm7
vpmullw	%ymm6, %ymm5, %ymm5
vpmovzxbw	%xmm4, %ymm4
vpmullw	%ymm4, %ymm7, %ymm4
vpaddw	%ymm4, %ymm5, %ymm4
vpaddw	%ymm1, %ymm4, %ymm4
vpmulhuw	%ymm2, %ymm4, %ymm4
vpsrlw	$7, %ymm4, %ymm4
vpand	%ymm3, %ymm4, %ymm4
vmovdqu	%ymm4, (%rbx,%r13,2)
addq	$16, %r13
decq	%r10
jne	.LBB0_10

arm:

ldr	q0, [x17]
ldur	q2, [x17, #-1]
ldur	q1, [x17, #-2]
subs	x0, x0, #1                      // =1
mvn	v3.16b, v0.16b
umull	v4.8h, v2.8b, v0.8b
umull2	v0.8h, v2.16b, v0.16b
umlal	v4.8h, v1.8b, v3.8b
umlal2	v0.8h, v1.16b, v3.16b
urshr	v1.8h, v4.8h, #8
urshr	v2.8h, v0.8h, #8
raddhn	v1.8b, v1.8h, v4.8h
raddhn	v0.8b, v2.8h, v0.8h
ushll	v0.8h, v0.8b, #0
ushll	v1.8h, v1.8b, #0
add	x17, x17, #16                   // =16
stp	q1, q0, [x18, #-16]
add	x18, x18, #32                   // =32
b.ne	.LBB0_10

After:

x86:

vpmovzxbw	-2(%r15,%r13), %ymm3
vmovdqu	(%r15,%r13), %xmm4
vpxor	%xmm0, %xmm4, %xmm5
vpmovzxbw	%xmm5, %ymm5
vpmullw	%ymm5, %ymm3, %ymm3
vpmovzxbw	-1(%r15,%r13), %ymm5
vpmovzxbw	%xmm4, %ymm4
vpmullw	%ymm4, %ymm5, %ymm4
vpaddw	%ymm4, %ymm3, %ymm3
vpaddw	%ymm1, %ymm3, %ymm3
vpmulhuw	%ymm2, %ymm3, %ymm3
vpsrlw	$7, %ymm3, %ymm3
vmovdqu	%ymm3, (%rbp,%r13,2)
addq	$16, %r13
decq	%r10
jne	.LBB0_10

arm:

ldr	q0, [x17]
ldur	q2, [x17, #-1]
ldur	q1, [x17, #-2]
subs	x0, x0, #1                      // =1
mvn	v3.16b, v0.16b
umull	v4.8h, v2.8b, v0.8b
umull2	v0.8h, v2.16b, v0.16b
umlal	v4.8h, v1.8b, v3.8b
umlal2	v0.8h, v1.16b, v3.16b
ursra	v4.8h, v4.8h, #8
ursra	v0.8h, v0.8h, #8
urshr	v1.8h, v4.8h, #8
urshr	v0.8h, v0.8h, #8
add	x17, x17, #16                   // =16
stp	q1, q0, [x18, #-16]
add	x18, x18, #32                   // =32
b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a rounding add and shift right instead of a rounding narrowing add-shift-right followed by a widen.
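
For reference, here is a minimal Halide sketch (not taken from the PR; the names, schedule, and output filename are illustrative) of the kind of pipeline whose inner loop is shown above:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Two u8 inputs to blend and a u8 weight.
    ImageParam a(UInt(8), 1), b(UInt(8), 1), w(UInt(8), 1);
    Var x;
    Func f("lerp_cast");

    // A u8 lerp whose result is immediately widened to u16. With this PR the
    // lowering folds the final cast into the lerp, rather than narrowing to u8
    // and then widening again.
    f(x) = cast<uint16_t>(lerp(a(x), b(x), w(x)));
    f.vectorize(x, 16);

    f.compile_to_assembly("lerp_cast.s", {a, b, w}, "lerp_cast");
    return 0;
}
```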

@steven-johnson (Contributor)

The OSX-x86 bot is down for repairs; probably OK to ignore for this.

@steven-johnson (Contributor)

Looks all green, ok to land?

@abadams (Member, Author) commented Dec 9, 2021

I could have sworn I saw a real test failure yesterday. Let me run the fuzzer test I added overnight to see if I can find one before we merge.

@abadams (Member, Author) commented Dec 9, 2021

Running overnight found failures. Do not merge.

@abadams (Member, Author) commented Dec 9, 2021

There was a "bug" in the test, but it's really a difference from master, and it raises a question about the definition. What should lerp return when the weight is a float greater than one? Is it an undefined value? With this PR we sometimes manage to successfully extrapolate in cases where we used to wrap. I could make it wrap, as it did before, fairly easily.

I think @zvookin originally wrote this lerping code, so tagging him for an opinion on what the behavior should be. I imagine the issue is that on some shader targets the native lerp has UB if the weight is outside [0, 1]?

@dsharletg (Contributor)

I think for widening lerps, extrapolation makes complete sense, and I'd be very surprised by the wrapping behavior. Hopefully that doesn't conflict with some native lerp implementations; if it does, perhaps we shouldn't use the native lerp on those platforms.

@abadams (Member, Author) commented Dec 9, 2021

Specifically, it's exprs like `cast<uint16_t>(lerp(some_u8, some_u8, 1.2f))`.

That does not look like it should be able to produce values outside the range of a uint8, but it does with this PR.
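
To make the concern concrete, here is a hedged illustration (made-up values, and the rounding is only approximate) of what wrapping vs. extrapolation means for that kind of expression:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint8_t a = 100, b = 250;
    float w = 1.2f;

    // A float-weighted lerp is roughly a + w * (b - a).
    float exact = a + w * (b - a);                   // 280.0f: extrapolates past b
    uint8_t wrapped = (uint8_t)(int)(exact + 0.5f);  // 24: a u8 result wraps mod 256
    uint16_t widened = (uint16_t)(exact + 0.5f);     // 280: fits once the result is u16

    printf("exact=%.1f wrapped=%u widened=%u\n", exact, (unsigned)wrapped, (unsigned)widened);
    return 0;
}
```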

@dsharletg (Contributor) commented Dec 9, 2021 via email

@abadams (Member, Author) commented Dec 10, 2021

I think even if lerp returns an undefined value for out-of-range weights, it should produce a uint8 output, even when there's a surrounding cast. I'll change it.

@abadams merged commit 7fe1e2c into master on Dec 10, 2021
@mcourteaux (Contributor) commented Dec 16, 2021

I don't know why I saw this PR, but what about a hint for the lerp in terms of behavior? Like: extrapolate, clamp, wrap, dontcare. Lowering the lerp could then check how the native lerping intrinsic handles weights outside [0, 1] and, if necessary, modify the statements to comply with the hint. dontcare would be used when you know up front that you will always have weights in [0, 1]; that way, even if the target has UB for lerping outside of [0, 1], you still get the best performance possible.
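
A hypothetical sketch of what such a hint could look like; Halide has no such overload today, and the names here are purely illustrative:

```cpp
// Hypothetical API sketch only -- this overload does not exist in Halide;
// names are purely illustrative.
#include "Halide.h"

enum class LerpWeightHint {
    Extrapolate,  // weights outside [0, 1] extrapolate in the result type
    Clamp,        // weights are clamped to [0, 1] before blending
    Wrap,         // out-of-range results wrap in the destination type
    DontCare      // caller promises weights are in [0, 1]; pick the fastest lowering
};

// Lowering could consult the hint and fall back to a non-native sequence when
// the target's native lerp has undefined behavior for out-of-range weights.
Halide::Expr lerp(Halide::Expr zero_val, Halide::Expr one_val, Halide::Expr weight,
                  LerpWeightHint hint = LerpWeightHint::DontCare);
```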
