
Let lerp lowering incorporate a final cast. #6480

Merged: 4 commits, Dec 10, 2021

Conversation

@abadams (Member) commented Dec 7, 2021

This lets lerp lowering save a few instructions on x86 and ARM, and probably on other CPU targets.

`cast(UInt(16), lerp(some_u8s))` produces the following code before and after this PR:

Before:

x86:

vmovdqu	(%r15,%r13), %xmm4
vpmovzxbw	-2(%r15,%r13), %ymm5
vpxor	%xmm0, %xmm4, %xmm6
vpmovzxbw	%xmm6, %ymm6
vpmovzxbw	-1(%r15,%r13), %ymm7
vpmullw	%ymm6, %ymm5, %ymm5
vpmovzxbw	%xmm4, %ymm4
vpmullw	%ymm4, %ymm7, %ymm4
vpaddw	%ymm4, %ymm5, %ymm4
vpaddw	%ymm1, %ymm4, %ymm4
vpmulhuw	%ymm2, %ymm4, %ymm4
vpsrlw	$7, %ymm4, %ymm4
vpand	%ymm3, %ymm4, %ymm4
vmovdqu	%ymm4, (%rbx,%r13,2)
addq	$16, %r13
decq	%r10
jne	.LBB0_10

arm:

ldr	q0, [x17]
ldur	q2, [x17, #-1]
ldur	q1, [x17, #-2]
subs	x0, x0, #1                      // =1
mvn	v3.16b, v0.16b
umull	v4.8h, v2.8b, v0.8b
umull2	v0.8h, v2.16b, v0.16b
umlal	v4.8h, v1.8b, v3.8b
umlal2	v0.8h, v1.16b, v3.16b
urshr	v1.8h, v4.8h, #8
urshr	v2.8h, v0.8h, #8
raddhn	v1.8b, v1.8h, v4.8h
raddhn	v0.8b, v2.8h, v0.8h
ushll	v0.8h, v0.8b, #0
ushll	v1.8h, v1.8b, #0
add	x17, x17, #16                   // =16
stp	q1, q0, [x18, #-16]
add	x18, x18, #32                   // =32
b.ne	.LBB0_10

After:

x86:

vpmovzxbw	-2(%r15,%r13), %ymm3
vmovdqu	(%r15,%r13), %xmm4
vpxor	%xmm0, %xmm4, %xmm5
vpmovzxbw	%xmm5, %ymm5
vpmullw	%ymm5, %ymm3, %ymm3
vpmovzxbw	-1(%r15,%r13), %ymm5
vpmovzxbw	%xmm4, %ymm4
vpmullw	%ymm4, %ymm5, %ymm4
vpaddw	%ymm4, %ymm3, %ymm3
vpaddw	%ymm1, %ymm3, %ymm3
vpmulhuw	%ymm2, %ymm3, %ymm3
vpsrlw	$7, %ymm3, %ymm3
vmovdqu	%ymm3, (%rbp,%r13,2)
addq	$16, %r13
decq	%r10
jne	.LBB0_10

arm:

ldr	q0, [x17]
ldur	q2, [x17, #-1]
ldur	q1, [x17, #-2]
subs	x0, x0, #1                      // =1
mvn	v3.16b, v0.16b
umull	v4.8h, v2.8b, v0.8b
umull2	v0.8h, v2.16b, v0.16b
umlal	v4.8h, v1.8b, v3.8b
umlal2	v0.8h, v1.16b, v3.16b
ursra	v4.8h, v4.8h, #8
ursra	v0.8h, v0.8h, #8
urshr	v1.8h, v4.8h, #8
urshr	v0.8h, v0.8h, #8
add	x17, x17, #16                   // =16
stp	q1, q0, [x18, #-16]
add	x18, x18, #32                   // =32
b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a rounding add and shift right instead of a rounding narrowing add-shift-right followed by a widen.
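
For reference, here is a minimal Halide sketch (not taken from the PR; the names, schedule, and output filename are illustrative) of the kind of pipeline whose inner loop is shown above:

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Two u8 inputs to blend and a u8 weight.
    ImageParam a(UInt(8), 1), b(UInt(8), 1), w(UInt(8), 1);
    Var x;
    Func f("lerp_cast");

    // A u8 lerp whose result is immediately widened to u16. With this PR the
    // lowering folds the final cast into the lerp, rather than narrowing to u8
    // and then widening again.
    f(x) = cast<uint16_t>(lerp(a(x), b(x), w(x)));
    f.vectorize(x, 16);

    f.compile_to_assembly("lerp_cast.s", {a, b, w}, "lerp_cast");
    return 0;
}
```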

@steven-johnson (Contributor)

The OSX-x86 bot is down for repairs; probably OK to ignore for this.

@steven-johnson (Contributor)

Looks all green, ok to land?

@abadams (Member, Author) commented Dec 9, 2021

I could have sworn I saw a real test failure yesterday. Let me run the fuzzer test I added overnight to see if I can find one before we merge.

@abadams (Member, Author) commented Dec 9, 2021

Running overnight found failures. Do not merge.

@abadams (Member, Author) commented Dec 9, 2021

There was a "bug" in the test, but it's really a difference from master, and it raises a question about the definition. What should lerp return when the weight is a float greater than one? Is it an undefined value? With this PR we sometimes manage to successfully extrapolate in cases where we used to wrap. I could make it wrap, as it did before, fairly easily.

I think @zvookin originally wrote this lerping code, so tagging him for an opinion on what the behavior should be. I imagine the issue is that on some shader targets the native lerp has UB if the weight is outside [0, 1]?

@dsharletg (Contributor)

I think for widening lerps, extrapolation makes complete sense, and I'd be very surprised by the wrapping behavior. Hopefully that doesn't conflict with some native lerp implementations; if it does, perhaps we shouldn't use the native lerp on those platforms.

@abadams (Member, Author) commented Dec 9, 2021

Specifically, it's exprs like `cast<uint16_t>(lerp(some_u8, some_u8, 1.2f))`.

That does not look like it should be able to produce values outside the range of a uint8, but it does with this PR.
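
To make the concern concrete, here is a hedged illustration (made-up values, and the rounding is only approximate) of what wrapping vs. extrapolation means for that kind of expression:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    uint8_t a = 100, b = 250;
    float w = 1.2f;

    // A float-weighted lerp is roughly a + w * (b - a).
    float exact = a + w * (b - a);                   // 280.0f: extrapolates past b
    uint8_t wrapped = (uint8_t)(int)(exact + 0.5f);  // 24: a u8 result wraps mod 256
    uint16_t widened = (uint16_t)(exact + 0.5f);     // 280: fits once the result is u16

    printf("exact=%.1f wrapped=%u widened=%u\n", exact, (unsigned)wrapped, (unsigned)widened);
    return 0;
}
```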

@dsharletg (Contributor) commented Dec 9, 2021 via email

@abadams (Member, Author) commented Dec 10, 2021

I think even if lerp returns an undefined value for out-of-range weights, it should produce a uint8 output, even when there's a surrounding cast. I'll change it.

@abadams merged commit 7fe1e2c into master on Dec 10, 2021
@mcourteaux (Contributor) commented Dec 16, 2021

I don't know why I saw this PR, but what about a hint for the lerp in terms of behavior? Like: extrapolate, clamp, wrap, dontcare. Lowering the lerp could then check how the native lerping intrinsic handles weights outside [0, 1] and, if necessary, modify the statements to comply with the hint. dontcare would be used when you know up front that you will always have weights in [0, 1]; that way, even if the target has UB for lerping outside of [0, 1], you still get the best performance possible.
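
A hypothetical sketch of what such a hint could look like; Halide has no such overload today, and the names here are purely illustrative:

```cpp
// Hypothetical API sketch only -- this overload does not exist in Halide;
// names are purely illustrative.
#include "Halide.h"

enum class LerpWeightHint {
    Extrapolate,  // weights outside [0, 1] extrapolate in the result type
    Clamp,        // weights are clamped to [0, 1] before blending
    Wrap,         // out-of-range results wrap in the destination type
    DontCare      // caller promises weights are in [0, 1]; pick the fastest lowering
};

// Lowering could consult the hint and fall back to a non-native sequence when
// the target's native lerp has undefined behavior for out-of-range weights.
Halide::Expr lerp(Halide::Expr zero_val, Halide::Expr one_val, Halide::Expr weight,
                  LerpWeightHint hint = LerpWeightHint::DontCare);
```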
