
Compute comparison masks in narrower types if possible #7392

Merged
merged 9 commits into main on Mar 25, 2023

Conversation

@abadams (Member) commented Mar 3, 2023

In various circumstances (e.g. boundary conditions) we generate comparisons between ramps and broadcasts and use them either as a load/store predicate or as a select condition. The comparison is currently done in 32 bits. When the result is used to mux between narrow types, this generates multiple vectors of comparison masks, which must then be narrowed. With some care, it's possible to instead perform the comparison directly in the narrow type.

For example, if we're selecting between uint8s and we have the condition:
ramp(x, 1, 16) < broadcast(y)
where x is an Int(32), we can rewrite it to:
cast<int8_t>(ramp(0, 1, 16)) < broadcast(saturating_cast<int8_t>(saturating_sub(y, x)))
This is valid because the ramp lanes (0 through 15) can't take on the extreme values of an int8, so saturating the other side of the comparison can't flip the result.
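
To sanity-check the rewrite, here's a scalar model in C++ (a sketch only: saturating_sub_i32 and saturating_cast_i8 are stand-ins built from std::clamp over a widened type, and the checked window of x and y values is arbitrary; this is not Halide's implementation):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int32_t saturating_sub_i32(int32_t y, int32_t x) {
        return (int32_t)std::clamp<int64_t>((int64_t)y - x, INT32_MIN, INT32_MAX);
    }

    int8_t saturating_cast_i8(int64_t v) {
        return (int8_t)std::clamp<int64_t>(v, INT8_MIN, INT8_MAX);
    }

    int main() {
        // Check a window of ramp bases x and thresholds y around the int8 boundaries.
        for (int64_t x = -300; x <= 300; x++) {
            for (int64_t y = -300; y <= 300; y++) {
                int8_t rhs = saturating_cast_i8(saturating_sub_i32((int32_t)y, (int32_t)x));
                for (int i = 0; i < 16; i++) {
                    bool wide = (x + i) < y;        // lane i of the original wide compare
                    bool narrow = (int8_t)i < rhs;  // lane i of the narrowed 8-bit compare
                    assert(wide == narrow);
                }
            }
        }
        return 0;
    }

The assert never fires because the ramp lanes 0..15 lie strictly inside the int8 range, so saturating the broadcast side to 127 or -128 never changes the outcome of the comparison.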

For an example of the generated assembly, consider the uint8 expression select(x < 50, f(x), 17). On main, when vectorized 64-wide for AVX-512, this compiles to:

	leal	(%rbx,%rsi), %edi
	vpbroadcastd	%edi, %zmm6
	vpaddd	%zmm0, %zmm6, %zmm7
	vpaddd	%zmm1, %zmm6, %zmm8
	vpaddd	%zmm2, %zmm6, %zmm9
	vpaddd	%zmm3, %zmm6, %zmm6
	vpcmpgtd	%zmm6, %zmm4, %k0
	vpcmpgtd	%zmm9, %zmm4, %k1
	vpcmpgtd	%zmm8, %zmm4, %k2
	vpcmpgtd	%zmm7, %zmm4, %k3
	kunpckwd	%k0, %k1, %k0
	kunpckwd	%k2, %k3, %k1
	kunpckdq	%k0, %k1, %k1
	vpblendmb	(%rax,%rsi), %zmm5, %zmm6 {%k1}
	vmovdqu64	%zmm6, (%r15,%rsi)
	addq	$64, %rsi
	cmpq	%rsi, %rdx
	jne	.LBB0_21

In this branch it generates:

	cmpl	$127, %esi
	movl	$127, %r9d
	cmovll	%esi, %r9d
	cmpl	$-127, %r9d
	cmovll	%r8d, %r9d
	vpbroadcastb	%r9d, %zmm2
	vpcmpgtb	%zmm0, %zmm2, %k1
	vpblendmb	(%rax,%rdi), %zmm1, %zmm2 {%k1}
	vmovdqu64	%zmm2, (%r15,%rdi)
	addq	$64, %rdi
	addl	$-64, %esi
	cmpq	%rdi, %rdx
	jne	.LBB0_21

According to llvm-mca, the latter is 3x faster than the former.

@abadams abadams requested a review from rootjalex March 3, 2023 18:16
@steven-johnson (Contributor)

the latter is 3x faster than the former

!!!

(I'll definitely be pulling this into google3 for some testing once we get some buildbots green)

return !narrow_predicate(op->a, t);
}

const LT *lt = p.as<LT>();
Contributor

Presumably we've normalized away the GT and GE at this point?

Member Author

Yes, those should be long-gone

@abadams (Member, Author) commented Mar 3, 2023

I think I can avoid the issue of y - x overflowing if I use saturating_sub. Given that it's feeding into a saturating_narrow, I don't think it should change the result compared to an infinite-precision y - x. Does that sound right, @rootjalex?
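
That argument can be spot-checked at the int32 extremes with a similar scalar sketch (again using std::clamp stand-ins with invented names, not Halide's actual lowering): clamping to int32 only changes differences whose magnitude is at least 2^31, and anything that large narrows to the same int8 extreme whether or not it was clamped first.

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int8_t sat_narrow_i8(int64_t v) {
        return (int8_t)std::clamp<int64_t>(v, INT8_MIN, INT8_MAX);
    }

    int32_t sat_sub_i32(int32_t y, int32_t x) {
        return (int32_t)std::clamp<int64_t>((int64_t)y - x, INT32_MIN, INT32_MAX);
    }

    int main() {
        const int32_t extremes[] = {INT32_MIN, INT32_MIN + 1, -129, -128, -1, 0,
                                    1, 126, 127, INT32_MAX - 1, INT32_MAX};
        for (int32_t y : extremes) {
            for (int32_t x : extremes) {
                int64_t exact = (int64_t)y - x;  // infinite-precision difference
                // Saturating at int32 first gives the same narrowed result.
                assert(sat_narrow_i8(sat_sub_i32(y, x)) == sat_narrow_i8(exact));
            }
        }
        return 0;
    }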

@steven-johnson (Contributor)

(Looks like some of the buildbots happened to sync to bad LLVM17 revs last night -- I'm forcing rebuilds on those. Linux will be ready soon, but the armbots will take ~all day)

auto rewrite = IRMatcher::rewriter(p, Int(32, lanes));

// Construct predicates which state the ramp can't hit the extreme
// values of an int8 or an int16. This is an overconservative condition,
Member

At first I was going to suggest saturating_narrow on the ramp because it's faster (on x86 at least, and possibly others?) before I realized the rewritten ramps below are all constant-folded. If we instead used bounds inference to prove this for symbolic ramps, we should probably saturating_cast the ramp as well, right?

auto min_ramp_lane = min(c0, c0 * (lanes - 1));
auto max_ramp_lane = max(c0, c0 * (lanes - 1));
auto ramp_fits_in_i8 = min_ramp_lane > -128 && max_ramp_lane < 127;
auto ramp_fits_in_i16 = min_ramp_lane > -32768 && max_ramp_lane < 32767;
Member

Might want a comment here explaining why these inequalities are strict. It took me a minute to work through why.

Member Author

Is the comment immediately above not sufficient? I'm checking they can't hit the extreme values of the narrower type. I think in some cases it's fine, but it's quite hard to think about.

Member

I think an explanation of why this condition is necessary would be helpful. Also, I think the inequality only needs to be strict for a strict <; pretty sure it can be non-strict (is there a word for this?) for a <=.

Member Author

Comment updated. I found it very hard to think through the cases, and just ended up using the most conservative condition for all of them.
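
For a concrete instance of why the ramp must stay strictly inside the int8 range (an illustrative sketch, not code from this PR): if a lane could equal 127, the saturated broadcast can no longer distinguish "equal to 127" from "greater than 127".

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int8_t saturating_cast_i8(int64_t v) {
        return (int8_t)std::clamp<int64_t>(v, INT8_MIN, INT8_MAX);
    }

    int main() {
        int32_t lane = 127;  // a hypothetical ramp lane sitting exactly at int8's max
        int32_t rhs = 200;   // broadcast value after subtracting the ramp base
        bool wide = lane < rhs;                                // true in 32 bits
        bool narrow = (int8_t)lane < saturating_cast_i8(rhs);  // 127 < 127 -> false
        assert(wide && !narrow);  // the two disagree, so lanes hitting 127 must be excluded
        return 0;
    }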

@rootjalex (Member)

Seems like some legit failures producing signed integer overflow in constant folding

@abadams (Member, Author) commented Mar 20, 2023

The signed overflow was caused by our lowering of saturating_add and saturating_sub flirting with int32 overflow in the simplifier by introducing INT_MAX and INT_MIN constants. I rewrote the lowering to only do unsigned math (and to use substantially fewer ops). I brute-force checked the correctness of the new lowerings for int8 types.
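
For reference, here's a generic unsigned-only formulation of signed saturating add/sub in C++, with the kind of exhaustive int8 check described above (a standard bit-twiddling sketch; the helper names are invented and this is not necessarily the exact lowering the PR emits):

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int8_t sat_add_i8(int8_t a, int8_t b) {
        uint8_t ua = (uint8_t)a, ub = (uint8_t)b;
        uint8_t sum = (uint8_t)(ua + ub);                 // wrapping sum
        uint8_t saturated = (uint8_t)((ua >> 7) + 0x7f);  // 0x7f if a >= 0, else 0x80
        // Signed overflow happened iff a and b share a sign and the sum's sign differs.
        if (((ua ^ sum) & (ub ^ sum)) & 0x80) {
            sum = saturated;
        }
        return (int8_t)sum;
    }

    int8_t sat_sub_i8(int8_t a, int8_t b) {
        uint8_t ua = (uint8_t)a, ub = (uint8_t)b;
        uint8_t diff = (uint8_t)(ua - ub);                // wrapping difference
        uint8_t saturated = (uint8_t)((ua >> 7) + 0x7f);  // 0x7f if a >= 0, else 0x80
        // Overflow iff a and b have different signs and the result's sign differs from a's.
        if (((ua ^ ub) & (ua ^ diff)) & 0x80) {
            diff = saturated;
        }
        return (int8_t)diff;
    }

    int main() {
        // Brute-force check against a widened reference over all int8 pairs.
        for (int a = -128; a <= 127; a++) {
            for (int b = -128; b <= 127; b++) {
                assert(sat_add_i8((int8_t)a, (int8_t)b) == std::clamp(a + b, -128, 127));
                assert(sat_sub_i8((int8_t)a, (int8_t)b) == std::clamp(a - b, -128, 127));
            }
        }
        return 0;
    }

The point of doing both the wrapping arithmetic and the overflow test in unsigned types is that no intermediate expression can overflow a signed type, which is the property the rewritten lowering is after.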

@abadams (Member, Author) commented Mar 24, 2023

PTAL (please take a look)

@rootjalex (Member)

I brute-force checked the correctness of the new lowerings for int8 types.

I'd be more comfortable with the changes if they were formally verified. I'd offer to do it myself, but I don't have time while traveling this week.

Comment on lines 1256 to 1257
Expr ua = cast(u, a);
Expr ub = cast(u, b);
Member

Should these just be reinterprets?

Comment on lines 1283 to 1284
Expr ua = cast(u, a);
Expr ub = cast(u, b);
Member

These should probably also just be reinterprets?

@abadams (Member, Author) commented Mar 24, 2023

wrt the new lowerings, I'm pretty confident that if they're checked correct for int8 then they're right for any signed integer, so I don't think those need formal verification.

Did you also want to see formal verification of the rewrite rules in FindIntrinsics.cpp? I might wait until you're not travelling and then get you to show me the easiest way to do that. I haven't needed to do it in a while.

@rootjalex (Member)

I was referring to the new lowerings because they are harder for me to convince myself about than the additions to FindIntrinsics.cpp, but now that you mention it, those could also be verified (by scalarizing the ramps).

I'm happy to show you how I'd do it after traveling, but I don't necessarily want to hold this PR back (I probably won't have time until the first week of April). I am reasonably convinced of the correctness of these rewrites, so I will approve this.

@abadams abadams merged commit ab5f042 into main Mar 25, 2023
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* Compute comparison masks in narrower types if possible

* Remove reliance on infinite precision int32s

* Further elaborate on comment

* Lower signed saturating_add and sub to unsigned math

The existing lowering was prone to overflow

* cast -> reinterpret