
cmd/compile/ssa: conditional select instructions #21391

Closed

philhofer opened this issue Aug 10, 2017 · 15 comments

@philhofer (Contributor) commented Aug 10, 2017

Consider: (from https://go-review.googlesource.com/c/54656)

	if B > 15 {
		B = 15
	}

Right now, the best the amd64 backend can do for this basic block is to generate a conditional branch forward over the B = 15 statement. Ideally, though, we'd have it generate a CMOV instead.

EDIT (after some experimentation):

Here's my proposal:

Introduce a new family of generic SSA instructions. For now, let's say it's just CondSelect64, CondSelect32, CondSelect16, and CondSelect8, all with the form (CondSelectXX a b bool) and the semantics bool ? a : b. (It doesn't appear that FPU conditional moves are widely supported, so let's forget about those for now.)
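
Written as ordinary Go, the semantics of the proposed ops amount to the following (a sketch of the intended meaning, not compiler code):

	// condSelect64 mirrors the semantics of the proposed CondSelect64
	// op: bool ? a : b, evaluated without a branch in the lowered code.
	func condSelect64(cond bool, a, b int64) int64 {
		if cond {
			return a
		}
		return b
	}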

Then, we add an additional pass to the SSA backend to recognize CFGs with the following form:

bb0
 | \
 |  bb1
 | /
bb2

where bb1 is "trivial" (for some definition of trivial). We can combine all three basic blocks into a single basic block where bb2's Phi instructions are replaced with the appropriate conditional moves.

I wrote the pass to detect this particular CFG arrangement and limited it to the case in which bb1 has two or fewer instructions and all of those instructions have no side-effects. When building the toolchain and stdlib, I recorded 3755 such cases.
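
A minimal sketch of that shape test, using simplified stand-in types rather than the real cmd/compile/internal/ssa ones (an illustration of the check described above, not the actual pass):

	// Block and Value are simplified stand-ins for the SSA types.
	type Block struct {
		Kind   string // "If", "Plain", ...
		Succs  []*Block
		Preds  []*Block
		Values []*Value
	}

	type Value struct {
		HasSideEffects bool
	}

	// isTriangle reports whether bb0 heads the CFG triangle from the
	// diagram above, with a "trivial" bb1: at most two values, none
	// with side effects.
	func isTriangle(bb0 *Block) bool {
		if bb0.Kind != "If" || len(bb0.Succs) != 2 {
			return false
		}
		bb1, bb2 := bb0.Succs[0], bb0.Succs[1]
		if len(bb1.Succs) != 1 || bb1.Succs[0] != bb2 {
			bb1, bb2 = bb2, bb1 // try the other branch order
		}
		if len(bb1.Succs) != 1 || bb1.Succs[0] != bb2 ||
			len(bb1.Preds) != 1 || len(bb2.Preds) != 2 {
			return false
		}
		if len(bb1.Values) > 2 {
			return false
		}
		for _, v := range bb1.Values {
			if v.HasSideEffects {
				return false
			}
		}
		return true
	}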

On the architectures with which I am familiar (arm64, amd64, arm), the rewrite rules for the CondSelect instructions are trivial. However, I don't think I can implement the backend rewrite rules on all architectures (I don't have access to the hardware), so the pass that introduces those instructions will have to be gated to just the architectures for which the lowering rules are implemented.
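
The gating could be as simple as a per-arch capability check; a hedged sketch (the helper name and switch are illustrative, not real compiler code; it lists just the architectures named above):

	// archHasCondSelect reports whether the backend implements lowering
	// rules for the generic CondSelect ops. Hypothetical helper.
	func archHasCondSelect(arch string) bool {
		switch arch {
		case "amd64", "arm", "arm64":
			return true
		}
		return false
	}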

CC @josharian @randall77

@randall77 (Contributor)

It will be a bit tricky to represent in SSA. You will need additional args to any conditionally executed Value. You'll certainly need a flags value to direct the conditionality. And possibly also the old values of all the registers that get conditionally overwritten. So it isn't as simple as adding a few bits to a Value and reusing already existing opcodes.

Another option is to do it in SSA->Prog. We could detect & rewrite there more cleanly. It would mean duplicating the logic for each arch, though.

@philhofer (Contributor, Author)

I'm okay with doing it in SSA->Prog. I had some hope that maybe the logic for detecting these opportunities could be made portable, but I guess we'll see how realistic that is once I start writing the code. Perhaps it can work cleanly as a phase that runs after register allocation.

@philhofer (Contributor, Author)

Alternatively, we could do what LLVM does and introduce a "select" instruction in the portable SSA (we'd have to name it something different, I guess, since we already have one), and then lower it to the right conditional moves later.

@TocarIP (Contributor) commented Aug 11, 2017

Do we have a good way to find unpredictable branches?
For well-predicted (especially not-taken) branches, using a conditional mov may be a pessimization.

philhofer changed the title from "cmd/compile/ssa: predicated blocks" to "cmd/compile/ssa: conditional select instructions" on Aug 12, 2017
@philhofer (Contributor, Author)

I updated the issue description and issue title to reflect the work I've been doing on this. I have a better idea about how this will work now.

Re: performance, replacing a test-and-branch with a test-and-cmov is still two instructions on amd64. Why would that be slower, even when the branch is never taken? I'd expect both to be retired in one cycle in the ordinary case. Let's see what the benchmarks say.

@ianlancetaylor (Member)

Different instructions take different amounts of time. On some old Intel processors, the conditional move instruction is very slow. That is just an example; conditional moves are fast enough on modern processors. But I think the real win is when you can use them for unpredictable conditionals, to avoid branch mispredictions. Even today I don't think they are a win if the conditional branch is very predictable.

@philhofer (Contributor, Author)

> Different instructions take different amounts of time.

I agree!

> On some old Intel processors, the conditional move instruction is very slow. That is just an example; conditional moves are fast enough on modern processors.

Yes; I seem to remember a Torvalds rant about that re: Pentium III and IV. I assume it has been fixed by now. On an ARM64 chip, a CSEL costs exactly the same as a regular MOV, and the same as a perfectly predicted branch. The same is true for ARM (unexecuted predicated instructions burn one cycle). I think the old x86 hardware is the exception rather than the rule. We can always skip this optimization on x86 if it turns out this is still the case.

> But I think the real win is when you can use them for unpredictable conditionals, to avoid branch mispredictions. Even today I don't think they are a win if the conditional branch is very predictable.

I'm not entirely sure about this, for a couple of reasons:

  1. If the (un-taken) basic block (bb1, in the diagram up top) is laid out between bb0 and bb2, then not only is the jump taken (forwards!), but the effective code density is lower. If the compiler happens to lay things out such that bb1 is a forward branch and bb0 falls through to bb2, then you emit an extra unconditional jump at the bottom of bb1 to get back to bb2. Emitting a conditional move has the same effective code density as the best case.

  2. This is purely anecdotal, but the real-world Go programs that I've had to tune have had profiles that were flat and slow (i.e., 60%+ of the execution time was spent outside the top 5 function calls.) I think that's a natural state for many (bloated) programs after they've had obvious hot spots optimized. Those programs aren't going to get much help from dynamic branch prediction in the first place; branch caches aren't that big. But, code size reductions do help the program counter trudge through all that bloat a bit quicker.

Let's see what the benchmarks say once I've run them.
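
For concreteness, the kind of micro-benchmark that could exercise this pattern might look like the sketch below (names and numbers are illustrative; this is not code from any CL):

	package clamp_test

	import "testing"

	var sink int64

	// BenchmarkClamp runs the branch-vs-CMOV candidate from the issue
	// description over a mix of values above and below the threshold.
	func BenchmarkClamp(b *testing.B) {
		xs := make([]int64, 1024)
		for i := range xs {
			xs[i] = int64(i % 31)
		}
		for i := 0; i < b.N; i++ {
			var s int64
			for _, x := range xs {
				if x > 15 {
					x = 15 // the block a CMOV would replace
				}
				s += x
			}
			sink = s // keep the loop from being optimized away
		}
	}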

@josharian (Contributor)

> branch caches aren't that big

Also, the discussion in #18977 around branch cache collisions (search for "hash(IP)") makes me suspect that there is non-local, stochastic value in removing branches outside of hot code.

It's sometimes also instructive to see what gcc/llvm does with similar C code. Also, some conditional calculations can be translated into branchless code with arithmetic tricks, as in the sketch below; not sure whether those are common enough to be worth encoding into rules. cc @martisch on the last front.
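
For instance, the clamp from the top of this issue can be computed branchlessly with an arithmetic right shift (a sketch, assuming a signed 32-bit value and ignoring overflow at the extreme negative end):

	// clamp15 returns min(x, 15) without a branch: d>>31 is all ones
	// exactly when x < 15, so the mask selects d (giving back x) in
	// that case and 0 (giving 15) otherwise.
	func clamp15(x int32) int32 {
		d := x - 15
		return 15 + (d & (d >> 31))
	}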

@martisch (Contributor)

I think if we try this we should start only with simple cases like:

	if cond {
		B = 15
	}

and

	if cond {
		B = 15
	} else {
		B = 1
	}

since these do not have data dependencies.
Bool-to-int conversions of the form if cond { b = 1 } else { b = 0 } should be left to SETcc on amd64.

I am less worried that cmov itself is slow on modern amd64; I would instead be careful if the assignment to B is something that is not immediately available, since that introduces a stall waiting for the data which a correctly predicted branch would have avoided.

We could also survey what LLVM and gcc have opted to optimize with cmov to collect more information about the trade-offs.

> Re: performance, replacing a test-and-branch with a test-and-cmov is still two instructions on amd64.

Many test-branch pairs are macro-op fused into one internal op on modern amd64; see the Intel® 64 and IA-32 Architectures Optimization Reference Manual, §3.4.2.2, "Optimizing for Macro-fusion".

/cc @TocarIP

@philhofer (Contributor, Author) commented Aug 13, 2017

I have a draft of this change on arm64 here: https://github.com/philhofer/go/tree/arm64-csel

Gzip gets 10% faster (!); fannkuch also gets a little bump. Not much else does.

name                      old time/op    new time/op    delta
BinaryTree17-96              15.9s ± 6%     16.4s ± 9%     ~     (p=0.075 n=10+10)
Fannkuch11-96                7.17s ± 0%     7.05s ± 0%   -1.74%  (p=0.000 n=8+9)
FmtFprintfEmpty-96           208ns ± 1%     208ns ± 0%     ~     (p=0.092 n=10+9)
FmtFprintfString-96          379ns ± 0%     378ns ± 0%     ~     (p=0.470 n=10+10)
FmtFprintfInt-96             385ns ± 0%     388ns ± 0%   +0.78%  (p=0.000 n=9+10)
FmtFprintfIntInt-96          591ns ± 0%     592ns ± 0%   +0.22%  (p=0.000 n=7+10)
FmtFprintfPrefixedInt-96     656ns ± 0%     668ns ± 0%   +1.78%  (p=0.000 n=10+10)
FmtFprintfFloat-96           967ns ± 0%     988ns ± 0%   +2.23%  (p=0.000 n=10+10)
FmtManyArgs-96              2.35µs ± 0%    2.33µs ± 0%   -0.87%  (p=0.000 n=9+10)
GobDecode-96                31.0ms ± 0%    30.5ms ± 2%   -1.54%  (p=0.003 n=9+9)
GobEncode-96                24.4ms ± 0%    24.4ms ± 0%   +0.13%  (p=0.000 n=9+9)
Gzip-96                      1.60s ± 0%     1.43s ± 0%  -10.43%  (p=0.000 n=9+10)
Gunzip-96                    167ms ± 0%     168ms ± 0%   +0.73%  (p=0.000 n=8+9)
HTTPClientServer-96          311µs ± 1%     312µs ± 0%   +0.60%  (p=0.000 n=10+9)
JSONEncode-96               65.0ms ± 0%    65.0ms ± 0%   -0.08%  (p=0.000 n=9+9)
JSONDecode-96                262ms ± 1%     255ms ± 0%   -2.56%  (p=0.000 n=10+10)
Mandelbrot200-96            18.0ms ± 0%    18.1ms ± 0%   +0.58%  (p=0.000 n=8+10)
GoParse-96                  14.0ms ± 0%    14.1ms ± 3%   +1.08%  (p=0.004 n=9+10)
RegexpMatchEasy0_32-96       644ns ± 2%     644ns ± 3%     ~     (p=0.513 n=10+10)
RegexpMatchEasy0_1K-96      3.70µs ± 0%    3.71µs ± 1%     ~     (p=0.464 n=10+10)
RegexpMatchEasy1_32-96       662ns ± 2%     663ns ± 3%     ~     (p=0.470 n=10+10)
RegexpMatchEasy1_1K-96      4.47µs ± 0%    4.46µs ± 1%     ~     (p=0.270 n=10+10)
RegexpMatchMedium_32-96      844ns ± 2%     855ns ± 3%     ~     (p=0.136 n=10+10)
RegexpMatchMedium_1K-96      179µs ± 0%     179µs ± 0%   -0.15%  (p=0.000 n=10+8)
RegexpMatchHard_32-96       10.0µs ± 0%    10.1µs ± 0%   +0.63%  (p=0.000 n=10+10)
RegexpMatchHard_1K-96        297µs ± 0%     298µs ± 0%   +0.17%  (p=0.000 n=10+8)
Revcomp-96                   3.08s ± 0%     3.09s ± 0%   +0.36%  (p=0.000 n=9+9)
Template-96                  276ms ± 2%     272ms ± 1%   -1.36%  (p=0.000 n=10+9)
TimeParse-96                1.37µs ± 0%    1.37µs ± 0%     ~     (p=0.487 n=10+10)
TimeFormat-96               1.40µs ± 0%    1.40µs ± 0%   -0.41%  (p=0.000 n=10+10)
[Geo mean]                   264µs          263µs        -0.22%

@TocarIP (Contributor) commented Aug 14, 2017

I was mainly worried about things like searching for the max in an array. When looking at the #16141 case, maxsd (which is the best case for a conditional move) was ~2x slower than a branch, because it introduced a data dependency.
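
For reference, the max-search shape meant here looks like the following (a hedged illustration, not code from #16141; assumes a non-empty slice). Compiled to a CMOV, the update to max becomes part of a loop-carried dependency chain that serializes iterations, whereas a well-predicted branch lets them overlap:

	// maxOf returns the largest element of xs (len(xs) > 0 assumed).
	func maxOf(xs []int64) int64 {
		max := xs[0]
		for _, x := range xs[1:] {
			if x > max {
				max = x // as a CMOV, feeds the next iteration's compare
			}
		}
		return max
	}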

@philhofer does your change work on code like this:

	if bit == 0xffff {
		nodeIndex = node.left
	} else {
		nodeIndex = node.right
	}

where left/right/nodeIndex are uint16?

@philhofer (Contributor, Author) commented Aug 14, 2017 via email

@gopherbot (Contributor)

Change https://golang.org/cl/55670 mentions this issue: cmd/compile/internal/ssa: emit csel on arm64

@navytux (Contributor) commented Feb 20, 2018

The patch above has been merged: 2d0172c.

bradfitz added this to the Unplanned milestone Feb 20, 2018
@rasky (Member) commented Mar 11, 2018

This has been implemented and merged; the CL was just missing the issue reference that would have closed it automatically.

rasky closed this as completed Mar 11, 2018
golang locked and limited conversation to collaborators Mar 11, 2019