
cmd/compile/ssa: conditional select instructions #21391

Closed

philhofer opened this issue Aug 10, 2017 · 15 comments

@philhofer (Contributor) commented Aug 10, 2017

Consider: (from https://go-review.googlesource.com/c/54656)

	if B > 15 {
		B = 15
	}

Right now, the best the amd64 backend can do for this basic block is to generate a conditional branch forward over the B = 15 statement. Ideally, though, we'd have it generate a CMOV instead.

EDIT (after some experimentation):

Here's my proposal:

Introduce a new family of generic SSA instructions. For now, let's say it's just CondSelect64, CondSelect32, CondSelect16, and CondSelect8, all with the form (CondSelectXX a b bool) and the semantics bool ? a : b. (It doesn't appear that FPU conditional moves are widely supported, so let's forget about those for now.)
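
Written as ordinary Go, the semantics of the proposed ops amount to the following (a sketch of the intended meaning, not compiler code):

	// condSelect64 mirrors the semantics of the proposed CondSelect64
	// op: bool ? a : b, evaluated without a branch in the lowered code.
	func condSelect64(cond bool, a, b int64) int64 {
		if cond {
			return a
		}
		return b
	}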

Then, we add an additional pass to the SSA backend to recognize CFGs with the following form:

bb0
 | \
 |  bb1
 | /
bb2

where bb1 is "trivial" (for some definition of trivial). We can combine all three basic blocks into a single basic block where bb2's Phi instructions are replaced with the appropriate conditional moves.

I wrote the pass to detect this particular CFG arrangement and limited it to the case in which bb1 has two or fewer instructions and all of those instructions have no side-effects. When building the toolchain and stdlib, I recorded 3755 such cases.
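
A minimal sketch of that shape test, using simplified stand-in types rather than the real cmd/compile/internal/ssa ones (an illustration of the check described above, not the actual pass):

	// Block and Value are simplified stand-ins for the SSA types.
	type Block struct {
		Kind   string // "If", "Plain", ...
		Succs  []*Block
		Preds  []*Block
		Values []*Value
	}

	type Value struct {
		HasSideEffects bool
	}

	// isTriangle reports whether bb0 heads the CFG triangle from the
	// diagram above, with a "trivial" bb1: at most two values, none
	// with side effects.
	func isTriangle(bb0 *Block) bool {
		if bb0.Kind != "If" || len(bb0.Succs) != 2 {
			return false
		}
		bb1, bb2 := bb0.Succs[0], bb0.Succs[1]
		if len(bb1.Succs) != 1 || bb1.Succs[0] != bb2 {
			bb1, bb2 = bb2, bb1 // try the other branch order
		}
		if len(bb1.Succs) != 1 || bb1.Succs[0] != bb2 ||
			len(bb1.Preds) != 1 || len(bb2.Preds) != 2 {
			return false
		}
		if len(bb1.Values) > 2 {
			return false
		}
		for _, v := range bb1.Values {
			if v.HasSideEffects {
				return false
			}
		}
		return true
	}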

On the architectures with which I am familiar (arm64, amd64, arm), the rewrite rules for the CondSelect instructions are trivial. However, I don't think I can implement the backend rewrite rules on all architectures (I don't have access to the hardware), so the pass that introduces those instructions will have to be gated to just the architectures for which the lowering rules are implemented.
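
The gating could be as simple as a per-arch capability check; a hedged sketch (the helper name and switch are illustrative, not real compiler code; it lists just the architectures named above):

	// archHasCondSelect reports whether the backend implements lowering
	// rules for the generic CondSelect ops. Hypothetical helper.
	func archHasCondSelect(arch string) bool {
		switch arch {
		case "amd64", "arm", "arm64":
			return true
		}
		return false
	}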

CC @josharian @randall77

@randall77 (Contributor)

It will be a bit tricky to represent in SSA. You will need additional args to any conditionally executed Value. You'll certainly need a flags value to direct the conditionality. And possibly also the old values of all the registers that get conditionally overwritten. So it isn't as simple as adding a few bits to a Value and reusing already existing opcodes.

Another option is to do it in SSA->Prog. We could detect & rewrite there more cleanly. It would mean duplicating the logic for each arch, though.

@philhofer (Contributor, Author)

I'm okay with doing it in SSA->Prog. I had some hope that maybe the logic for detecting these opportunities could be made portable, but I guess we'll see how realistic that is once I start writing the code. Perhaps it can work cleanly as a phase that runs after register allocation.

@philhofer (Contributor, Author)

Alternatively, we could do what LLVM does and introduce a "select" instruction in the portable SSA (we'd have to name it something different, I guess, since we already have one), and then lower it to the right conditional moves later.

@TocarIP (Contributor) commented Aug 11, 2017

Do we have a good way to find unpredictable branches?
For well-predicted (especially not-taken) branches, using a conditional mov may be a pessimization.

philhofer changed the title from "cmd/compile/ssa: predicated blocks" to "cmd/compile/ssa: conditional select instructions" on Aug 12, 2017
@philhofer (Contributor, Author)

I updated the issue description and issue title to reflect the work I've been doing on this. I have a better idea about how this will work now.

Re: performance, replacing a test-and-branch with a test-and-cmov is still two instructions on amd64. Why would that be slower, even when the branch is never taken? I'd expect both to be retired in one cycle in the ordinary case. Let's see what the benchmarks say.

@ianlancetaylor (Member)

Different instructions take different amounts of time. On some old Intel processors, the conditional move instruction is very slow. That is just an example; conditional moves are fast enough on modern processors. But I think the real win is when you can use them for unpredictable conditionals, to avoid branch mispredictions. Even today I don't think they are a win if the conditional branch is very predictable.

@philhofer (Contributor, Author)

> Different instructions take different amounts of time.

I agree!

> On some old Intel processors, the conditional move instruction is very slow. That is just an example; conditional moves are fast enough on modern processors.

Yes; I seem to remember a Torvalds rant about that re: Pentium III and IV. I assume it has been fixed by now. On an ARM64 chip, a CSEL costs exactly the same as a regular MOV, and the same as a perfectly predicted branch. The same is true for ARM (unexecuted predicated instructions burn one cycle). I think the old x86 hardware is the exception rather than the rule. We can always skip this optimization on x86 if it turns out this is still the case.

> But I think the real win is when you can use them for unpredictable conditionals, to avoid branch mispredictions. Even today I don't think they are a win if the conditional branch is very predictable.

I'm not entirely sure about this, for a couple of reasons:

  1. If the (un-taken) basic block (bb1, in the diagram up top) is laid out between bb0 and bb2, then not only is the jump taken (forwards!), but the effective code density is lower. If the compiler happens to lay things out such that bb1 is a forward branch and bb0 falls through to bb2, then you emit an extra unconditional jump at the bottom of bb1 to get back to bb2. Emitting a conditional move has the same effective code density as the best case.

  2. This is purely anecdotal, but the real-world Go programs that I've had to tune have had profiles that were flat and slow (i.e., 60%+ of the execution time was spent outside the top 5 function calls.) I think that's a natural state for many (bloated) programs after they've had obvious hot spots optimized. Those programs aren't going to get much help from dynamic branch prediction in the first place; branch caches aren't that big. But, code size reductions do help the program counter trudge through all that bloat a bit quicker.

Let's see what the benchmarks say once I've run them.
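
For concreteness, the kind of micro-benchmark that could exercise this pattern might look like the sketch below (names and numbers are illustrative; this is not code from any CL):

	package clamp_test

	import "testing"

	var sink int64

	// BenchmarkClamp runs the branch-vs-CMOV candidate from the issue
	// description over a mix of values above and below the threshold.
	func BenchmarkClamp(b *testing.B) {
		xs := make([]int64, 1024)
		for i := range xs {
			xs[i] = int64(i % 31)
		}
		for i := 0; i < b.N; i++ {
			var s int64
			for _, x := range xs {
				if x > 15 {
					x = 15 // the block a CMOV would replace
				}
				s += x
			}
			sink = s // keep the loop from being optimized away
		}
	}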

@josharian (Contributor)

> branch caches aren't that big

Also, the discussion in #18977 around branch cache collisions (search for "hash(IP)") makes me suspect that there is non-local, stochastic value in removing branches outside of hot code.

It's sometimes also instructive to see what gcc/llvm does with similar C code. Also, some conditional calculations can be translated into branchless code with arithmetic tricks, as in the sketch below; not sure whether those are common enough to be worth encoding into rules. cc @martisch on the last front.
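
For instance, the clamp from the top of this issue can be computed branchlessly with an arithmetic right shift (a sketch, assuming a signed 32-bit value and ignoring overflow at the extreme negative end):

	// clamp15 returns min(x, 15) without a branch: d>>31 is all ones
	// exactly when x < 15, so the mask selects d (giving back x) in
	// that case and 0 (giving 15) otherwise.
	func clamp15(x int32) int32 {
		d := x - 15
		return 15 + (d & (d >> 31))
	}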

@martisch (Contributor)

I think if we try this we should start only with simple cases like:

	if cond {
		B = 15
	}

and

	if cond {
		B = 15
	} else {
		B = 1
	}

since these do not have data dependencies.
Bool-to-int conversions of the form if cond { b = 1 } else { b = 0 } should be left to SETcc on amd64.

I am less worried that cmov itself is slow on modern amd64; I would instead be careful if the assignment to B is something that is not immediately available, since that introduces a stall waiting for the data which a correctly predicted branch would have avoided.

We could also survey what LLVM and gcc have opted to optimize with cmov to collect more information about the trade-offs.

> Re: performance, replacing a test-and-branch with a test-and-cmov is still two instructions on amd64.

Many test-branch pairs are macro-op fused into one internal op on modern amd64; see the Intel® 64 and IA-32 Architectures Optimization Reference Manual, §3.4.2.2, "Optimizing for Macro-fusion".

/cc @TocarIP

@philhofer (Contributor, Author) commented Aug 13, 2017

I have a draft of this change on arm64 here: https://github.com/philhofer/go/tree/arm64-csel

Gzip gets 10% faster (!); fannkuch also gets a little bump. Not much else does.

name                      old time/op    new time/op    delta
BinaryTree17-96              15.9s ± 6%     16.4s ± 9%     ~     (p=0.075 n=10+10)
Fannkuch11-96                7.17s ± 0%     7.05s ± 0%   -1.74%  (p=0.000 n=8+9)
FmtFprintfEmpty-96           208ns ± 1%     208ns ± 0%     ~     (p=0.092 n=10+9)
FmtFprintfString-96          379ns ± 0%     378ns ± 0%     ~     (p=0.470 n=10+10)
FmtFprintfInt-96             385ns ± 0%     388ns ± 0%   +0.78%  (p=0.000 n=9+10)
FmtFprintfIntInt-96          591ns ± 0%     592ns ± 0%   +0.22%  (p=0.000 n=7+10)
FmtFprintfPrefixedInt-96     656ns ± 0%     668ns ± 0%   +1.78%  (p=0.000 n=10+10)
FmtFprintfFloat-96           967ns ± 0%     988ns ± 0%   +2.23%  (p=0.000 n=10+10)
FmtManyArgs-96              2.35µs ± 0%    2.33µs ± 0%   -0.87%  (p=0.000 n=9+10)
GobDecode-96                31.0ms ± 0%    30.5ms ± 2%   -1.54%  (p=0.003 n=9+9)
GobEncode-96                24.4ms ± 0%    24.4ms ± 0%   +0.13%  (p=0.000 n=9+9)
Gzip-96                      1.60s ± 0%     1.43s ± 0%  -10.43%  (p=0.000 n=9+10)
Gunzip-96                    167ms ± 0%     168ms ± 0%   +0.73%  (p=0.000 n=8+9)
HTTPClientServer-96          311µs ± 1%     312µs ± 0%   +0.60%  (p=0.000 n=10+9)
JSONEncode-96               65.0ms ± 0%    65.0ms ± 0%   -0.08%  (p=0.000 n=9+9)
JSONDecode-96                262ms ± 1%     255ms ± 0%   -2.56%  (p=0.000 n=10+10)
Mandelbrot200-96            18.0ms ± 0%    18.1ms ± 0%   +0.58%  (p=0.000 n=8+10)
GoParse-96                  14.0ms ± 0%    14.1ms ± 3%   +1.08%  (p=0.004 n=9+10)
RegexpMatchEasy0_32-96       644ns ± 2%     644ns ± 3%     ~     (p=0.513 n=10+10)
RegexpMatchEasy0_1K-96      3.70µs ± 0%    3.71µs ± 1%     ~     (p=0.464 n=10+10)
RegexpMatchEasy1_32-96       662ns ± 2%     663ns ± 3%     ~     (p=0.470 n=10+10)
RegexpMatchEasy1_1K-96      4.47µs ± 0%    4.46µs ± 1%     ~     (p=0.270 n=10+10)
RegexpMatchMedium_32-96      844ns ± 2%     855ns ± 3%     ~     (p=0.136 n=10+10)
RegexpMatchMedium_1K-96      179µs ± 0%     179µs ± 0%   -0.15%  (p=0.000 n=10+8)
RegexpMatchHard_32-96       10.0µs ± 0%    10.1µs ± 0%   +0.63%  (p=0.000 n=10+10)
RegexpMatchHard_1K-96        297µs ± 0%     298µs ± 0%   +0.17%  (p=0.000 n=10+8)
Revcomp-96                   3.08s ± 0%     3.09s ± 0%   +0.36%  (p=0.000 n=9+9)
Template-96                  276ms ± 2%     272ms ± 1%   -1.36%  (p=0.000 n=10+9)
TimeParse-96                1.37µs ± 0%    1.37µs ± 0%     ~     (p=0.487 n=10+10)
TimeFormat-96               1.40µs ± 0%    1.40µs ± 0%   -0.41%  (p=0.000 n=10+10)
[Geo mean]                   264µs          263µs        -0.22%

@TocarIP (Contributor) commented Aug 14, 2017

I was mainly worried about things like searching for the max in an array. When looking at the #16141 case, maxsd (which is the best case for a conditional move) was ~2x slower than a branch, because it introduced a data dependency.
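
For reference, the max-search shape meant here looks like the following (a hedged illustration, not code from #16141; assumes a non-empty slice). Compiled to a CMOV, the update to max becomes part of a loop-carried dependency chain that serializes iterations, whereas a well-predicted branch lets them overlap:

	// maxOf returns the largest element of xs (len(xs) > 0 assumed).
	func maxOf(xs []int64) int64 {
		max := xs[0]
		for _, x := range xs[1:] {
			if x > max {
				max = x // as a CMOV, feeds the next iteration's compare
			}
		}
		return max
	}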

@philhofer does your change work on code like this:

	if bit == 0xffff {
		nodeIndex = node.left
	} else {
		nodeIndex = node.right
	}

where left/right/nodeIndex are uint16?

@philhofer (Contributor, Author) commented Aug 14, 2017 via email

@gopherbot (Contributor)

Change https://golang.org/cl/55670 mentions this issue: cmd/compile/internal/ssa: emit csel on arm64

@navytux (Contributor) commented Feb 20, 2018

The patch above has been merged: 2d0172c.

bradfitz added this to the Unplanned milestone Feb 20, 2018
@rasky (Member) commented Mar 11, 2018

This has been implemented and merged; the CL was just missing the issue reference that would have closed it automatically.

rasky closed this as completed Mar 11, 2018
golang locked and limited conversation to collaborators Mar 11, 2019