cmd/compile/ssa: conditional select instructions #21391
It will be a bit tricky to represent in SSA. You will need additional args to any conditionally executed Value. You'll certainly need a flags value to direct the conditionality, and possibly also the old values of all the registers that get conditionally overwritten. So it isn't as simple as adding a few bits to a Value and reusing already existing opcodes. Another option is to do it in SSA->Prog. We could detect & rewrite there more cleanly. It would mean duplicating the logic for each arch, though.
I'm okay with doing it in SSA->Prog. I had some hope that maybe the logic for detecting these opportunities could be made portable, but I guess we'll see how realistic that is once I start writing the code. Perhaps it can work cleanly as a phase that runs after register allocation.
Alternatively, we do what LLVM does and introduce a "select" instruction in the portable SSA (we'd have to name it something different, I guess, since we already have one), and then lower that to the right conditional moves later.
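For readers unfamiliar with LLVM's select, its semantics can be sketched in plain Go. This is a hypothetical illustration, not compiler code: the real instruction operates on SSA values, but the value-level behavior is just "pick a or b based on a flag".

```go
package main

import "fmt"

// condSelect mirrors the value semantics of a select/conditional-move:
// both inputs already exist, and the flag picks one. In the compiler this
// would be lowered to CMOV (amd64) or CSEL (arm64) rather than a branch.
func condSelect(flag bool, a, b int64) int64 {
	if flag {
		return a
	}
	return b
}

func main() {
	fmt.Println(condSelect(true, 1, 2))  // 1
	fmt.Println(condSelect(false, 1, 2)) // 2
}
```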
Do we have a good way to find unpredictable branches?
I updated the issue description and issue title to reflect the work I've been doing on this. I have a better idea about how this will work now. Re: performance, replacing a test-and-branch with a test-and-cmov is still two instructions on amd64. Why would that be slower, even when the branch is never taken? I'd expect both to be retired in one cycle in the ordinary case. Let's see what the benchmarks say.
Different instructions take different amounts of time. On some old Intel processors, the conditional move instruction is very slow. That is just an example; conditional moves are fast enough on modern processors. But I think the real win is when you can use them for unpredictable conditionals, to avoid branch mispredictions. Even today I don't think they are a win if the conditional branch is very predictable.
I agree!
Yes; I seem to remember a Torvalds rant about that re: Pentium III and IV. I assume that it's fixed by now. On an ARM64 chip, a CSEL costs exactly the same amount as a regular MOV, and the same amount as a perfectly-predicted branch. The same is true for ARM (unexecuted predicated instructions burn one cycle). I think the old x86 hardware is the exception rather than the rule. We can always not do this optimization for x86 if it turns out this is still the case.
I'm not entirely sure about that, for a couple of reasons:
Let's see what the benchmarks say once I've run them.
Also, the discussion in #18977 around branch cache collisions (search for "hash(IP)") makes me suspect that there is non-local, stochastic value in removing branches outside of hot code. It's sometimes also instructive to see what gcc/llvm does with similar C code. Also, some conditional calculations can be translated into branchless code with arithmetic tricks; not sure whether those are common enough to be worth encoding into rules. cc @martisch on the last front.
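As a concrete (hypothetical) instance of the arithmetic tricks mentioned here, a minimum of two integers can be computed branchlessly by broadcasting the sign of a difference into a mask. The caveat in the comment is exactly why such rewrite rules would need care:

```go
package main

import "fmt"

// branchlessMin computes min(a, b) without a branch: the sign bit of
// (a - b), arithmetically shifted to fill the word, selects between the
// two values. Note: a-b can overflow for arbitrary inputs, so a real
// compiler rule would need a guard or a range check.
func branchlessMin(a, b int64) int64 {
	diff := a - b
	mask := diff >> 63 // all ones if a < b, else all zeros
	return b + (diff & mask)
}

func main() {
	fmt.Println(branchlessMin(3, 7), branchlessMin(7, 3)) // 3 3
}
```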
I think if we try this we should start only with simple cases like:
and
since these do not have data dependencies. I am less worried about cmov itself being slow on modern amd64; instead I would be careful if the assignment to B is something that is not immediately available, since that introduces a stall waiting for data which would have been avoided by a correctly predicted branch. We could also survey what LLVM and gcc have opted to optimize with cmov to collect more information about trade-offs.
Many test-branch pairs are macro-op fused to one internal op on modern amd64. /cc @TocarIP
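A hedged sketch of the data-dependency-free shapes this comment describes (the original snippets did not survive; function and variable names here are hypothetical): a constant, or a value already computed, conditionally assigned.

```go
package main

import "fmt"

// clampFlag conditionally assigns a constant: the cmov source is
// immediately available, so converting the branch introduces no stall.
func clampFlag(x int) int {
	v := 0
	if x > 10 {
		v = 1
	}
	return v
}

// pickReady conditionally assigns a value that is already in hand:
// again, no new data dependency is introduced by a cmov.
func pickReady(cond bool, a, b int) int {
	r := a
	if cond {
		r = b
	}
	return r
}

func main() {
	fmt.Println(clampFlag(42), pickReady(true, 1, 2)) // 1 2
}
```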
I have a draft of this change on arm64 here: https://github.com/philhofer/go/tree/arm64-csel. Gzip gets 10% faster (!); fannkuch also gets a little bump. Not much else does.
I was mainly worried about things like searching for the max in an array. When looking at the #16141 case, maxsd (which is the best case for conditional move) was ~2x slower than a branch, because it introduced a data dependency. @philhofer does your change work on code like this:
if bit == 0xffff {
	nodeIndex = node.left
} else {
	nodeIndex = node.right
}
Where left/right/nodeIndex are uint16?
Not yet. Right now only 'if' blocks are collapsed, but it should be simple to recognize and collapse if+else, too.
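To make the data-dependency concern concrete, here is a hypothetical sketch contrasting the branchy shape from the question with what a select-style lowering effectively computes: with a branch, the CPU can speculate past one load; a conditional select must wait for both inputs. The node type and masking are illustrative, not compiler output.

```go
package main

import "fmt"

// node mirrors the snippet in the question: two uint16 child indices.
type node struct {
	left, right uint16
}

// branchy is the if/else shape: only the taken side's load is needed.
func branchy(n node, bit uint16) uint16 {
	var nodeIndex uint16
	if bit == 0xffff {
		nodeIndex = n.left
	} else {
		nodeIndex = n.right
	}
	return nodeIndex
}

// branchless is roughly what a select lowering computes: both fields are
// read, then one is picked via a mask. The result depends on both loads,
// which is the stall the comment above warns about. (A real lowering
// would materialize the mask from flags without this source-level if.)
func branchless(n node, bit uint16) uint16 {
	var mask uint16
	if bit == 0xffff {
		mask = 0xffff
	}
	return (n.left & mask) | (n.right &^ mask)
}

func main() {
	n := node{left: 3, right: 5}
	fmt.Println(branchy(n, 0xffff), branchless(n, 0)) // 3 5
}
```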
Change https://golang.org/cl/55670 mentions this issue:
Above patch merged: 2d0172c.
This has been implemented and merged; the CL was missing the issue reference that would have closed it.
Consider: (from https://go-review.googlesource.com/c/54656)
Right now, the best the amd64 backend could do for this basic block is to generate a conditional branch forward over the B = 15 statement. Ideally, though, we'd have it generate a CMOV instead.
EDIT (after some experimentation):
Here's my proposal:
Introduce a new family of generic SSA instructions. For now, let's say it's just CondSelect64, CondSelect32, CondSelect16, and CondSelect8, all with the form (CondSelectXX a b bool) and the semantics bool ? a : b. (It doesn't appear FPU conditional moves are widely supported, so let's forget about that for now.)
Then, we add an additional pass to the SSA backend to recognize CFGs with the following form:
where bb1 is "trivial" (for some definition of trivial). We can combine all three basic blocks into a single basic block where bb2's Phi instructions are replaced with the appropriate conditional moves.
I wrote the pass to detect this particular CFG arrangement and limited it to the case in which bb1 has two or fewer instructions and all of those instructions have no side-effects. When building the toolchain and stdlib, I recorded 3755 such cases.
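As an illustrative sketch (not taken from the CL itself), the diamond-shaped CFG described here corresponds to Go source like the following, using the B = 15 shape from earlier in the description; the comments map source lines to the hypothetical basic blocks.

```go
package main

import "fmt"

// diamond shows the CFG the pass targets: bb0 conditionally branches
// over bb1 (a trivial, side-effect-free block) into bb2, where a Phi
// merges the two values of b. After the rewrite, the Phi would become
// something like: b = CondSelect64 (15) (7) (a > 0).
func diamond(a int) int {
	b := 7
	if a > 0 { // bb0: conditional branch
		b = 15 // bb1: trivial block, no side effects
	}
	return b // bb2: Phi(b) -> conditional move
}

func main() {
	fmt.Println(diamond(1), diamond(-1)) // 15 7
}
```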
On the architectures with which I am familiar (arm64, amd64, arm), the rewrite rules for the CondSelect instruction are trivial. However, I don't think I can implement the backend rewrite rules on all architectures (I don't have access to the hardware), so the pass that introduces those instructions will have to be gated to just those architectures for which the lowering rules are implemented.
CC @josharian @randall77