cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

rasky · 2017-02-05T10:19:19Z

The x86 instruction set has a BT (bit test) opcode, which is the moral equivalent of A & (1<<B), with no write back and only carry flag affected.

LLVM uses it when it detects (at least) one of the following patterns:

    if (a & (1<<b) != 0) 
    if (~a & (1<<b) != 0) 
    if ((a >> b) & 1 != 0) 
    if ((~a >> b) & 1 != 0) 
    if (~(a >> b) & 1 != 0)

(note that the bitwise not is not emitted, it's used to reverse the conditional branch)

Currently, Go generates the whole shift sequence, which is much longer and wastes two registers (the intermediate shift result, and ECX as shift count -- the latter because regalloc always emit a MOV to populate ECX when needed for shifting).

The text was updated successfully, but these errors were encountered:

rasky · 2017-02-05T13:40:52Z

This is useful also for the constant case (that is, when b is constant).

For instance, time on tip has a hasMonotonic constant which is 1<<63. Currently, checks like:

    if t.wall&hasMonotonic {

generate this:

	time.go:166	0x40877f5	488b08			MOVQ 0(AX), CX
	time.go:166	0x40877f8	4889ca			MOVQ CX, DX
	time.go:166	0x40877fb	48c1e93f		SHRQ $0x3f, CX
	time.go:166	0x40877ff	48c1e13f		SHLQ $0x3f, CX
	time.go:166	0x4087803	48c7c3ffffffff		MOVQ $-0x1, BX
	time.go:166	0x408780a	4885cb			TESTQ CX, BX
	time.go:166	0x408780d	746b			JE 0x408787a

while this could actually be a single BTQ $0x3f, CX, and no registers wasted.

(In fact, testing against highest bit could also be done with CMP and the sign flag, but I'm not sure it makes sense to special-case it).

josharian · 2017-02-06T00:14:53Z

I started playing with this, figuring it'd be an easy rule to add. (Haha.)

The natural place to put this is AMD64.rules. However, the rule ends up being really contorted, due to ordering problems--despite my best efforts, the shift gets lowered first, which means the rule needs to match the fully expanded shift pattern. And this will be fragile--it'll be broken by any changes to how shifts get lowered (which might well be the cause of #18946).

I see three mediocre options.

(1) Add a generic OpBitTest, and write a generic matching rule. Let each architecture lower BitTest however appropriate, possibly just expanding it back out to shifts and compares. Upsides: Unified, simple. Downsides: Extra work for architectures without bit test instructions, might still need arch-specific rules as well to catch opportunities introduced after generic rewrites.

(2) Split the arch-specific optimization and lowering into multiple rewrite passes. This could then be done in an early arch pass. Upsides: More targeted, no extra generic ops. Downsides: Having multiple passes seems complicated (how are they specified? do we run each pass once or cycle through them to a fixed point?).

(3) Add some kind of complexity ordering to rulegen, such that more specific rules always match before more general ones. Upsides: Magic. Downsides: Magic, generality is not obviously well-defined, might be harder to implement and to understand, might make rewrite pass slower.

Input requested, both specific to this optimization and more generally on the design question.

Somewhat related: #16270 and #15300.

josharian · 2017-02-06T01:51:56Z

(4) Force through the contorted rule and rely on the test to detect breakage.

I'm leaning towards (1), guarded by a condition checking arch support for bittest.

randall77 · 2017-02-06T04:11:26Z

Yes, the logic needed for >63 bit shifts throws a wrench into the rewrite rules. That same wrench hits the bit test rules also, so that's unsurprising in retrospect. I don't want to do 2 or 3, that's a lot of generic complication to solve this particular problem (unless there's lots of other cases lurking like this). Either 1 or 4 is ok with me.

…

On Sun, Feb 5, 2017 at 5:52 PM, Josh Bleecher Snyder < ***@***.***> wrote: (4) Force through the contorted rule and rely on the test to detect breakage. I'm leaning towards (1), guarded by a condition checking arch support for bittest. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#18943 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AGkgIHLNU3H4M_naMSq7hGGbkNjZHIfSks5rZnzTgaJpZM4L3dEU> .

josharian · 2017-02-06T07:12:32Z

That same wrench hits the bit test rules also, so that's unsurprising in retrospect.

...and I just found out what you meant by this, the hard way. :) Bummer. I don't think BTQ buys us anything here in the general case after all.

Fallback plan: Implement only for constant b. Shouldn't be too hard, since I have the general case (mis-)implemented.

rasky · 2017-02-06T11:04:39Z

Even in the non-constant case, there needs to be a way to special-case when the compiler knows that B is < 63. This is exactly the same special-case that is required for other bit-level operations like rotate (#18940), and in this case BTQ helps a lot. So I wouldn't throw the code away.

I'm not qualified to comment on the options you proposed, but I'll notice that many platforms have bit-level operations, or can do bit operations more efficiently than generic AND/OR operations. For instance, ARM32 can materialise single bit masks (while it cannot materialise generic 32-bit words), though maybe the ARM backend has already enough information to do that. I'll check this later today (wanna see the code generated by time.go on ARM).

gopherbot · 2017-02-06T19:50:22Z

CL https://golang.org/cl/36329 mentions this issue.

josharian · 2017-02-06T19:51:55Z

CL coming soon.

It handles:

amd64
a & (1<<b) != 0
bounded non-constants as well as constants (I tested, and TESTQ $small-power-of-two, R and BTQ $small, R are identical in speed, but BTQ is always the same length or shorter)
bool value generation as well as branches

It doesn't handle:

Other architectures
Using a memory arg; all args are loaded into registers.
Other patterns mentioned in the initial request

I leave the decision about other architectures up to @cherrymui and others. Memory args is a general problem.

The other patterns would make a good project for someone getting up to speed on SSA rules, now that the trail has been broken. Marking this as Suggested.

Updates golang#18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08

Updates #18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08 Reviewed-on: https://go-review.googlesource.com/36329 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

navytux · 2017-03-15T11:23:32Z

FYI commit 2183135 (cmd/compile: recognize bit test patterns on amd64) caused a regression: #19555.

cespare · 2017-07-17T17:22:15Z

I have this partially working (i.e., handling the other cases mentioned in the original report, amd64 only).

Might be a while before I send the CL since I'm going on vacation soon.

rasky · 2017-10-15T22:40:33Z

@cespare still planning on sending the CL?

cespare · 2017-10-31T20:21:19Z

Sorry, I missed a few weeks of notifications due to gmail misconfiguration.

I had forgotten about this. Thanks for reminding me. I'll get back to it shortly, but not before the freeze (tomorrow). If this is important to anyone, please feel free to implement it before I send a CL -- I'm mostly doing it to learn about SSA anyway.

rasky · 2018-02-09T16:58:46Z

@cespare any news on this? Do you have some code somewhere that we can work on?

cespare · 2018-02-10T21:11:37Z

I haven't made time to work on this again. I have a partially working version (somewhere, can't find the branch right now) but it wasn't anything special, just a few fairly obvious rules.

Please feel free to pick this up and send a CL.

gopherbot · 2018-02-18T19:05:39Z

Change https://golang.org/cl/94766 mentions this issue: cmd/compile: add patterns for bit set/clear/complement on amd64

gopherbot · 2018-02-18T20:47:24Z

Change https://golang.org/cl/94764 mentions this issue: cmd/compile: fold bit masking on bits that have been shifted away.

Spotted while working on #18943, it triggers once during bootstrap. Change-Id: Ia4330ccc6395627c233a8eb4dcc0e3e2a770bea7 Reviewed-on: https://go-review.googlesource.com/94764 Reviewed-by: Keith Randall <khr@golang.org>

josharian · 2018-03-24T02:43:23Z

Is it worth porting these rules to 386, since it should be entirely straightforward?

rasky · 2018-03-24T13:29:34Z

Possibly, but I personally never use the 386 backend.

robfig · 2018-03-24T14:30:50Z

Pardon the uninformed drive-by question.. In the CR description, it says "Go1 benchmarks seem unaffected, and I would be surprised otherwise".

I would have expected this sort of change to be done to improve performance, but that statement indicates both that it does not improve performance and that an improvement was not expected. What is the benefit of these changes and how are they being prioritized if there's no measurable effect on the end program? I've seen a number of similar changes, and I was really curious about it. Thanks!

rasky · 2018-03-24T16:29:18Z

It doesn't improve Go1 because Go1 doesn't contain any hot-loop related to bit manipulation. It does make the code more compact, make it use less registers (so improve register pressure), and improve performance of specific code that does lots of bit manipulation. Basically this is in a "long tail" of things that do slight improvements, and eventually all small benefits add up.

randall77 · 2018-03-25T21:46:40Z

Keep in mind the "Go1 benchmarks seem unaffected" and "performance seem unaffected" are two very different statements. We don't want to let deficiencies in our benchmarks (of which there are many) impede progress.

That said, for performance improvement CLs it is always nice to at least have a microbenchmark that demonstrates improvement. Measurements of code size improvements are also useful.

robfig · 2018-03-26T11:40:18Z

Makes sense, thank you for the explanation! I would have expected the crypto benchmarks to show something, but this makes sense.

josharian · 2018-03-28T18:42:09Z

Possibly, but I personally never use the 386 backend.

Yeah. I only use it when I need to test something on a 32 bit platform easily. I wonder whether anyone cares much about it (performance-wise).

ALTree added the Performance label Feb 5, 2017

randall77 added this to the Go1.9 milestone Feb 5, 2017

josharian added the Suggested Issues that may be good for new contributors looking for work to do. label Feb 6, 2017

josharian added a commit to josharian/go that referenced this issue Feb 14, 2017

cmd/compile: recognize bit test patterns on amd64

78382a3

Updates golang#18943 Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08

rasky mentioned this issue Feb 16, 2017

cmd/compile: complicated bounds check elimination #16092

Open

josharian modified the milestones: Unplanned, Go1.9 May 6, 2017

gopherbot closed this as completed in 7911270 Mar 24, 2018

golang locked and limited conversation to collaborators Mar 28, 2019

gopherbot added the FrozenDueToAge label Mar 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

rasky commented Feb 5, 2017 •

edited

Loading

rasky commented Feb 5, 2017

josharian commented Feb 6, 2017

josharian commented Feb 6, 2017

randall77 commented Feb 6, 2017 via email

josharian commented Feb 6, 2017

rasky commented Feb 6, 2017

gopherbot commented Feb 6, 2017

josharian commented Feb 6, 2017

navytux commented Mar 15, 2017

cespare commented Jul 17, 2017

rasky commented Oct 15, 2017

cespare commented Oct 31, 2017

rasky commented Feb 9, 2018

cespare commented Feb 10, 2018

gopherbot commented Feb 18, 2018

gopherbot commented Feb 18, 2018

josharian commented Mar 24, 2018

rasky commented Mar 24, 2018

robfig commented Mar 24, 2018

rasky commented Mar 24, 2018

randall77 commented Mar 25, 2018

robfig commented Mar 26, 2018

josharian commented Mar 28, 2018

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

Comments

rasky commented Feb 5, 2017 • edited Loading

rasky commented Feb 5, 2017

josharian commented Feb 6, 2017

josharian commented Feb 6, 2017

randall77 commented Feb 6, 2017 via email

josharian commented Feb 6, 2017

rasky commented Feb 6, 2017

gopherbot commented Feb 6, 2017

josharian commented Feb 6, 2017

navytux commented Mar 15, 2017

cespare commented Jul 17, 2017

rasky commented Oct 15, 2017

cespare commented Oct 31, 2017

rasky commented Feb 9, 2018

cespare commented Feb 10, 2018

gopherbot commented Feb 18, 2018

gopherbot commented Feb 18, 2018

josharian commented Mar 24, 2018

rasky commented Mar 24, 2018

robfig commented Mar 24, 2018

rasky commented Mar 24, 2018

randall77 commented Mar 25, 2018

robfig commented Mar 26, 2018

josharian commented Mar 28, 2018

rasky commented Feb 5, 2017 •

edited

Loading