Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

Closed
rasky opened this issue Feb 5, 2017 · 23 comments
Closed

cmd/compile: generate BT (bit-test) instruction on 386/amd64 #18943

rasky opened this issue Feb 5, 2017 · 23 comments
Labels
FrozenDueToAge Performance Suggested Issues that may be good for new contributors looking for work to do.
Milestone

Comments

@rasky
Copy link
Member

rasky commented Feb 5, 2017

The x86 instruction set has a BT (bit test) opcode, which is the moral equivalent of A & (1<<B), with no write back and only carry flag affected.

LLVM uses it when it detects (at least) one of the following patterns:

    if (a & (1<<b) != 0) 
    if (~a & (1<<b) != 0) 
    if ((a >> b) & 1 != 0) 
    if ((~a >> b) & 1 != 0) 
    if (~(a >> b) & 1 != 0) 

(note that the bitwise not is not emitted, it's used to reverse the conditional branch)

Currently, Go generates the whole shift sequence, which is much longer and wastes two registers (the intermediate shift result, and ECX as shift count -- the latter because regalloc always emit a MOV to populate ECX when needed for shifting).

@rasky
Copy link
Member Author

rasky commented Feb 5, 2017

This is useful also for the constant case (that is, when b is constant).

For instance, time on tip has a hasMonotonic constant which is 1<<63. Currently, checks like:

    if t.wall&hasMonotonic {

generate this:

	time.go:166	0x40877f5	488b08			MOVQ 0(AX), CX
	time.go:166	0x40877f8	4889ca			MOVQ CX, DX
	time.go:166	0x40877fb	48c1e93f		SHRQ $0x3f, CX
	time.go:166	0x40877ff	48c1e13f		SHLQ $0x3f, CX
	time.go:166	0x4087803	48c7c3ffffffff		MOVQ $-0x1, BX
	time.go:166	0x408780a	4885cb			TESTQ CX, BX
	time.go:166	0x408780d	746b			JE 0x408787a

while this could actually be a single BTQ $0x3f, CX, and no registers wasted.

(In fact, testing against highest bit could also be done with CMP and the sign flag, but I'm not sure it makes sense to special-case it).

@randall77 randall77 added this to the Go1.9 milestone Feb 5, 2017
@josharian
Copy link
Contributor

I started playing with this, figuring it'd be an easy rule to add. (Haha.)

The natural place to put this is AMD64.rules. However, the rule ends up being really contorted, due to ordering problems--despite my best efforts, the shift gets lowered first, which means the rule needs to match the fully expanded shift pattern. And this will be fragile--it'll be broken by any changes to how shifts get lowered (which might well be the cause of #18946).

I see three mediocre options.

(1) Add a generic OpBitTest, and write a generic matching rule. Let each architecture lower BitTest however appropriate, possibly just expanding it back out to shifts and compares. Upsides: Unified, simple. Downsides: Extra work for architectures without bit test instructions, might still need arch-specific rules as well to catch opportunities introduced after generic rewrites.

(2) Split the arch-specific optimization and lowering into multiple rewrite passes. This could then be done in an early arch pass. Upsides: More targeted, no extra generic ops. Downsides: Having multiple passes seems complicated (how are they specified? do we run each pass once or cycle through them to a fixed point?).

(3) Add some kind of complexity ordering to rulegen, such that more specific rules always match before more general ones. Upsides: Magic. Downsides: Magic, generality is not obviously well-defined, might be harder to implement and to understand, might make rewrite pass slower.

Input requested, both specific to this optimization and more generally on the design question.

Somewhat related: #16270 and #15300.

@josharian
Copy link
Contributor

(4) Force through the contorted rule and rely on the test to detect breakage.

I'm leaning towards (1), guarded by a condition checking arch support for bittest.

@randall77
Copy link
Contributor

randall77 commented Feb 6, 2017 via email

@josharian
Copy link
Contributor

That same wrench hits the bit test rules also, so that's unsurprising in retrospect.

...and I just found out what you meant by this, the hard way. :) Bummer. I don't think BTQ buys us anything here in the general case after all.

Fallback plan: Implement only for constant b. Shouldn't be too hard, since I have the general case (mis-)implemented.

@rasky
Copy link
Member Author

rasky commented Feb 6, 2017

Even in the non-constant case, there needs to be a way to special-case when the compiler knows that B is < 63. This is exactly the same special-case that is required for other bit-level operations like rotate (#18940), and in this case BTQ helps a lot. So I wouldn't throw the code away.

I'm not qualified to comment on the options you proposed, but I'll notice that many platforms have bit-level operations, or can do bit operations more efficiently than generic AND/OR operations. For instance, ARM32 can materialise single bit masks (while it cannot materialise generic 32-bit words), though maybe the ARM backend has already enough information to do that. I'll check this later today (wanna see the code generated by time.go on ARM).

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/36329 mentions this issue.

@josharian
Copy link
Contributor

CL coming soon.

It handles:

  • amd64
  • a & (1<<b) != 0
  • bounded non-constants as well as constants (I tested, and TESTQ $small-power-of-two, R and BTQ $small, R are identical in speed, but BTQ is always the same length or shorter)
  • bool value generation as well as branches

It doesn't handle:

  • Other architectures
  • Using a memory arg; all args are loaded into registers.
  • Other patterns mentioned in the initial request

I leave the decision about other architectures up to @cherrymui and others. Memory args is a general problem.

The other patterns would make a good project for someone getting up to speed on SSA rules, now that the trail has been broken. Marking this as Suggested.

@josharian josharian added the Suggested Issues that may be good for new contributors looking for work to do. label Feb 6, 2017
josharian added a commit to josharian/go that referenced this issue Feb 14, 2017
Updates golang#18943

Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08
gopherbot pushed a commit that referenced this issue Mar 1, 2017
Updates #18943

Change-Id: If3080d6133bb6d2710b57294da24c90251ab4e08
Reviewed-on: https://go-review.googlesource.com/36329
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@navytux
Copy link
Contributor

navytux commented Mar 15, 2017

FYI commit 2183135 (cmd/compile: recognize bit test patterns on amd64) caused a regression: #19555.

@josharian josharian modified the milestones: Unplanned, Go1.9 May 6, 2017
@cespare
Copy link
Contributor

cespare commented Jul 17, 2017

I have this partially working (i.e., handling the other cases mentioned in the original report, amd64 only).

Might be a while before I send the CL since I'm going on vacation soon.

@rasky
Copy link
Member Author

rasky commented Oct 15, 2017

@cespare still planning on sending the CL?

@cespare
Copy link
Contributor

cespare commented Oct 31, 2017

Sorry, I missed a few weeks of notifications due to gmail misconfiguration.

I had forgotten about this. Thanks for reminding me. I'll get back to it shortly, but not before the freeze (tomorrow). If this is important to anyone, please feel free to implement it before I send a CL -- I'm mostly doing it to learn about SSA anyway.

@rasky
Copy link
Member Author

rasky commented Feb 9, 2018

@cespare any news on this? Do you have some code somewhere that we can work on?

@cespare
Copy link
Contributor

cespare commented Feb 10, 2018

I haven't made time to work on this again. I have a partially working version (somewhere, can't find the branch right now) but it wasn't anything special, just a few fairly obvious rules.

Please feel free to pick this up and send a CL.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/94766 mentions this issue: cmd/compile: add patterns for bit set/clear/complement on amd64

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/94764 mentions this issue: cmd/compile: fold bit masking on bits that have been shifted away.

gopherbot pushed a commit that referenced this issue Feb 27, 2018
Spotted while working on #18943, it triggers once during bootstrap.

Change-Id: Ia4330ccc6395627c233a8eb4dcc0e3e2a770bea7
Reviewed-on: https://go-review.googlesource.com/94764
Reviewed-by: Keith Randall <khr@golang.org>
@josharian
Copy link
Contributor

Is it worth porting these rules to 386, since it should be entirely straightforward?

@rasky
Copy link
Member Author

rasky commented Mar 24, 2018

Possibly, but I personally never use the 386 backend.

@robfig
Copy link
Contributor

robfig commented Mar 24, 2018

Pardon the uninformed drive-by question.. In the CR description, it says "Go1 benchmarks seem unaffected, and I would be surprised otherwise".

I would have expected this sort of change to be done to improve performance, but that statement indicates both that it does not improve performance and that an improvement was not expected. What is the benefit of these changes and how are they being prioritized if there's no measurable effect on the end program? I've seen a number of similar changes, and I was really curious about it. Thanks!

@rasky
Copy link
Member Author

rasky commented Mar 24, 2018

It doesn't improve Go1 because Go1 doesn't contain any hot-loop related to bit manipulation. It does make the code more compact, make it use less registers (so improve register pressure), and improve performance of specific code that does lots of bit manipulation. Basically this is in a "long tail" of things that do slight improvements, and eventually all small benefits add up.

@randall77
Copy link
Contributor

Keep in mind the "Go1 benchmarks seem unaffected" and "performance seem unaffected" are two very different statements. We don't want to let deficiencies in our benchmarks (of which there are many) impede progress.

That said, for performance improvement CLs it is always nice to at least have a microbenchmark that demonstrates improvement. Measurements of code size improvements are also useful.

For small changes with no obvious downsides (like this one), sometimes we Just Do It™®©.

@robfig
Copy link
Contributor

robfig commented Mar 26, 2018

Makes sense, thank you for the explanation! I would have expected the crypto benchmarks to show something, but this makes sense.

@josharian
Copy link
Contributor

Possibly, but I personally never use the 386 backend.

Yeah. I only use it when I need to test something on a 32 bit platform easily. I wonder whether anyone cares much about it (performance-wise).

@golang golang locked and limited conversation to collaborators Mar 28, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge Performance Suggested Issues that may be good for new contributors looking for work to do.
Projects
None yet
Development

No branches or pull requests

8 participants