cmd/compile: OR into memory is cheaper than MOV/BTSL/MOV on x86 #61694
We use
We could convert back to
I believe saving 2 bytes is not worth the reduced throughput.
Could you post some full benchmark code? My simple benchmark isn't showing any difference between the two.
Enabling or disabling those rules I quoted (which have the effect of switching from
@randall77 Your benchmark does not work because it has a carried dependency, i.e. the operation in each iteration depends on the result of the previous one, so the out-of-order engine cannot take effect: all operations are sequential. As a result, your benchmark only measures the latency of both instructions, which is 1 cycle for each, hence the similar result. To measure the throughput of the instructions, you need to interleave multiple independent operations. I made a quickbench and the results show that the
Thanks.
Here's an improved benchmark:
Run with
I'm still not seeing much difference. A bit better with the all-in-register version, but load/BTS/store is just as fast as OR to memory.
For memory operands, the bottleneck is the store-forwarding latency, which has a whopping value of around 5 cycles; this overwhelms any other computation, since 4 My point is that transforming an
Also, the OP executed their benchmark on a Zen 4 CPU, which has exceptionally bad
But
I was wrong, my machine is actually Zen 2. Here are my results:
100% agree. I think the question here boils down to: should we fix this by adding BTSconst-to-memory instructions, or should we fix this by getting rid of BTSconst altogether?
UPDATE: The size comparison below is not accurate, as BTS takes
New (correct) size note:
From a size point of view, I played around with https://defuse.ca/online-x86-assembler.htm#disassembly a little bit:
Indeed OR is bigger by one byte for bits 0..7, and bigger by two bytes for bits 8..31 (I assume bigger by four bytes for bits 32..61, but my Lua REPL didn't make this easy).
go/src/cmd/compile/internal/ssa/_gen/AMD64.rules Lines 649 to 658 in 68a32ce
uint64(c) >= 128.
Use a larger constant. 0x20 fits in a single byte, whereas 0x80 (which is 1<<7) does not. The larger OR is needed for constants of 128 and larger, which is bit index 7 and larger.
Sorry about that, I corrected it above. Indeed it's a 1-byte disadvantage for OR for bits 0..7, and a 2-byte disadvantage for bits 8..31. It's hard for me to tell how significant that could be. In a binary I have lying around here:
So we'd add (61242*2)/1954399656 = 0.0062% to the binary. That seems more or less alright, but it's also a sample size of one. Also I'd note that perhaps removing the rewrite rules would eliminate the MOV/BTS/MOV transformation, thus making this a net gain. I've also observed some C++-generated instructions while scanning this binary (it's a mixed binary). LLVM does generate larger ORs.
For memory operations, GCC wisely always uses
@randall77 I believe you can't do
Now there are 4 cases:
As a result, the final value of x can only be one of 1, 0x102, or 0x103. However, if you implement
@merykitty I don't think the issue you raised is a showstopper, at least not directly. That code has a race in it and we're allowed to do anything we want when there is a race.
@randall77 The Go memory model mandates that:
So it is not right that we can do anything we want in the presence of a data race.
I see, so you are claiming that we get a word-tearing violation if we use sub-word OR-with-memory instructions.
Ok, I think I agree with @merykitty that we can't use thinner ORs, it leads to word tearing which we promise not to do. Proposal: use ORconst for anything which can be done with a single instruction. So with both registers and memory targets. That should tend to choose performance over code size. Although I think the performance gains are not terribly large, neither is the code size penalty.
Separately, we should then allow BTSQ to operate directly on memory. |
Change https://go.dev/cl/515755 mentions this issue: |
GCC does use

```c
void f(unsigned long long *p) {
    *p |= 0x800000000ULL;
}
```
Ah yes, when I said |
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

What did you do?
We have a code generator that generates a struct with setters. To track whether set has been called for a given field, we flip the bit in a bitmap. The code looks like this:
What did you expect to see?

I expected similar instructions (with different operands) being generated for both setters.

What did you see instead?

SetU is ~30% slower than SetV, as measured in local benchmarks (on a Zen 4 machine). The relevant difference is (godbolt):

It seems like OR into memory does better than MOV/BTS/MOV.
.According to https://www.uops.info/table.html, for skylake-x and zen4, it seems the OR family is pound-for-pound (slightly) better than the BTS family:
I didn't look up what those MOV instructions cost, but it's difficult to predict costs from individual operations in the complex processors of today.

Things I didn't test (because the Go compiler doesn't generate/inline them):

Some of the speedup may be due to the shorter instruction sequence, too.