Description
What version of Go are you using (`go version`)?
$ go version go1.21-dev +fe5af1532a
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (`go env`)?
GOARCH=amd64
What did you do?
We have a code generator that generates a struct with setters. To track whether a setter has been called for a given field, we set the corresponding bit in a bitmap. The code looks like this:
    func setBit(part *uint32, num uint32) {
    	*part |= 1 << (num % 32)
    }

    type x struct {
    	bitmap [4]uint32 // A bitmap containing whether "Set" was called on a given field.
    	u      int32     // Imagine this is field number 8.
    	v      int32     // Imagine this is field number 38.
    }

    func (m *x) SetV(val int32) {
    	m.v = val
    	setBit(&(m.bitmap[1]), 37)
    }

    func (m *x) SetU(val int32) {
    	m.u = val
    	setBit(&(m.bitmap[0]), 7)
    }
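For reference, here is a small sketch of the masks the two calls work out to. It reuses `setBit` from above; the `word0` and `word1` variables are only illustrative stand-ins for `bitmap[0]` and `bitmap[1]`.

```go
package main

import "fmt"

// setBit is copied from the snippet above.
func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

func main() {
	// word0 and word1 are illustrative stand-ins for bitmap[0] and bitmap[1].
	var word0, word1 uint32

	setBit(&word0, 7)  // 7 % 32 = 7, so the mask is 1<<7
	setBit(&word1, 37) // 37 % 32 = 5, so the mask is 1<<5

	fmt.Printf("bit 7:  mask=%d (0x%02x)\n", word0, word0) // bit 7:  mask=128 (0x80)
	fmt.Printf("bit 37: mask=%d (0x%02x)\n", word1, word1) // bit 37: mask=32 (0x20)
}
```

So the only source-level difference between the two setters is the constant: 32 for `SetV` and 128 for `SetU`. One observation, not a confirmed cause: 128 is the smallest power of two that does not fit in a sign-extended 8-bit immediate, whereas 32 does.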
What did you expect to see?
I expected similar instructions (with different operands) being generated for both setters.
What did you see instead?
`SetU` is ~30% slower than `SetV`, as measured in local benchmarks (on a zen4 machine). The relevant difference is (godbolt):
    TEXT main.(*x).SetV(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
    	MOVL BX, 20(AX)
    	NOP
    	ORL $32, 4(AX)
    	RET

    TEXT main.(*x).SetU(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
    	MOVL BX, 16(AX)
    	MOVL (AX), CX
    	BTSL $7, CX
    	NOP
    	MOVL CX, (AX)
    	RET
It seems like `OR` into memory does better than `MOV`/`BTS`/`MOV`.
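The benchmark I ran locally isn't included here. A comparison along these lines could be used to check the difference; this is only a sketch, and `benchSetU`, `benchSetV`, and `sink` are hypothetical names, not the original harness. It uses `testing.Benchmark` so it can run as a plain program.

```go
package main

import (
	"fmt"
	"testing"
)

type x struct {
	bitmap [4]uint32
	u      int32
	v      int32
}

func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

func (m *x) SetV(val int32) {
	m.v = val
	setBit(&(m.bitmap[1]), 37)
}

func (m *x) SetU(val int32) {
	m.u = val
	setBit(&(m.bitmap[0]), 7)
}

// sink is a hypothetical package-level variable that keeps the
// struct reachable after each benchmark loop.
var sink *x

func benchSetU(b *testing.B) {
	m := &x{}
	for i := 0; i < b.N; i++ {
		m.SetU(int32(i))
	}
	sink = m
}

func benchSetV(b *testing.B) {
	m := &x{}
	for i := 0; i < b.N; i++ {
		m.SetV(int32(i))
	}
	sink = m
}

func main() {
	// testing.Benchmark runs a benchmark function outside of "go test".
	fmt.Println("SetU:", testing.Benchmark(benchSetU))
	fmt.Println("SetV:", testing.Benchmark(benchSetV))
}
```

Note that the setters are likely to be inlined into the benchmark loops, so the code being measured may differ from the standalone functions shown above. Absolute numbers will depend on the machine and Go version.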
According to https://www.uops.info/table.html, for skylake-x and zen4, it seems the OR family is pound-for-pound (slightly) better than the BTS family:
Instruction | Lat (Skylake-X) | TP (Skylake-X) | Uops (Skylake-X) | Ports (Skylake-X) | Lat (Zen 4) | TP (Zen 4) | Uops (Zen 4) | Ports (Zen 4) |
---|---|---|---|---|---|---|---|---|
BTS (M32, I8) | [≤3;≤10] | 1.00 / 1.00 | 3 / 4 | 1*p06+1*p23+1*p237+1*p4 | [5;12] | 2.00 | 4 | |
OR (M32, I32) | [≤3;≤10] | 1.00 / 1.00 | 2 / 4 | 1*p0156+1*p23+1*p237+1*p4 | [≤1;≤8] | 0.56 | 2 | |
BTS (R32, I8) | 1 | 0.50 / 0.50 | 1 / 1 | 1*p06 | [1;2] | 1.00 | 2 | |
OR (R32, I8) | 1 | 0.25 / 0.25 | 1 / 1 | 1*p0156 | 1 | 0.25 | 1 | |
I didn't look up what those `MOV` instructions cost, and in any case it's difficult to predict overall cost from individual instructions on today's complex processors. Things I didn't test (because the Go compiler doesn't generate/inline them):
- MOV/OR/MOV
- BTS memory,immediate
Some of the speedup may be due to the shorter instruction sequence, too.