cmd/compile: OR into memory is cheaper than MOV/BTSL/MOV on x86 #61694

Closed
@aktau

What version of Go are you using (go version)?

$ go version
go1.21-dev +fe5af1532a

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

GOARCH=amd64

What did you do?

We have a code generator that generates a struct with setters. To track whether a setter has been called for a given field, we set the corresponding bit in a bitmap. The code looks like this:

func setBit(part *uint32, num uint32) {
	*part |= 1 << (num % 32)
}

type x struct {
	bitmap [4]uint32 // A bitmap containing whether "Set" was called on a given field.
	u      int32     // Imagine this is field number 8.
	v      int32     // Imagine this is field number 38.
}

func (m *x) SetV(val int32) {
	m.v = val
	setBit(&(m.bitmap[1]), 37)
}

func (m *x) SetU(val int32) {
	m.u = val
	setBit(&(m.bitmap[0]), 7)
}
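To make the difference between the two setters concrete, the sketch below evaluates the mask that setBit ORs in for each call. The helper name mask is hypothetical and only mirrors the expression inside setBit. The two values line up with the operands in the disassembly further down: ORL $32 for SetV, and BTSL $7 (bit 7, i.e. 128) for SetU. One possible explanation, which this report does not confirm, is that 128 is the first single-bit mask that no longer fits in a sign-extended 8-bit immediate.

```go
package main

import "fmt"

// mask is a hypothetical helper that mirrors the expression inside setBit.
func mask(num uint32) uint32 {
	return 1 << (num % 32)
}

func main() {
	// SetV: field 38 uses bit 37, which lands on bit 5 of bitmap[1].
	fmt.Println(mask(37), mask(37) < 128) // prints "32 true"
	// SetU: field 8 uses bit 7 of bitmap[0].
	fmt.Println(mask(7), mask(7) < 128) // prints "128 false"
}
```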

What did you expect to see?

I expected similar instructions (with different operands) to be generated for both setters.

What did you see instead?

SetU is ~30% slower than SetV, as measured in local benchmarks (on a zen4 machine). The relevant difference in the generated code (from godbolt) is:

TEXT    main.(*x).SetV(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 20(AX)
        NOP
        ORL     $32, 4(AX)
        RET

TEXT    main.(*x).SetU(SB), NOSPLIT|NOFRAME|ABIInternal, $0-16
        MOVL    BX, 16(AX)
        MOVL    (AX), CX
        BTSL    $7, CX
        NOP
        MOVL    CX, (AX)
        RET

It seems like OR into memory does better than MOV/BTS/MOV.

According to https://www.uops.info/table.html, for skylake-x and zen4, it seems the OR family is pound-for-pound (slightly) better than the BTS family:

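| Instruction | Skylake-X Lat | Skylake-X TP | Skylake-X Uops | Skylake-X Ports | Zen4 Lat | Zen4 TP | Zen4 Uops | Zen4 Ports |
|---|---|---|---|---|---|---|---|---|
| BTS (M32, I8) | [≤3;≤10] | 1.00 / 1.00 | 3 / 4 | 1*p06+1*p23+1*p237+1*p4 | [5;12] | 2.00 | 4 | |
| OR (M32, I32) | [≤3;≤10] | 1.00 / 1.00 | 2 / 4 | 1*p0156+1*p23+1*p237+1*p4 | [≤1;≤8] | 0.56 | 2 | |
| BTS (R32, I8) | 1 | 0.50 / 0.50 | 1 / 1 | 1*p06 | [1;2] | 1.00 | 2 | |
| OR (R32, I8) | 1 | 0.25 / 0.25 | 1 / 1 | 1*p0156 | 1 | 0.25 | 1 | |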

I didn't look up what those MOV instructions cost, and in any case it's difficult to predict the cost of a sequence from its individual operations on today's complex processors. Things I didn't test (because the Go compiler doesn't generate/inline them):

  • MOV/OR/MOV
  • BTS memory,immediate

Some of the speedup may be due to the shorter instruction sequence, too.

Labels: FrozenDueToAge, NeedsInvestigation, Performance, compiler/runtime