strange performance regression: lookup vs rsh #48278

Closed
holiman opened this issue Sep 9, 2021 · 5 comments

holiman commented Sep 9, 2021

What version of Go are you using (go version)?

go version go1.17 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/user/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/user/go/src/github.com/ethereum/go-ethereum/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build171075918=/tmp/go-build -gno-record-gcc-switches"

What did you do?

In go-ethereum, we have an algorithm that iterates over a slice of bytes and fills a bitmap depending on the values of those bytes. Certain values set a bit to 1; others leave it at 0.

While optimizing it, we found a behaviour that was very unintuitive and that we could not explain: using a small lookup table instead of a right shift increased performance by ~20%.

This PR attempts to remove the lookup, and notes the performance regression: ethereum/go-ethereum#23472.

I've made a PoC repro.

analysis.go:

package main

type bitvec []byte

// lookup[i] is the byte with only bit i set, counting from the most
// significant bit: lookup[0] == 0x80, ..., lookup[7] == 0x01.
var lookup = [8]byte{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}

func (bits bitvec) set1(pos uint64) {
	// test.sh flips this constant between true (lookup table) and
	// false (right shift) to benchmark the two variants.
	if true {
		bits[pos/8] |= lookup[pos%8]
	} else {
		bits[pos/8] |= 0x80 >> (pos & 7)
	}
}

func codeBitmapInternal(code []byte, bits bitvec) bitvec {
	for pc := uint64(0); pc < uint64(len(code)); {
		op := code[pc]
		pc++
		// Only ops in the range 0x60..0x7f (the EVM PUSH1..PUSH32
		// opcodes) have immediate data to mark in the bitmap.
		if op < 0x60 || op > 0x7f {
			continue
		}
		numbits := op - 0x60 + 1
		switch numbits {
		case 1:
			bits.set1(pc)
			pc += 1
		}
	}
	return bits
}

analysis_test.go:

package main

import (
	"fmt"
	"testing"
)

const analysisCodeSize = 1200 * 1024

func BenchmarkJumpdestOpAnalysis(bench *testing.B) {
	var op byte
	bencher := func(b *testing.B) {
		code := make([]byte, analysisCodeSize)
		b.SetBytes(analysisCodeSize)
		for i := range code {
			code[i] = op
		}
		bits := make(bitvec, len(code)/8+1+4)
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			for j := range bits {
				bits[j] = 0
			}
			codeBitmapInternal(code, bits)
		}
	}
	op = 0x60
	bench.Run(fmt.Sprintf("%x", op), bencher)
	op = 0
	bench.Run(fmt.Sprintf("%x", op), bencher)
}

And a script that benchmarks the shift variant, then swaps it for the lookup and benchmarks again:
test.sh:

sed -i 's/true/false/' analysis.go
go test . -bench . --count 5  > 1.txt
sed -i 's/false/true/' analysis.go
go test . -bench . --count 5  > 2.txt
benchstat 1.txt 2.txt
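
(benchstat is not part of the standard toolchain; if needed, it can be installed with go install golang.org/x/perf/cmd/benchstat@latest.)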

Running this yields:

bash test.sh
name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.06ms ± 2%    0.87ms ± 3%  -17.81%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.01ms ± 3%    1.01ms ± 4%     ~     (p=0.548 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.16GB/s ± 2%  1.41GB/s ± 3%  +21.70%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.21GB/s ± 3%  1.22GB/s ± 4%     ~     (p=0.548 n=5+5)

In other words, the code using:

var lookup = [8]byte{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}
...
		bits[pos/8] |= lookup[pos%8]

is a lot faster than the code using the right shift:

		bits[pos/8] |= 0x80 >> (pos & 7)

What did you expect to see?

I expected the right-shift operation to be at least as fast as the lookup-based variant.

What did you see instead?

The lookup-based variant was faster.

So this is maybe not a bug, but it's behaviour whose underlying cause I would be grateful to understand. There is also a slight chance that the compiler misses an opportunity to optimize something here.

I have studied the -gcflags="-m -m" output to see if the inlining diagnostics showed anything, but afaict the inlining behaviour is identical across both variants.
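
For anyone reproducing this, the inlining diagnostics and the generated assembly can be dumped with standard toolchain flags, roughly like so (nothing project-specific assumed):

go build -gcflags="-m -m" .        # inlining/escape-analysis diagnostics
go test -gcflags="-S" -run=NONE .  # assembly for the package under test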

holiman commented Sep 9, 2021

And actually, there's one more thing we've been scratching our heads over. If we make the switch a bit more complex by adding a method that won't even be used in the benchmark:

const (
	set2BitsMask = uint16(0b1100_0000_0000_0000)
	set3BitsMask = uint16(0b1110_0000_0000_0000)
	set4BitsMask = uint16(0b1111_0000_0000_0000)
	set5BitsMask = uint16(0b1111_1000_0000_0000)
	set6BitsMask = uint16(0b1111_1100_0000_0000)
	set7BitsMask = uint16(0b1111_1110_0000_0000)
)

func (bits bitvec) setN(flag uint16, pos uint64) {
	a := flag >> (pos % 8)
	bits[pos/8] |= byte(a >> 8)
	if b := byte(a); b != 0 {
		// If the bit-setting spills into the neighbouring byte, we can
		// assign rather than OR, since it's the first write to that byte.
		bits[pos/8+1] = b
	}
}
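
To make the two-byte logic concrete, here is a worked trace (mine, not from the original code comments) of setN(set3BitsMask, 6), i.e. a 3-bit run starting at position 6:

	flag         = 0b1110_0000_0000_0000
	a            = flag >> (6 % 8)  = 0b0000_0011_1000_0000
	byte(a >> 8) = 0b0000_0011      // ORed into bits[0]: positions 6 and 7
	byte(a)      = 0b1000_0000      // assigned to bits[1]: position 8

So positions 6, 7 and 8 are set, with the run crossing the byte boundary.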

But which adds more cases to the switch:

func codeBitmapInternal(code []byte, bits bitvec) bitvec {
	for pc := uint64(0); pc < uint64(len(code)); {
		op := code[pc]
		pc++
		if op < 0x60 || op > 0x7f {
			continue
		}
		numbits := op - 0x60 + 1
		switch numbits {
		case 1:
			bits.set1(pc)
			pc += 1
		case 2:
			bits.setN(set2BitsMask, pc)
			pc += 2
		case 3:
			bits.setN(set3BitsMask, pc)
			pc += 3
		case 4:
			bits.setN(set4BitsMask, pc)
			pc += 4
		case 5:
			bits.setN(set5BitsMask, pc)
			pc += 5
		case 6:
			bits.setN(set6BitsMask, pc)
			pc += 6
		case 7:
			bits.setN(set7BitsMask, pc)
			pc += 7
		}
	}
	return bits
}

Then, suddenly, the choice of lookup versus right shift starts to affect the 0-case as well. Theoretically, when the code is filled with zeroes only, the loop takes the continue branch every iteration and never reaches set1, so whether set1 uses the lookup or the right shift should not matter.

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.20ms ± 4%    1.05ms ± 2%  -12.59%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6      600µs ± 1%     782µs ± 2%  +30.28%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.03GB/s ± 4%  1.17GB/s ± 2%  +14.38%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   2.05GB/s ± 1%  1.57GB/s ± 2%  -23.23%  (p=0.008 n=5+5)

randall77 (Contributor) commented

I don't see anything obvious from the generated assembly.
There is:

	0x0058 00088 (analysis.go:13)	MOVL	$-128, R12
	0x005e 00094 (analysis.go:13)	SHRB	CL, R12B
	0x0061 00097 (analysis.go:13)	ORL	R12, R10

Possibly that latter ORL has a partial register stall on the R12 input because part of it is generated by the MOVL and part by the SHRB. As an experiment, you could try making the bitvector a []uint32 or []uint64 instead of a []byte and see if that makes a difference.

holiman commented Sep 10, 2021

> As an experiment, you could try making the bitvector a []uint32 or []uint64 instead of a []byte and see if that makes a difference.

Ok, I made the change by just replacing the types of the lookup table and the bitvec, but otherwise not changing anything about the code.
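
A sketch of what that type swap looks like, reconstructed from the description above (not the actual diff; only the element types widen, the indices stay the same, so each uint64 element still holds just one byte's worth of bits):

type bitvec []uint64

var lookup = [8]uint64{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}

func (bits bitvec) set1(pos uint64) {
	if true {
		bits[pos/8] |= lookup[pos%8] // full-width OR, no byte-sized write
	} else {
		bits[pos/8] |= 0x80 >> (pos & 7)
	}
}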

[]byte:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.05ms ± 3%    0.89ms ± 6%  -15.14%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.00ms ± 1%    1.06ms ± 5%     ~     (p=0.056 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.17GB/s ± 3%  1.38GB/s ± 6%  +17.95%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.22GB/s ± 1%  1.16GB/s ± 6%     ~     (p=0.056 n=5+5)

[]uint32:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.03ms ± 1%    0.89ms ± 4%  -14.15%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.01ms ± 0%    1.06ms ± 6%   +4.15%  (p=0.032 n=4+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.19GB/s ± 1%  1.39GB/s ± 4%  +16.52%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.21GB/s ± 0%  1.17GB/s ± 6%   -3.85%  (p=0.032 n=4+5)

[]uint64:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.11ms ± 3%    0.94ms ± 2%  -14.69%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.05ms ± 2%    1.14ms ± 7%   +8.65%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.11GB/s ± 3%  1.30GB/s ± 2%  +17.20%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.17GB/s ± 2%  1.08GB/s ± 6%   -7.84%  (p=0.008 n=5+5)

randall77 (Contributor) commented

I'm out of ideas.
I don't think there is a Go bug here - if anything this is just a "feature" of the processor. I'm going to close this issue, but please feel free to reopen if you find something non-optimal about the generated code.

holiman commented Sep 10, 2021 via email

golang locked and limited conversation to collaborators Sep 10, 2022