strange performance regression: lookup vs rsh #48278

Closed
holiman opened this issue Sep 9, 2021 · 5 comments

holiman commented Sep 9, 2021

What version of Go are you using (go version)?

go version go1.17 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/user/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/user/go/src/github.com/ethereum/go-ethereum/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build171075918=/tmp/go-build -gno-record-gcc-switches"

What did you do?

In go-ethereum, we have an algorithm that iterates over a slice of bytes and fills a bitmap depending on the values of those bytes. Certain values set a bit to 1; others leave it at 0.

While optimizing it, we found a behaviour that was very unintuitive and that we could not explain: using a small lookup table instead of a right shift increased performance by ~20%.

This PR attempts to remove the lookup, and notes the performance regression: ethereum/go-ethereum#23472.

I've made a PoC repro.

analysis.go:

package main

type bitvec []byte

// lookup[i] is the byte with only bit i set, counting from the most
// significant bit: lookup[0] == 0x80, ..., lookup[7] == 0x01.
var lookup = [8]byte{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}

func (bits bitvec) set1(pos uint64) {
	// test.sh flips this constant between true (lookup table) and
	// false (right shift) to benchmark the two variants.
	if true {
		bits[pos/8] |= lookup[pos%8]
	} else {
		bits[pos/8] |= 0x80 >> (pos & 7)
	}
}

func codeBitmapInternal(code []byte, bits bitvec) bitvec {
	for pc := uint64(0); pc < uint64(len(code)); {
		op := code[pc]
		pc++
		// Only ops in the range 0x60..0x7f (the EVM PUSH1..PUSH32
		// opcodes) have immediate data to mark in the bitmap.
		if op < 0x60 || op > 0x7f {
			continue
		}
		numbits := op - 0x60 + 1
		switch numbits {
		case 1:
			bits.set1(pc)
			pc += 1
		}
	}
	return bits
}

analysis_test.go:

package main

import (
	"fmt"
	"testing"
)

const analysisCodeSize = 1200 * 1024

func BenchmarkJumpdestOpAnalysis(bench *testing.B) {
	var op byte
	bencher := func(b *testing.B) {
		code := make([]byte, analysisCodeSize)
		b.SetBytes(analysisCodeSize)
		for i := range code {
			code[i] = op
		}
		bits := make(bitvec, len(code)/8+1+4)
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			for j := range bits {
				bits[j] = 0
			}
			codeBitmapInternal(code, bits)
		}
	}
	op = 0x60
	bench.Run(fmt.Sprintf("%x", op), bencher)
	op = 0
	bench.Run(fmt.Sprintf("%x", op), bencher)
}

And a script that benchmarks the shift variant, then swaps it for the lookup and benchmarks again:
test.sh:

sed -i 's/true/false/' analysis.go
go test . -bench . --count 5  > 1.txt
sed -i 's/false/true/' analysis.go
go test . -bench . --count 5  > 2.txt
benchstat 1.txt 2.txt
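
(benchstat is not part of the standard toolchain; if needed, it can be installed with go install golang.org/x/perf/cmd/benchstat@latest.)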

Running this yields:

bash test.sh
name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.06ms ± 2%    0.87ms ± 3%  -17.81%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.01ms ± 3%    1.01ms ± 4%     ~     (p=0.548 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.16GB/s ± 2%  1.41GB/s ± 3%  +21.70%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.21GB/s ± 3%  1.22GB/s ± 4%     ~     (p=0.548 n=5+5)

In other words, the code using:

var lookup = [8]byte{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}
...
		bits[pos/8] |= lookup[pos%8]

is a lot faster than the code using the right shift:

		bits[pos/8] |= 0x80 >> (pos & 7)

What did you expect to see?

I expected the right-shift operation to be at least as fast as the lookup-based variant.

What did you see instead?

The lookup-based variant was faster.

So this is maybe not a bug, but it's behaviour whose underlying cause I would be grateful to understand. There is also a slight chance that the compiler misses an opportunity to optimize something here.

I have studied the -gcflags="-m -m" output to see if the inlining diagnostics showed anything, but afaict the inlining behaviour is identical across both variants.
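
For anyone reproducing this, the inlining diagnostics and the generated assembly can be dumped with standard toolchain flags, roughly like so (nothing project-specific assumed):

go build -gcflags="-m -m" .        # inlining/escape-analysis diagnostics
go test -gcflags="-S" -run=NONE .  # assembly for the package under test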

holiman commented Sep 9, 2021

And actually, there's one more thing we've been scratching our heads over. If we make the switch a bit more complex by adding a method that won't even be used in the benchmark:

const (
	set2BitsMask = uint16(0b1100_0000_0000_0000)
	set3BitsMask = uint16(0b1110_0000_0000_0000)
	set4BitsMask = uint16(0b1111_0000_0000_0000)
	set5BitsMask = uint16(0b1111_1000_0000_0000)
	set6BitsMask = uint16(0b1111_1100_0000_0000)
	set7BitsMask = uint16(0b1111_1110_0000_0000)
)

func (bits bitvec) setN(flag uint16, pos uint64) {
	a := flag >> (pos % 8)
	bits[pos/8] |= byte(a >> 8)
	if b := byte(a); b != 0 {
		// If the bit-setting spills into the neighbouring byte, we can
		// assign rather than OR, since it's the first write to that byte.
		bits[pos/8+1] = b
	}
}
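
To make the two-byte logic concrete, here is a worked trace (mine, not from the original code comments) of setN(set3BitsMask, 6), i.e. a 3-bit run starting at position 6:

	flag         = 0b1110_0000_0000_0000
	a            = flag >> (6 % 8)  = 0b0000_0011_1000_0000
	byte(a >> 8) = 0b0000_0011      // ORed into bits[0]: positions 6 and 7
	byte(a)      = 0b1000_0000      // assigned to bits[1]: position 8

So positions 6, 7 and 8 are set, with the run crossing the byte boundary.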

But which adds more cases to the switch:

func codeBitmapInternal(code []byte, bits bitvec) bitvec {
	for pc := uint64(0); pc < uint64(len(code)); {
		op := code[pc]
		pc++
		if op < 0x60 || op > 0x7f {
			continue
		}
		numbits := op - 0x60 + 1
		switch numbits {
		case 1:
			bits.set1(pc)
			pc += 1
		case 2:
			bits.setN(set2BitsMask, pc)
			pc += 2
		case 3:
			bits.setN(set3BitsMask, pc)
			pc += 3
		case 4:
			bits.setN(set4BitsMask, pc)
			pc += 4
		case 5:
			bits.setN(set5BitsMask, pc)
			pc += 5
		case 6:
			bits.setN(set6BitsMask, pc)
			pc += 6
		case 7:
			bits.setN(set7BitsMask, pc)
			pc += 7
		}
	}
	return bits
}

Then, suddenly, the choice of lookup versus right shift starts to affect the 0-case as well. Theoretically, when the code is filled with zeroes only, the loop takes the continue branch every iteration and never reaches set1, so whether set1 uses the lookup or the right shift should not matter.

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.20ms ± 4%    1.05ms ± 2%  -12.59%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6      600µs ± 1%     782µs ± 2%  +30.28%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.03GB/s ± 4%  1.17GB/s ± 2%  +14.38%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   2.05GB/s ± 1%  1.57GB/s ± 2%  -23.23%  (p=0.008 n=5+5)

randall77 (Contributor) commented

I don't see anything obvious from the generated assembly.
There is:

	0x0058 00088 (analysis.go:13)	MOVL	$-128, R12
	0x005e 00094 (analysis.go:13)	SHRB	CL, R12B
	0x0061 00097 (analysis.go:13)	ORL	R12, R10

Possibly that latter ORL has a partial register stall on the R12 input because part of it is generated by the MOVL and part by the SHRB. As an experiment, you could try making the bitvector a []uint32 or []uint64 instead of a []byte and see if that makes a difference.

holiman commented Sep 10, 2021

> As an experiment, you could try making the bitvector a []uint32 or []uint64 instead of a []byte and see if that makes a difference.

Ok, I made the change by just replacing the types of the lookup table and the bitvec, but otherwise not changing anything about the code.
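
A sketch of what that type swap looks like, reconstructed from the description above (not the actual diff; only the element types widen, the indices stay the same, so each uint64 element still holds just one byte's worth of bits):

type bitvec []uint64

var lookup = [8]uint64{
	0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1,
}

func (bits bitvec) set1(pos uint64) {
	if true {
		bits[pos/8] |= lookup[pos%8] // full-width OR, no byte-sized write
	} else {
		bits[pos/8] |= 0x80 >> (pos & 7)
	}
}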

[]byte:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.05ms ± 3%    0.89ms ± 6%  -15.14%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.00ms ± 1%    1.06ms ± 5%     ~     (p=0.056 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.17GB/s ± 3%  1.38GB/s ± 6%  +17.95%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.22GB/s ± 1%  1.16GB/s ± 6%     ~     (p=0.056 n=5+5)

[]uint32:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.03ms ± 1%    0.89ms ± 4%  -14.15%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.01ms ± 0%    1.06ms ± 6%   +4.15%  (p=0.032 n=4+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.19GB/s ± 1%  1.39GB/s ± 4%  +16.52%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.21GB/s ± 0%  1.17GB/s ± 6%   -3.85%  (p=0.032 n=4+5)

[]uint64:

name                     old time/op    new time/op    delta
JumpdestOpAnalysis/60-6    1.11ms ± 3%    0.94ms ± 2%  -14.69%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6     1.05ms ± 2%    1.14ms ± 7%   +8.65%  (p=0.008 n=5+5)

name                     old speed      new speed      delta
JumpdestOpAnalysis/60-6  1.11GB/s ± 3%  1.30GB/s ± 2%  +17.20%  (p=0.008 n=5+5)
JumpdestOpAnalysis/0-6   1.17GB/s ± 2%  1.08GB/s ± 6%   -7.84%  (p=0.008 n=5+5)

randall77 (Contributor) commented

I'm out of ideas.
I don't think there is a Go bug here - if anything this is just a "feature" of the processor. I'm going to close this issue, but please feel free to reopen if you find something non-optimal about the generated code.

holiman commented Sep 10, 2021 via email

golang locked and limited conversation to collaborators Sep 10, 2022