cmd/compile: lower performance in tip and AMD64=v3 #59225

rip-create-your-account · 2023-03-24T16:35:41Z

What version of Go are you using (`go version`)?

$ gotip version
go version devel go1.21-a6c382eaa8 Fri Mar 24 14:23:19 2023 +0000 linux/amd64

$ go1.20.2 version
go version go1.20.2 linux/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

package main

import (
        "encoding/binary"
        "math/bits"
        "math/rand"
        "testing"
)

func BenchmarkChecksum(b *testing.B) {
        bb := make([]byte, 1024)
        r := rand.New(rand.NewSource(1))

        initial := uint16(r.Uint32())
        r.Read(bb)

        b.ResetTimer()

        var out uint16
        for i := 0; i < b.N; i++ {
                out = Checksum(bb, initial)
        }
        _ = out

        b.SetBytes(int64(len(bb)))
}

func Checksum(buf []byte, _ uint16) uint16 {
        var off int
        var sum, carry uint64
        for ; off <= len(buf)-64; off += 64 {
                chunk := (*[64]byte)(buf[off : off+64 : off+64])

                {
                        n1 := binary.BigEndian.Uint64(chunk[8*0:])
                        sum, carry = bits.Add64(sum, n1, carry)
                        n2 := binary.BigEndian.Uint64(chunk[8*1:])
                        sum, carry = bits.Add64(sum, n2, carry)
                        n3 := binary.BigEndian.Uint64(chunk[8*2:])
                        sum, carry = bits.Add64(sum, n3, carry)
                        n4 := binary.BigEndian.Uint64(chunk[8*3:])
                        sum, carry = bits.Add64(sum, n4, carry)
                }

                {
                        n1 := binary.BigEndian.Uint64(chunk[8*4:])
                        sum, carry = bits.Add64(sum, n1, carry)
                        n2 := binary.BigEndian.Uint64(chunk[8*5:])
                        sum, carry = bits.Add64(sum, n2, carry)
                        n3 := binary.BigEndian.Uint64(chunk[8*6:])
                        sum, carry = bits.Add64(sum, n3, carry)
                        n4 := binary.BigEndian.Uint64(chunk[8*7:])
                        sum, carry = bits.Add64(sum, n4, carry)
                }
        }
        return uint16(sum)
}

What did you expect to see?

gotip and GOAMD64=v3 go1.20.2 to maintain the performance of go1.20.2

What did you see instead?

go1.20.2 test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     15659816                72.35 ns/op     14154.40 MB/s

gotip test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     12009291                94.35 ns/op     10852.66 MB/s

For go1.20.2 GOAMD64=v3 I get similar performance as for gotip

GOAMD64=v3 go1.20.2 test --bench Checksum checksum_test.go
goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
BenchmarkChecksum-8     12212551                97.91 ns/op     10458.96 MB/s

The text was updated successfully, but these errors were encountered:

heschi · 2023-03-24T17:26:51Z

cc @golang/compiler

randall77 · 2023-03-24T19:45:40Z

I am not seeing the same thing you are seeing. I ran with 1.20.2 and tip, GOAMD64=v1 and v3. They all seem almost identical.

% benchstat 1202v1 1202v3 tipv1 tipv3
name \ time/op  1202v1         1202v3         tipv1          tipv3
Checksum-16       55.5ns ± 0%    55.9ns ± 1%    56.5ns ± 0%    56.5ns ± 0%

My processor is

goos: darwin
goarch: amd64
cpu: Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz

randall77 · 2023-03-24T19:53:38Z

Same on this machine, no significant performance difference:

goos: linux
goarch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

rip-create-your-account · 2023-03-24T20:22:31Z

I modified the benchmark loop to call Checksum 50*1000*1000 times so that b.N=1 and then run with go test ... --exec "perf stat". Here are the results for go1.20.2 GOAMD64=v1 and v3:

GOAMD64=v1 go1.20.2 test --bench Checksum checksum_test.go --exec "perf stat"
BenchmarkChecksum              1        3602958906 ns/op        14210.54 MB/s
PASS

           3604.05 msec task-clock                       #    1.000 CPUs utilized          
               935      context-switches                 #  259.430 /sec                   
                 1      cpu-migrations                   #    0.277 /sec                   
               346      page-faults                      #   96.003 /sec                   
       14349593850      cycles                           #    3.982 GHz                    
       39768319694      instructions                     #    2.77  insn per cycle         
        2653987967      branches                         #  736.389 M/sec                  
             67098      branch-misses                    #    0.00% of all branches

GOAMD64=v3 go1.20.2 test --bench Checksum checksum_test.go --exec "perf stat"
BenchmarkChecksum              1        4906196573 ns/op        10435.78 MB/s
PASS

           4907.21 msec task-clock                       #    1.000 CPUs utilized          
              1216      context-switches                 #  247.799 /sec                   
                 1      cpu-migrations                   #    0.204 /sec                   
               349      page-faults                      #   71.120 /sec                   
       19548126041      cycles                           #    3.984 GHz                    
       33372351551      instructions                     #    1.71  insn per cycle         
        2654871068      branches                         #  541.015 M/sec                  
             77628      branch-misses                    #    0.00% of all branches

So the generated code has something going on that impacts instructions per cycle or perf stat is just unable to report the relevant perf counters to me. If there's nothing silly going in the assembly then I'm happy to accept that my CPU has a less wrinkly brain than most.

randall77 · 2023-03-24T21:00:36Z

At v1, the code uses MOVQ + BSWAPQ. At v3, it uses MOVBE instead (combined load+byteswap).
It sounds like your chip really doesn't like MOVBE for some reason.

rip-create-your-account · 2023-03-24T21:01:14Z

I modified the Checksum function to process in chunks of 48 bytes instead of 64. This did the trick and now both GOAMD64=v1 and v3 get similar performance.

GOAMD64=v1 BenchmarkChecksum 1 3627673852 ns/op 14113.73 MB/s
GOAMD64=v3 BenchmarkChecksum 1 3639348795 ns/op 14068.45 MB/s

Looking at the generated code the only real thing that GOAMD64=v3 ever did was to convert MOVQ+BSWAP pair to a MOVBE. With 56 byte chunks I got the same performance as with 64 so I guess 48 bytes is a sweet spot for my CPU and that's that.

randall77 · 2023-03-24T21:05:10Z

I did notice that 64-byte chunks needed one more register than was available, so it had to spill a bit. 48-byte chunks would probably avoid that. That may be related to what you are seeing (it's unhappy with MOVBE followed immediately by a store?).

rip-create-your-account · 2023-03-24T21:45:12Z

For both 64-byte 56-byte chunks it also seems to be spilling the slice length. But I might be misreading the assembly. Also go tool objdump tells me that there's a sequence of 6 NOPL instructions in the hot loop after all the ADCQs. Why is there so many of them in a sequence? For 64-byte chunks version it's a sequence of 7 NOPLs.

randall77 · 2023-03-24T21:59:27Z

Those NOPs mark the inlining callsites of the binary.Uint64 calls. They are used for tracebacks.
Removing them is #29571

rip-create-your-account · 2023-03-24T22:18:41Z

Alright, I now finally realized to check how the assembly diffs between GOAMD64=v1 go1.20.2 and GOAMD64=v1 gotip for the Checksum function. The difference boils down to those NOPLs just being in a different location yet the performance goes from ~14000MB/s -> ~10000MB/s.
diff go1.20.2vsgotip.txt
Understanding why this is the case is above my pay-grade. Feel free to close the issue if there's nothing to really do. I already figured out the fix for me.

randall77 · 2023-03-24T22:53:41Z

The nop thing is tracked on a different issue, so we can close this.

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 24, 2023

heschi added this to the Go1.21 milestone Mar 24, 2023

heschi added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 24, 2023

randall77 added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Mar 24, 2023

randall77 closed this as completed Mar 24, 2023

golang locked and limited conversation to collaborators Mar 24, 2024

gopherbot added the FrozenDueToAge label Mar 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: lower performance in tip and AMD64=v3 #59225

cmd/compile: lower performance in tip and AMD64=v3 #59225

rip-create-your-account commented Mar 24, 2023

heschi commented Mar 24, 2023

randall77 commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

cmd/compile: lower performance in tip and AMD64=v3 #59225

cmd/compile: lower performance in tip and AMD64=v3 #59225

Comments

rip-create-your-account commented Mar 24, 2023

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What did you do?

What did you expect to see?

What did you see instead?

heschi commented Mar 24, 2023

randall77 commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

rip-create-your-account commented Mar 24, 2023

randall77 commented Mar 24, 2023

What version of Go are you using (`go version`)?