cmd/compile: lower performance in tip and AMD64=v3 #59225
Comments
cc @golang/compiler |
I am not seeing the same thing you are seeing. I ran with 1.20.2 and tip, GOAMD64=v1 and v3. They all seem almost identical.
My processor is
|
Same on this machine, no significant performance difference:
|
I modified the benchmark loop to call Checksum
So the generated code has something going on that impacts instructions per cycle, or perf stat is just unable to report the relevant perf counters to me. If there's nothing silly going on in the assembly then I'm happy to accept that my CPU has a less wrinkly brain than most. |
At v1, the code uses |
I modified the Checksum function to process in chunks of 48 bytes instead of 64. This did the trick and now both GOAMD64=v1 and v3 get similar performance.
GOAMD64=v1 BenchmarkChecksum 1 3627673852 ns/op 14113.73 MB/s
Looking at the generated code, the only real thing that GOAMD64=v3 ever did was to convert a MOVQ+BSWAP pair to a MOVBE. With 56-byte chunks I got the same performance as with 64, so I guess 48 bytes is a sweet spot for my CPU and that's that. |
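The Checksum source itself is not reproduced in this thread, so the following is only a minimal sketch of what a 48-byte-chunk loop could look like. The function name `chunkedSum`, the tail handling, and the final carry fold are assumptions made for illustration, not the reporter's code. It assumes the hot loop pairs big-endian 64-bit loads (`binary.BigEndian.Uint64`, the kind of load that can become MOVQ+BSWAP or MOVBE) with carry-chained adds (`bits.Add64`, which would account for the ADCQs mentioned in the discussion).

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// chunkedSum is a hypothetical stand-in for the Checksum function discussed
// in this thread. It sums big-endian 64-bit words with carry propagation,
// consuming 48 bytes (six words) per iteration of the main loop.
func chunkedSum(b []byte) uint64 {
	var sum, carry uint64
	for len(b) >= 48 {
		// Each Uint64 call is an 8-byte big-endian load; each Add64 is an
		// add that consumes the previous carry and produces a new one.
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[0:8]), carry)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[8:16]), carry)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[16:24]), carry)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[24:32]), carry)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[32:40]), carry)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[40:48]), carry)
		b = b[48:]
	}
	// Remaining whole words, one at a time.
	for len(b) >= 8 {
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b[:8]), carry)
		b = b[8:]
	}
	// Fewer than 8 trailing bytes: zero-pad on the right (an assumption).
	if len(b) > 0 {
		var tail [8]byte
		copy(tail[:], b)
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(tail[:]), carry)
	}
	// Fold the outstanding carry back in (end-around carry). The first fold
	// can itself overflow to zero, so add the resulting carry once more.
	sum, carry = bits.Add64(sum, 0, carry)
	return sum + carry
}

func main() {
	buf := make([]byte, 1<<10)
	for i := range buf {
		buf[i] = byte(i)
	}
	fmt.Printf("%#016x\n", chunkedSum(buf))
}
```

Six loads per iteration keep fewer values live than eight would, which is one plausible reason a 48-byte variant behaves differently from a 64-byte one; the actual cause on a given CPU would need to be confirmed from the generated assembly.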
I did notice that 64-byte chunks needed one more register than was available, so it had to spill a bit. 48-byte chunks would probably avoid that. That may be related to what you are seeing (it's unhappy with |
For both 64-byte and 56-byte chunks it also seems to be spilling the slice length. But I might be misreading the assembly. Also, go tool objdump tells me that there's a sequence of 6 NOPL instructions in the hot loop after all the ADCQs. Why are there so many of them in a sequence? For the 64-byte chunk version it's a sequence of 7 NOPLs. |
Those NOPs mark the inlining callsites of the |
Alright, I finally thought to check how the assembly for the Checksum function differs between GOAMD64=v1 go1.20.2 and GOAMD64=v1 gotip. The difference boils down to those NOPLs just being in a different location, yet the performance goes from ~14000 MB/s to ~10000 MB/s. |
The nop thing is tracked on a different issue, so we can close this. |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What did you do?
What did you expect to see?
gotip and GOAMD64=v3 go1.20.2 to maintain the performance of go1.20.2
What did you see instead?
With go1.20.2 and GOAMD64=v3 I get performance similar to gotip.