cmd/compile: poor register allocator behavior in compression code #16122
I'm not seeing what you are seeing. That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary. The indexing produces code like:
The latter is a single complex-addressing mode load, and the x+1 was folded into it. I'll try to fix this up during the 1.8 cycle.
OK. I suppose the issue has been addressed between 1.7.beta1 and tip then. What remains is an optimization.
@randall77, anything left here for Go 1.8?
No, I did not get to fixing anything else here. Punting to 1.9.
I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640. This shows a steady regression over time (tip at b53acd8):
CL 43491 applied to tip helps a fair amount:
That brings it back below 1.8 levels, although still not as good as 1.7.
CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.
I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there are families of loops where we can arrange to have "optimal" answers. We should probably do that in the block ordering code and then see what's left for heuristics. For example, if there is a block in the loop that dominates all the exits and is itself conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop. If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B. In "layout", we'd do this by detecting the transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models. I think the rest just falls out: if you start with C, then the only logical successor is D, then A, then B. This might subsume the loop rotation phase.
go1.10beta1 shows a 15% performance regression on this code vs. go1.9.2 windows/amd64
go1.10beta1 windows/amd64
Problematic lines: 166 and 194
@flanglet I don't see the same performance issues you do.
There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.
I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.
Increase the test count. I will work on reducing the noise.
@flanglet: Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue.
The Windows 7 machine has an Intel Core i7-2600 (Sandy Bridge) CPU.
@TocarIP, might be up your alley.
Couldn't reproduce on Sandy Bridge
Well, that is weird. It happens every time on my machine.
Please answer these questions before submitting your issue. Thanks!
go version: go1.6 windows/amd64 and go1.7beta1 windows/amd64
go env:
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
set GORACE=
set GOROOT=E:\Program Files\go
set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
set GO15VENDOREXPERIMENT=1
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
set CXX=g++
set CGO_ENABLED=1
I ran "go build TestZRLT.go" then "TestZRLT.exe" for both Go 1.6 and 1.7beta1.
The source code is very simple: https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/test/TestZRLT.go.
It runs a correctness and a performance test for the Zero Run Length Transform:
https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/function/ZRLT.go.
I expected to see no performance regression from 1.6 to 1.7beta1.
ZRLT encoding is much faster with 1.7beta1 but decoding is much slower.
Output for 1.6:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 10694
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419
Throughput [MB/s]: 321
ZRLT encoding [ms]: 10753
Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472
Throughput [MB/s]: 319
ZRLT encoding [ms]: 10724
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393
Throughput [MB/s]: 322
Output for 1.7beta1:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6834
Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560
Throughput [MB/s]: 206
ZRLT encoding [ms]: 6828
Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589
Throughput [MB/s]: 205
ZRLT encoding [ms]: 6790
Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558
Throughput [MB/s]: 206
I narrowed down the issue to the run length decoding loop:
If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6).
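For anyone wanting to check that the two loop conditions are interchangeable, here is a small sketch. It assumes val is a byte (unsigned); for a signed type the two forms differ on negative values. The helper names are made up for this example and are not taken from ZRLT.go.

```go
package main

import "fmt"

// isRunSymbolCmp is the original form of the loop condition.
func isRunSymbolCmp(val byte) bool { return val <= 1 }

// isRunSymbolMask is the rewritten form: masking with 1 leaves val
// unchanged only when val is 0 or 1.
func isRunSymbolMask(val byte) bool { return val&1 == val }

// conditionsAgree reports whether both forms give the same answer for
// every possible byte value.
func conditionsAgree() bool {
	for v := 0; v < 256; v++ {
		if isRunSymbolCmp(byte(v)) != isRunSymbolMask(byte(v)) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(conditionsAgree()) // prints true
}
```

Since the two conditions are logically equivalent for bytes, any speed difference between them comes from the code the compiler generates, not from the algorithm.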
Output for 1.7beta1 with code change:
Speed test
Iterations: 50000
ZRLT encoding [ms]: 6800
Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669
Throughput [MB/s]: 310
ZRLT encoding [ms]: 6813
Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689
Throughput [MB/s]: 310
ZRLT encoding [ms]: 6775
Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662
Throughput [MB/s]: 311