cmd/compile: random performance fluctuations after unrelated changes #8717
Comments
If you want to try to figure out how this could be loop alignment, please go ahead. I spent days on this a few years ago and got absolutely nowhere. I can't find any evidence that loop alignment matters. It may be something else entirely, but in general modern CPUs are black boxes that can misbehave on a whim and - at least for people not privy to the inner workings - simply cannot be fully understood. They are subtle and quick to anger. If you want to try the loop alignment hypothesis, you could edit src/liblink/asm6.c. Look for LoopAlign (MaxLoopPad = 0 means alignment is turned off right now). I am removing the Release-Go1.5 tag. If someone wants to work on this, great, but I am not going to promise to waste any more time on this. Long ago I made peace with the fact that this kind of variation is something we just have to live with sometimes.
Labels changed: added release-none, removed release-go1.5.
Status changed to Accepted.
I can confirm your conclusion. We need to wait until Go becomes important enough that processor manufacturers allocate engineers to optimizing for it. I've tried aligning back-branch targets, and then all branch targets, to 16 bytes (https://golang.org/cl/162890043), with no success. Aligning back-branch targets increased binary size by 5.1%; aligning all branch targets increased it by 8.3%. So if we do it, we need something smarter, e.g. aligning only within real loops. I've checked that the stack segment address and the fs register have equal values in both binaries, so we can rule those out. Since the code has moved, the data segment also has a different address, so maybe it's related to data. But I don't see any functions in the profile that heavily access global data...
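For anyone who wants to experiment with the padding arithmetic, here is a minimal sketch of how many filler bytes a 16-byte alignment of a branch target costs. It is not the code from asm6.c or from the CL above; `padding`, `alignTarget`, and the `maxPad` cap are hypothetical names, and the cap only loosely mirrors the idea behind MaxLoopPad (a cap of 0 means no padding is ever inserted). The addresses are made up for illustration.

```go
package main

import "fmt"

// padding returns the number of filler bytes needed so that pc becomes a
// multiple of align. align must be a power of two.
func padding(pc, align uint64) uint64 {
	return (align - (pc & (align - 1))) & (align - 1)
}

// alignTarget rounds pc up to a multiple of align, unless doing so would
// need more than maxPad filler bytes, in which case pc is returned as-is.
// maxPad is a hypothetical knob; maxPad == 0 disables padding entirely.
func alignTarget(pc, align, maxPad uint64) uint64 {
	p := padding(pc, align)
	if p > maxPad {
		return pc
	}
	return pc + p
}

func main() {
	// Made-up branch target addresses.
	for _, pc := range []uint64{0x401000, 0x401003, 0x40100f} {
		fmt.Printf("pc=%#x pad=%d aligned=%#x\n",
			pc, padding(pc, 16), alignTarget(pc, 16, 15))
	}
}
```

With uniformly distributed targets the expected padding is about 7.5 bytes per aligned target, which is why aligning every branch target adds up quickly in binary size.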
Just debugged another case, which turned out to be this issue. go version devel +b4538d7 Wed May 11 06:00:33 2016 +0000 linux/amd64
Then depending on presence of the following patch:
The test program:
With the call commented out I consistently see:
Without the call commented out:
All time is spent in computations:
drawPaletted magically becomes faster after the change. The diff of the CPU profiles still does not make any sense to me; it looks like the percentages are just randomly shuffled.
Loop alignment still makes the program slower.
In the fast version the function is aligned to 0x10:
and in the slow version to 0x20:
If I set function alignment to 0x20 (which is broken due to asm functions, so it actually gives me 0x10 alignment for the function), it mostly fixes the problem:
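A quick way to check which alignment a function entry point happens to satisfy is to look at the lowest set bit of its address. The sketch below is only an illustration: `alignmentOf` and `hot` are hypothetical names, the two hard-coded addresses are made up, and `reflect.Value.Pointer` returns an underlying code pointer that is not guaranteed to identify a function uniquely, so treat the last line of output as a rough hint.

```go
package main

import (
	"fmt"
	"reflect"
)

// alignmentOf returns the largest power of two that divides addr, i.e. the
// strongest alignment the address satisfies. It returns 0 for addr == 0.
func alignmentOf(addr uintptr) uintptr {
	return addr & -addr
}

// hot is a stand-in for the function whose placement we want to inspect.
func hot() {}

func main() {
	// Made-up addresses: one aligned to 0x10 only, one aligned to 0x20.
	for _, a := range []uintptr{0x45f410, 0x45f420} {
		fmt.Printf("%#x is aligned to %#x\n", a, alignmentOf(a))
	}

	// Code pointer of a real function in this binary.
	entry := reflect.ValueOf(hot).Pointer()
	fmt.Printf("hot: entry=%#x alignment=%#x\n", entry, alignmentOf(entry))
}
```

Comparing this value between the fast and the slow binary shows whether the function really moved across an alignment boundary or only shifted within one.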
I'm reminded of:
From the second paper:
Microsoft Research had a tool that would link a program multiple times, where each binary used a different (randomized) function order, then they'd run tests and pick the best function order. Unfortunately, my Google-fu is failing me and I cannot find a reference. (The closest I can find is VC's /ORDER linker option, which looks like it could be used to implement this feature.)
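The search loop behind such a tool can be sketched in a few lines. This is not the Microsoft Research tool, which I could not find either; `bestOrder` and the `cost` callback are hypothetical, and in a real setup `cost` would relink the binary with the given function order and run the benchmark. Here it is replaced by a toy function so the sketch runs on its own.

```go
package main

import (
	"fmt"
	"math/rand"
)

// bestOrder evaluates the original order plus `trials` random permutations
// of funcs and returns the cheapest order found together with its cost.
// The input slice is not modified.
func bestOrder(funcs []string, trials int, seed int64, cost func([]string) float64) ([]string, float64) {
	rng := rand.New(rand.NewSource(seed))
	best := append([]string(nil), funcs...)
	bestCost := cost(best)
	cand := append([]string(nil), funcs...)
	for i := 0; i < trials; i++ {
		rng.Shuffle(len(cand), func(a, b int) { cand[a], cand[b] = cand[b], cand[a] })
		if c := cost(cand); c < bestCost {
			bestCost = c
			copy(best, cand)
		}
	}
	return best, bestCost
}

func main() {
	funcs := []string{"init", "helper", "hot", "main"}

	// Toy cost (assumption): pretend the benchmark gets faster the earlier
	// the function named "hot" is placed. A real cost would link and time.
	toy := func(order []string) float64 {
		for i, f := range order {
			if f == "hot" {
				return float64(i)
			}
		}
		return float64(len(order))
	}

	order, c := bestOrder(funcs, 20, 1, toy)
	fmt.Println("best order:", order, "cost:", c)
}
```

Because each trial needs a full link and benchmark run, the number of trials is the main cost knob; the noise in the timings also has to be smaller than the differences between orders for the selection to mean anything.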
For reference, the effect is fully explained here:
(including a way to denoise the timings)
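As a generic illustration of denoising, and not necessarily the method described in the material referenced above, one common approach is to repeat the measurement many times and compare a robust statistic such as the minimum or the median instead of a single run. `measure` and `summarize` below are hypothetical helpers, and the workload is a placeholder.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// measure runs f `runs` times and returns the wall-clock time of each run
// in seconds.
func measure(runs int, f func()) []float64 {
	out := make([]float64, 0, runs)
	for i := 0; i < runs; i++ {
		start := time.Now()
		f()
		out = append(out, time.Since(start).Seconds())
	}
	return out
}

// summarize returns the minimum and the median of samples without modifying
// the input. It returns (0, 0) for an empty slice.
func summarize(samples []float64) (lo, med float64) {
	if len(samples) == 0 {
		return 0, 0
	}
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	n := len(s)
	med = s[n/2]
	if n%2 == 0 {
		med = (s[n/2-1] + s[n/2]) / 2
	}
	return s[0], med
}

func main() {
	// Placeholder workload.
	work := func() {
		sum := 0
		for i := 0; i < 1000000; i++ {
			sum += i
		}
		_ = sum
	}
	lo, med := summarize(measure(15, work))
	fmt.Printf("min=%.6fs median=%.6fs\n", lo, med)
}
```

For Go benchmarks, the benchstat tool from golang.org/x/perf applies a similar idea to repeated `go test -bench` runs. Note that repetition only removes run-to-run noise; a layout effect like the one in this issue is systematic within a single binary and survives it.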
Change https://golang.org/cl/332771 mentions this issue: