runtime: performance scaling degradation in some Chan benchmarks when futexes became private #25625
Comments
Could the problem be that the futex code is faster, causing more contention on the memory cache lines? I benchmarked the change on amd64 and didn't see any significant differences. The hope was that the system load would be a bit less overall, and in any case using the |
Can you benchmark on ppc64le with tip and with tip with |
Yes, it looks like the futex code is faster and there is more contention on the Xchg and Cas. But I'm surprised it causes such a huge degradation.
My first experiment was to build with the commit where the futex was made private and compare it against the commit right before it; that is where the degradation occurs. I will also try your experiment with latest.
If there is a CAS involved, that could be the problem: it could be that the futex used to serialize the CAS operations so that they always succeeded, but now the code is fast enough that each CAS can fail multiple times (presumably with a cache miss on each iteration).
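The retry effect described in this comment can be observed with a small user-space sketch. This is not the runtime's code: the `casIncrement` helper and its parameters are made up for illustration, and it only counts how often a compare-and-swap loses a race and has to retry when several goroutines hit the same word.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// casIncrement is a hypothetical helper: `workers` goroutines each add 1 to a
// shared counter `perWorker` times using a compare-and-swap retry loop.
// It returns the final counter value and the number of failed CAS attempts.
func casIncrement(workers, perWorker int) (total, failed int64) {
	var counter, retries int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var localRetries int64
			for i := 0; i < perWorker; i++ {
				for {
					old := atomic.LoadInt64(&counter)
					if atomic.CompareAndSwapInt64(&counter, old, old+1) {
						break
					}
					// Another goroutine updated counter between our load
					// and our CAS, so this attempt is wasted work.
					localRetries++
				}
			}
			atomic.AddInt64(&retries, localRetries)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&counter), atomic.LoadInt64(&retries)
}

func main() {
	total, failed := casIncrement(8, 100000)
	fmt.Printf("total=%d failed CAS attempts=%d\n", total, failed)
}
```

The final total is always `workers * perWorker`; the number of failed attempts varies from run to run and machine to machine, which is the point: the faster each participant reaches the CAS, the more attempts can collide.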
Here are the results comparing latest with and without setting futex to private. I also reverted back to the way benchmark timings are done with count > 1 (#25622) to avoid flaky results. old = latest
Thanks for the measurements. It seems unfortunate to throw sand in the gears to reduce memory contention during microbenchmarks. My reading of the futex code is that it more or less assumes that Similarly we should probably implement At least according to https://sourceware.org/ml/libc-alpha/2012-11/msg00761.html, the |
Carlos is looking into the procyield. We have found that these results may be related to the kernel on the machine where I did the initial testing, i.e., it was old. I am trying newer kernels and those don't seem to have the same degradation. Let us dig into this more before you spend any more time on it.
@ianlancetaylor I created a procyield() implementation for ppc64x. I did not use the 'yield' hint, though, as it is no longer recommended in ISA 3.0. I used the Program Priority Register instead. The results are pretty good compared against the current master:
If you think it's reasonable, I can submit it now during the freeze.
@ceseo You should note what kernel those results were collected on and that the futex private change did not show a degradation on that kernel.
It doesn't degrade on 4.12. Here's 1.10 vs master:
Btw, there are two kernel patches that might explain why this degradation is gone in recent kernels. See kernel commits fd851a3cdc196 and ede8e2bbb0eb3.
@ceseo Sure, let's put it in for 1.11. Thanks.
Change https://golang.org/cl/115376 mentions this issue: |
The degradation with private futexes happens on RHEL 7 POWER8 systems using kernel 3.10. We've run on other Ubuntu and SUSE systems with kernel 4.10 or higher, on POWER8 and POWER9, and the degradation does not occur on those.
This might be something to mention in the release notes for Go 1.11? Something like: With the change to private futexes in Go 1.11, it is possible to see a performance degradation in some channel functions if running on Linux kernels earlier than 4.x on ppc64le. Moving to kernel 4.0 or later avoids the problem.
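As a rough illustration of the condition in the proposed release note, the sketch below checks an architecture name and a kernel release string against the "ppc64le with a kernel earlier than 4.x" rule. The helper names and the sample release strings are assumptions made for this example; the thread only confirms the degradation on kernel 3.10 and its absence on 4.10 and later, so the 4.0 cutoff is the note's wording, not a measured boundary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kernelMajor extracts the leading major version from a kernel release
// string in the style printed by `uname -r` (for example "3.10.0-...").
func kernelMajor(release string) (int, error) {
	parts := strings.SplitN(release, ".", 2)
	return strconv.Atoi(parts[0])
}

// mayHitPrivateFutexSlowdown is a hypothetical helper that encodes the rule
// from the proposed release note: ppc64le with a kernel earlier than 4.x.
// Unparseable release strings are treated as "unknown" and return false.
func mayHitPrivateFutexSlowdown(goarch, release string) bool {
	major, err := kernelMajor(release)
	if err != nil {
		return false
	}
	return goarch == "ppc64le" && major < 4
}

func main() {
	// Illustrative release strings, not taken from the thread.
	for _, r := range []string{"3.10.0-862.el7.ppc64le", "4.12.0"} {
		fmt.Printf("%s affected=%v\n", r, mayHitPrivateFutexSlowdown("ppc64le", r))
	}
}
```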
The procyield() function should yield the processor as in other architectures. On ppc64x, this is achieved by setting the Program Priority Register to 'low priority' prior to the spin loop, and setting it back to 'medium-low priority' afterwards.

benchmark                          old ns/op     new ns/op     delta
BenchmarkMakeChan/Byte-8           87.7          86.6          -1.25%
BenchmarkMakeChan/Int-8            107           106           -0.93%
BenchmarkMakeChan/Ptr-8            201           204           +1.49%
BenchmarkMakeChan/Struct/0-8       78.2          79.7          +1.92%
BenchmarkMakeChan/Struct/32-8      196           200           +2.04%
BenchmarkMakeChan/Struct/40-8      236           230           -2.54%
BenchmarkChanNonblocking-8         8.64          8.85          +2.43%
BenchmarkChanUncontended-8         5577          5598          +0.38%
BenchmarkChanContended-8           66106         51529         -22.05%
BenchmarkChanSync-8                451           441           -2.22%
BenchmarkChanSyncWork-8            9155          9170          +0.16%
BenchmarkChanProdCons0-8           1585          1083          -31.67%
BenchmarkChanProdCons10-8          1094          838           -23.40%
BenchmarkChanProdCons100-8         831           657           -20.94%
BenchmarkChanProdConsWork0-8       1471          941           -36.03%
BenchmarkChanProdConsWork10-8      1033          721           -30.20%
BenchmarkChanProdConsWork100-8     730           511           -30.00%
BenchmarkChanCreation-8            135           128           -5.19%
BenchmarkChanSem-8                 602           463           -23.09%
BenchmarkChanPopular-8             3017466       2188441       -27.47%

Fixes golang#25625

Change-Id: Iacb1c888d3c066902152b8367500348fb631c5f9
Reviewed-on: https://go-review.googlesource.com/115376
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
What version of Go are you using (go version)?
tip
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (go env)?
ppc64le, not sure about others
What did you do?
In comparing performance for some package benchmarks using latest against 1.10, it was noted that some of the runtime Chan benchmarks degraded by 2-4X. The degradation was tracked down to the change where futexes became private.
07751f4 runtime: use private futexes on Linux
What did you expect to see?
No significant change.
What did you see instead?
Significant degradation in the following:
The degradation is worse with higher values of GOMAXPROCS when comparing go 1.10 against latest.
According to the profiles, more time is spent in the atomic.Xchg statements in the lock function in runtime/lock_futex.go after the change to make futexes private, and less time is spent in the futex.
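For readers unfamiliar with the pattern the profile points at, here is a deliberately simplified user-space sketch of a lock that tries an atomic exchange, polls briefly, then yields. It is not the code in runtime/lock_futex.go: the `spinLock` type, the `activeSpins` constant, and the use of `runtime.Gosched` in place of procyield and a futex sleep are all assumptions made so the example runs as an ordinary program.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

const (
	unlocked    uint32 = 0
	locked      uint32 = 1
	activeSpins        = 4 // hypothetical spin count, chosen arbitrarily
)

// spinLock is an illustrative lock, not the runtime's mutex.
type spinLock struct{ state uint32 }

func (l *spinLock) Lock() {
	for {
		// The exchange is the contended step: every waiter writes to the
		// same word, so the cache line bounces between cores.
		if atomic.SwapUint32(&l.state, locked) == unlocked {
			return
		}
		// Poll with plain loads for a short while before trying again.
		for i := 0; i < activeSpins; i++ {
			if atomic.LoadUint32(&l.state) == unlocked {
				break
			}
		}
		// Still held: give up the processor instead of sleeping on a futex.
		if atomic.LoadUint32(&l.state) != unlocked {
			runtime.Gosched()
		}
	}
}

func (l *spinLock) Unlock() { atomic.StoreUint32(&l.state, unlocked) }

// countWithLock increments a plain int under the lock from several goroutines.
func countWithLock(workers, perWorker int) int {
	var l spinLock
	var wg sync.WaitGroup
	n := 0
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				l.Lock()
				n++
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	return n
}

func main() {
	fmt.Println(countWithLock(8, 10000))
}
```

In a lock of this shape, anything that shortens the slow path (here the yield, in the issue the futex call) sends waiters back to the exchange sooner, which is consistent with the profile shifting time from the futex to atomic.Xchg.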
@ianlancetaylor