runtime: frequent enlisting of short-lived background workers leads to performance regression with async preemption #37465
Somebody proposed removing 'ReportAllocs'. It has an impact, but not a very big one:
Turning off DEP seems to reduce the impact, but it is still considerably slower:
This actually matches my initial measurements before isolating the issue, so I am unsure whether this is just variance based on system state.
cc @mknyszek
/cc @aclements @rsc @randall77
Or... hm. I spoke too soon. Five runs show that the benchmark is pretty noisy on Linux, and it's not obvious that there's a regression.
That's what I get for drawing conclusions from just one benchmark run. This probably needs many more runs to see a concrete difference, if one exists. Since this seems much more easily reproducible on Windows, I'm going to try there.
Cannot reproduce on linux.

```
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/leaf/.cache/go-build"
GOENV="/home/leaf/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leaf/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go-1.14"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.14/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build636896530=/tmp/go-build -gno-record-gcc-switches"
$ ~/go/bin/benchstat bench_1.1{3,4}.txt
```
Edit: benchmark run with ReportAllocs removed.
While I set up my Windows environment: @klauspost, what suggests to you that this is a regression in allocation performance vs. something else?
@mknyszek It is only really doing allocations. The compression part is trivial, but it needs enough content to activate all allocations. Profiles: alloc-pprof.zip
SVGs: Go 1.13: https://files.klauspost.com/profile002.svg
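For anyone following along who wants to collect comparable profiles: the attached ones were presumably produced with `go test`'s profile flags (`-cpuprofile` / `-memprofile`). The sketch below is a hypothetical standalone harness (not part of the original report) that does the same thing with `runtime/pprof`; the payload and iteration count are made up and only stand in for the real benchmark body.

```go
// Hypothetical profiling harness: capture a CPU profile of an
// allocation-heavy compression loop so it can be inspected with
// `go tool pprof cpu.out` under different Go versions.
package main

import (
	"bytes"
	"compress/gzip"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Arbitrary stand-in workload: repeatedly gzip a buffer, allocating a
	// fresh writer and output buffer each time.
	payload := bytes.Repeat([]byte("some reasonably compressible data. "), 4096)
	for i := 0; i < 2000; i++ {
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		if _, err := w.Write(payload); err != nil {
			log.Fatal(err)
		}
		if err := w.Close(); err != nil {
			log.Fatal(err)
		}
	}
}
```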
FWIW, I ran it on Linux arm64 as well (a shared VM, so anything goes), but the results seem consistent:
```
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/leitzler/.cache/go-build"
GOENV="/home/leitzler/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/leitzler/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build024325012=/tmp/go-build -gno-record-gcc-switches"
```
@klauspost Thanks for the profiles! I was about to do this myself. I disagree that this indicates anything to do with allocation; none of those symbols are involved in allocation. The first allocation-related symbol I see in the top 20 of the diff is right at the bottom (and it's a deleted symbol at that!).
What's more interesting to me is the prevalence of
It looks like a lock in
@klauspost Your SVGs suggest the same thing (look at the difference between stdcall2 in both) AFAICT.
CC @aclements because asynchronous preemption.
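Since asynchronous preemption is the suspect, one quick sanity check is to compare the same workload with preemption signals disabled: Go 1.14 supports `GODEBUG=asyncpreemptoff=1` for exactly this. The reproducer below is hypothetical (not something posted in the thread); it assumes `GODEBUG` is not already set in the parent environment, and the payload/iteration count are arbitrary.

```go
// Hypothetical reproducer: time an allocation-heavy gzip loop, then re-run the
// same binary with asynchronous preemption disabled and print both timings.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func work() time.Duration {
	payload := bytes.Repeat([]byte("some reasonably compressible data. "), 4096)
	start := time.Now()
	for i := 0; i < 2000; i++ {
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		w.Write(payload)
		w.Close()
	}
	return time.Since(start)
}

func main() {
	if os.Getenv("REPRO_CHILD") != "" {
		fmt.Println("asyncpreemptoff=1:", work())
		return
	}
	fmt.Println("default:          ", work())
	// Re-run this binary with async preemption turned off (assumes GODEBUG is
	// otherwise unset).
	cmd := exec.Command(os.Args[0])
	cmd.Env = append(os.Environ(), "REPRO_CHILD=1", "GODEBUG=asyncpreemptoff=1")
	out, err := cmd.Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	os.Stdout.Write(out)
}
```

If the gap largely disappears with `asyncpreemptoff=1`, that points at the preemption machinery (thread suspension on Windows) rather than the allocator itself.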
@leitzler Could you share some profiles? I don't readily have access to dedicated arm64 hardware and I'm curious if that's the same thing or a different issue.
@aclements Most of it is coming from
Digging deeper with @cherrymui, it looks like most of the preemption requests are coming from
There's a comment there about #19112, which cites deadlocks as the reason we can't just take an idle P...
@leitzler Can you please file a new issue? It's almost certainly unrelated to this one, because this is all about Windows-specific code.
@mknyszek I'm no expert on this, but it could be processor-specific AFAICT. Running on a Ryzen 3950X here (16c/32t). I will run on my 6c/12t Intel laptop and see if it reproduces as well. Edit: 6c/12t shows 24% degradation as well. HOWEVER, by running with
I'll give it another run with profiles later tonight when I get home. (In a new issue, depending on how this turns out in the meantime.)
The number of threads affects the numbers quite a bit. Command line:
Even though the test is single-threaded, the total number of threads available has a very big impact. I am rerunning the benchmark with
Edit: To confirm the numbers, here are 5 runs of each:
(The benchmark auto-terminated, but the numbers for gzip-16 and gzip-32 should be confirmed as well.)
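For readers: the `gzip-16` / `gzip-32` suffixes are the GOMAXPROCS values that `go test` appends to benchmark names, so the sweep above was presumably driven by something like `go test -bench=. -cpu=1,2,4,8,16,32`. The same comparison can also be made inside a single benchmark; the sketch below is hypothetical (names and payload are made up), not the benchmark from the report.

```go
// Hypothetical GOMAXPROCS sweep, equivalent in spirit to `go test -cpu=...`:
// each iteration is single-threaded, but the number of available Ps determines
// how many background workers can issue preemption requests.
package gzipbench

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"runtime"
	"testing"
)

var payload = bytes.Repeat([]byte("some reasonably compressible data. "), 4096)

func BenchmarkGzipByProcs(b *testing.B) {
	for _, procs := range []int{1, 2, 4, 8, 16, 32} {
		procs := procs // capture for the sub-benchmark closure
		b.Run(fmt.Sprintf("procs-%d", procs), func(b *testing.B) {
			prev := runtime.GOMAXPROCS(procs)
			defer runtime.GOMAXPROCS(prev) // restore the original value
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				var buf bytes.Buffer
				w := gzip.NewWriter(&buf)
				w.Write(payload)
				w.Close()
			}
		})
	}
}
```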
Yeah, I think this is consistent with our previous analysis. There is no contention if there is just one thread. The more threads, the more threads trying to preempt each other, and thus the heavier the contention.
It isn't dedicated hardware, so I'm not sure how much it is worth; it's an 8 vCPU / 8 GB VM. I re-ran the test 10 times each, alternating between 1.13 and 1.14, and the difference was actually much smaller:
@mknyszek I don't know why you changed the title to 'compression benchmark'. This is not related to compression. Change the benchmark to:
and the result is:
@klauspost That's true, thanks for pointing it out. I'll fix it again.
@leitzler Looking at the profiles... it looks like it's exactly the same issue.
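The replacement benchmark and its numbers aren't captured in this extract. Purely as a hypothetical illustration of the point being made above (the regression shows up with plain allocation, no compression involved), an allocation-only benchmark of that general kind might look like the following; none of these names come from the original code.

```go
// Hypothetical allocation-only benchmark: it does nothing but allocate
// short-lived buffers, which is enough to keep the GC (and hence background
// mark workers and their preemption requests) busy.
package allocbench

import "testing"

// sink is package-level so the allocations escape to the heap and are not
// optimized away.
var sink []byte

func BenchmarkAllocOnly(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf := make([]byte, 64<<10) // 64 KiB, immediately becomes garbage
		buf[0] = byte(i)
		sink = buf
	}
}
```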
So I think fixing #19112 is feasible... but it's tricky. Anything we do would basically need an invariant that you can't try to change a G's status while holding

The fix is really simple, I think: make

The alternative is to have enlistWorker hand off the "wakep" to somebody else (like a "p waker" goroutine or sysmon or something), but the delay there may be too high. @aclements and @ianlancetaylor, WDYT?

Yet another alternative is to not try to solve this problem and instead avoid having enlistWorker fire so often. This would require a heuristic or something for when more mark workers are actually needed, and would require more thought.
@mknyszek Well, what I think is that I think someone should review https://golang.org/cl/216198, which seems to already do what you want.
@ianlancetaylor Ah. Well, I totally forgot about that patch. :) Happy to review it.
Change https://golang.org/cl/223797 mentions this issue:
Should this have Milestone 1.15 and backport to 1.14?
@networkimprov Milestone 1.15 definitely. Backport to 1.14 might be possible, it's a bit risky though (just https://golang.org/cl/223797 isn't actually enough; the change it depends on is larger).
@klauspost I haven't gotten around to verifying this myself yet (some noise problems with the Windows box I have access to), but it would still be good to make sure the change fixes the problem for you (perhaps there's a difference between our Windows systems; and it also looks like maybe you have a less noisy system). Could you please try to run your benchmark with Go at the current master branch and https://golang.org/cl/223797 applied on top?
I will attempt to verify it tomorrow 👍
@klauspost The freeze is coming soon, friendly ping! :) Please give it a try if you get the chance or let me know if you can't.
Sorry @mknyszek. I didn't have a build system set up, so that got me sidetracked. Unfortunately it doesn't seem to make any measurable difference for me, and the original regression still stands.
Hmmm... I see. Thank you for trying! This needs more investigation, then. That CL is still probably worth landing (if only to close #19112).
Moving milestone to 1.16.
Unfortunately, CL 223797 still has some lock ordering issues, so we've decided it's safer to bump this to 1.17.
Kicking this to the backlog. I don't have time before the release to try to get this change in. It's pretty risky, and doesn't even resolve the issue here. To that point, @tklauser it's been a long time, but is this still observable for you? Did anything improve?
@klauspost I believe the previous comment was intended for you :-)
Still seeing pretty much the same performance as 1.14:
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Run benchmark: https://play.golang.org/p/WeuJg6yaOuJ
go test -bench=. -test.benchtime=10s used to test.
What did you expect to see?
Close or similar benchmark speeds.
What did you see instead?
40% performance regression.
This is not a purely theoretical benchmark. While suboptimal, this is the easiest way to compress a piece of data, so this will be seen in the wild. It could also indicate a general regression for applications allocating a lot.
Edit: This is not related to changes in the referenced packages. Seeing this when using packages outside the stdlib as well.
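The playground code itself isn't reproduced in this extract, but a minimal sketch of the kind of benchmark described (compressing a buffer with the stdlib gzip writer each iteration, with allocation reporting on) would look roughly like the following. The payload size and names are assumptions, not the original code.

```go
// Minimal sketch of the style of benchmark described above: each iteration
// allocates a fresh gzip.Writer and output buffer, the "easiest way to
// compress a piece of data", which makes the loop very allocation-heavy.
package gzipbench

import (
	"bytes"
	"compress/gzip"
	"testing"
)

var payload = bytes.Repeat([]byte("some reasonably compressible data. "), 4096)

func BenchmarkGzip(b *testing.B) {
	b.ReportAllocs()
	b.SetBytes(int64(len(payload)))
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		w := gzip.NewWriter(&buf)
		if _, err := w.Write(payload); err != nil {
			b.Fatal(err)
		}
		if err := w.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```

Running this with `go test -bench=. -test.benchtime=10s` under Go 1.13 and 1.14 and comparing the outputs with benchstat, as done earlier in the thread, is how the regression was quantified.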