runtime: TestCgoPprof{,PIE} is flaky #37201
Change https://golang.org/cl/219417 mentions this issue:
I suspect the test is flaky because it only waits for 2+ cpuHog frames before stopping the profile. I'm not familiar with how pprof decides which samples to include in the profile, but it looks like in some rare cases cpuHog doesn't make the cut. The above log shows 0 samples reported in 200ms. That indicates the test loop exited early with at least 2 samples (since 200ms < the 1 second loop timeout), but they weren't reported. If so, collecting more samples should significantly reduce the likelihood of a flake. I'm currently running a large number of tests to see if I can catch the flake. If that works, I should be able to confirm the above patch fixes the problem.
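For context, a minimal Go sketch of the shape of the loop being described, using illustrative names (cpuHog, threshold) and a pure-Go hog rather than the real testprogcgo C code:

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
	"time"
)

var sink int

// cpuHog stands in for the C function the real test profiles; it just burns CPU.
func cpuHog() {
	for i := 0; i < 1e7; i++ {
		sink += i
	}
}

func main() {
	f, err := os.CreateTemp("", "prof")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Keep hogging the CPU until "enough" events have been observed or one
	// second has passed, whichever comes first. If the loop exits after only
	// ~200ms, very few SIGPROF samples have been taken (the default rate is
	// 100 Hz) and cpuHog may not appear in the profile at all.
	const threshold = 2 // the "2+ frames" mentioned above
	events := 0
	start := time.Now()
	for events < threshold && time.Since(start) < time.Second {
		cpuHog()
		events++ // in the real test this counter is driven by the cgo traceback hook, not the loop itself
	}

	pprof.StopCPUProfile()
	fmt.Println(f.Name())
}
```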
I couldn't reproduce the flake on an unloaded machine, but I managed to reproduce it after ~13000 runs on a reasonably heavily loaded machine (lots of IO, CPU load and scheduling keeping the cores 100% busy).
Retrying with the patch now.
I was unable to reproduce the flake with the patch after ~200000 tests. Without the patch it typically fails within 5000-20000 tests (both tested from +123f7dd3e1). I had a quick look through the profiling code and I can't see any good reason why frames should be dropped at all for this test. It's not clear to me why "2" frames were originally chosen. Going in the other direction, waiting for only a single cpuHog frame appears reliable on go1.13.7, but fails easily on go1.14rc1 (usually <100 tests). Running a bisect now to see where this test started failing intermittently.
The test is flaky from 177a36a (runtime: implement async scheduler preemption) onwards. The test is reliable with async preemption disabled. The CGO traceback is definitely called (as noted above, I also confirmed with a trace write separately). There is some bad interaction between signal preemption and profiling which causes the sample to be lost. Perhaps this test should only use a single frame? That would have surfaced the issue much earlier.
@bcmills @aclements @ianlancetaylor It might be worth investigating this issue further before Go 1.14 is released to confirm the damage is limited to loss of some profile samples (and not something more fundamental).
The test doesn't count the number of profiling samples that are taken. It counts the number of times the CGO traceback function is called, and that function now also runs for preemption signals. So I don't think there is a fundamental problem here. It's a problem with the way that the test is written. It's not a problem that would happen with real code. Unfortunately I don't see a simple fix offhand, other than disabling signal-based preemption for the test.
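To make that distinction concrete, here is a hedged cgo sketch of the mechanism (function names such as countingTraceback and spin are made up, and this is not the actual testprogcgo code): the hook registered with runtime.SetCgoTraceback runs for any signal that arrives while C code is executing, including SIGURG preemption signals, so counting its invocations overstates the number of CPU samples actually recorded.

```go
package main

/*
#include <stdint.h>

static int tracebackCalls;

// Layout of the single argument runtime.SetCgoTraceback passes to the
// traceback function (version 0).
struct cgoTracebackArg {
	uintptr_t context;
	uintptr_t sigContext;
	uintptr_t *buf;
	uintptr_t max;
};

// countingTraceback reports no frames; it only counts how often it runs.
void countingTraceback(void *parg) {
	struct cgoTracebackArg *arg = (struct cgoTracebackArg *)parg;
	arg->buf[0] = 0;
	__sync_fetch_and_add(&tracebackCalls, 1);
}

void *tracebackFunc(void) { return (void *)countingTraceback; }

int getTracebackCalls(void) { return __sync_fetch_and_add(&tracebackCalls, 0); }

// spin burns CPU in C so that signals arrive while C code is on the stack.
void spin(void) {
	volatile uint64_t x = 0;
	for (uint64_t i = 0; i < 500000000ULL; i++) x += i;
}
*/
import "C"

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	runtime.SetCgoTraceback(0, C.tracebackFunc(), nil, nil)
	f, _ := os.CreateTemp("", "prof")
	defer f.Close()
	pprof.StartCPUProfile(f)
	C.spin()
	pprof.StopCPUProfile()
	// The hook count can be much larger than the number of samples in the
	// profile, because preemption signals (SIGURG) reach the hook too.
	fmt.Println("traceback hook calls:", int(C.getTracebackCalls()))
}
```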
Ah, great. Thanks for confirming it is expected that the traceback function is called more often than samples are recorded. I've reverted my change so it waits for more cpuHog frames.
Maybe we should simply disable signal-based preemption when running the test. The point of the test is just to see that we pick up a stack trace for cpuHog.
Sorry, missed that suggestion earlier - much better idea. I've updated the deflake change to disable preemption instead. Given this flake was only noticed after signal preemption was introduced, disabling it should more or less make the flake disappear again. I haven't been able to reproduce it after disabling preemption.
The CGO traceback function is called whenever CGO code is executing and a signal is received. This occurs much more frequently now SIGURG is used for preemption.

Disable signal preemption to significantly increase the likelihood that a signal results in a profile sample during the test.

Updates #37201

Change-Id: Icb1a33ab0754d1a74882a4ee265b4026abe30bdc
Reviewed-on: https://go-review.googlesource.com/c/go/+/219417
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
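For illustration only, a rough sketch of the approach the commit message describes: run the child test program with GODEBUG=asyncpreemptoff=1 so the SIGURG preemption signal is disabled for that process. The helper name and binary path here are hypothetical, not the actual CL:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runCgoPprofProg runs a (hypothetical) pre-built test binary with
// asynchronous, signal-based preemption turned off, so SIGPROF is the only
// signal likely to hit the cgo traceback hook during the test.
func runCgoPprofProg() (string, error) {
	cmd := exec.Command("./testprogcgo", "CgoPprof")
	cmd.Env = append(os.Environ(), "GODEBUG=asyncpreemptoff=1")
	out, err := cmd.CombinedOutput()
	return string(out), err
}

func main() {
	out, err := runCgoPprofProg()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Print(out)
}
```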
FYI, I left this issue open in case it should track fixing the test approach and/or the runtime.
Still flaky after CL 219417:
I also got the failure with a recently rebased change: https://storage.googleapis.com/go-build-log/3d4facca/linux-amd64-race_bb62fb2a.log
Including both TestCgoPprofPIE and TestCgoPprof, it looks like this got quite a bit worse around 2020-05-11. I've also seen it in quite a few TryBot results recently. Marking as release-blocker for Go 1.15 due to the likelihood of a regression here.
@bcmills Can I make this a non-beta1 blocking issue?
Yep, that seems fine.
I just got a failure at https://storage.googleapis.com/go-build-log/9b94d0b5/linux-amd64-race_d785929a.log:
--- FAIL: TestCgoPprof (3.47s)
crash_cgo_test.go:317: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/go-build787198486/testprogcgo_.exe /workdir/tmp/prof847643684]:
File: testprogcgo_.exe
Build ID: a25028f2fff69a0864f81c4cca9180f2f354c52b
Type: cpu
Time: May 29, 2020 at 10:00am (UTC)
Duration: 200.66ms, Total samples = 0
-----------+-------------------------------------------------------
crash_cgo_test.go:325: cpuHog traceback missing.
crash_cgo_test.go:317: [/workdir/go/bin/go tool pprof -traces /workdir/tmp/prof847643684]:
File: testprogcgo_.exe
Build ID: a25028f2fff69a0864f81c4cca9180f2f354c52b
Type: cpu
Time: May 29, 2020 at 10:00am (UTC)
Duration: 200.66ms, Total samples = 0
-----------+-------------------------------------------------------
crash_cgo_test.go:325: cpuHog traceback missing.
FAIL
FAIL runtime 54.956s
FAIL
go tool dist: Failed: exit status 1
Could this be similar to #39021, in that the program got signaled with a SIGWINCH sent to the process group, which triggered pprofCgoTraceback to run too early?
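For what it's worth, a signal sent to a process group (a negative pid in kill(2), which is how a terminal delivers SIGWINCH on resize) reaches every member of the group, so a child that happens to be executing C code would have its cgo traceback hook invoked. A minimal Unix-only sketch of a group-wide signal, using the current process group purely for illustration:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	pgid, err := syscall.Getpgid(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// A negative pid asks kill(2) to deliver the signal to every process in
	// the group, the way a terminal resize delivers SIGWINCH to the
	// foreground process group. SIGWINCH is ignored by default, so this is
	// harmless to run.
	if err := syscall.Kill(-pgid, syscall.SIGWINCH); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```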
I think CL https://golang.org/cl/235557 also fixes this. On my machine the failure rate was ~10% (running the runtime test in a loop); after applying that CL it doesn't fail in 100+ iterations.
This hasn't happened since CL 235557 was committed, and findflakes estimates a 0.00% probability that this is still broken. Closing.
2020-02-12T18:22:50-363bcd0/linux-ppc64le-power9osu
2020-02-07T18:08:01-b806182/linux-amd64-jessie
Given the timing of the logs, it's not clear to me whether this is a regression in Go 1.14 or simply an already-flaky test coming to the surface.
CC @mknyszek @aclements @hyangah @ianlancetaylor @mpx