runtime: "unexpected result from waitpid" in TestGdbPython #39021
I also got the same problem here: https://storage.googleapis.com/go-build-log/3d4facca/linux-amd64-staticlockranking_9d45d3a8.log It may be more likely when static lock ranking is enabled (because it is a timing issue?). I am also getting it on some local staticlockranking runs.
2020-05-12T19:15:34-cb11c98/linux-mips64le-mengzhuo
This is looking like a recent regression.
That one on
There is also an
Attempting to bisect now, but I'm not hopeful because of the existing flakiness of the test (#24616).
I'm unable to reproduce the failures at all on my local workstation using GDB 8.3.1 and
I'm not sure how to proceed from here.
/cc @ianlancetaylor
I've gotten exactly one local repro so far. I'm not sure what conditions are required to trigger it; system load seems to be a factor. However, the only repro I have gotten was without a test filter, which is even more difficult to use for bisection due to the presence of other flaky tests.
If this is due to EINTR, might pummeling the process with signals help?
An interesting idea, but it doesn't seem to help.
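For context, a rough sketch of what that kind of experiment might look like: pummel a child process with a normally-harmless signal while waiting on it, hoping to make EINTR-sensitive paths fail more often. This is not the actual command tried; the signal choice, timing, and stand-in child are all guesses, and how effective it is depends on the target's signal handling.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Start a stand-in child process to pummel with signals.
	cmd := exec.Command("sleep", "5")
	if err := cmd.Start(); err != nil {
		fmt.Println("start:", err)
		return
	}

	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				// SIGCONT is normally harmless to a running process; it is
				// used here only to generate signal traffic against the child.
				_ = syscall.Kill(cmd.Process.Pid, syscall.SIGCONT)
				time.Sleep(time.Millisecond)
			}
		}
	}()

	err := cmd.Wait()
	close(stop)
	fmt.Println("child exited:", err)
}
```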
@bcmills Just a friendly ping to check on the status of this as we work through our beta blocking issues.
I don't know how to fix this and the tests are so nondeterministic that I can't figure out how to even bisect it. I plan to file a proposal (later today?) to relocate these tests so that they are not run as part of
CL 232862 does seem like a likely candidate based on the CLs that landed around the time this issue started. Perhaps we should temporarily revert it and see if the problem disappears, given that bisecting appears fruitless?
Change https://golang.org/cl/235282 mentions this issue:
After ~10 runs of go test runtime, I managed to repro this with golang.org/cl/235282 on an s390 gomote:
So GDB is segfaulting... I'll see if I can manage to find the core file.
It seems like this must be some kind of GDB bug, though perhaps one we are tickling somehow.
Looking at the GDB source code,
Just to be sure, has the GDB binary on the builder been changed recently, e.g. upgraded from one version to another?
GDB 7.12 was released in 2016. Should we try a newer version of GDB on the builders?
FWIW, the crashes on the s390 builder are on GDB 8.2, which is a bit newer (2018). Digging into the #39021 (comment) crash a bit more:
Thus it seems it should be possible to trigger this crash by sending a SIGWINCH to a batch-mode GDB, though I haven't managed to do so.
@chlunde noted the same
All GDB tests currently ignore non-zero exit statuses. When a test flakes, we don't even know whether GDB exited successfully or not. Add checks for non-zero exits, which are not expected. Furthermore, always log the output from GDB. The tests are currently inconsistent about whether they always log, or only on error.

Updates #39021

Change-Id: I7af1d795fc2fdf58093cb2731d616d4aa44e9996
Reviewed-on: https://go-review.googlesource.com/c/go/+/235282
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
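For illustration, a minimal sketch of the kind of check described above; the test name and GDB invocation here are hypothetical and this is not the actual CL 235282 code:

```go
package main_test

import (
	"os/exec"
	"testing"
)

// TestGdbExitStatus sketches the idea: always log GDB's output, and treat a
// non-zero exit status from GDB itself as a test failure.
func TestGdbExitStatus(t *testing.T) {
	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "show version")
	got, err := cmd.CombinedOutput()
	t.Logf("gdb output:\n%s", got) // always log, not only on failure
	if err != nil {
		t.Fatalf("gdb exited with error: %v", err)
	}
}
```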
I've dumped all of the core files on the s390 builder, and they all have the same stack trace as #39021 (comment)
Though it remains to be seen whether all of the GDB failures are SEGVs with core files. Now that https://golang.org/cl/235282 is in, we should be getting more information on new failures.
Two new segfaults: 2020-05-27T19:54:12-0d20a49/linux-amd64-staticlockranking

This is trivially reproducible with my system GDB with:

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "run", "--args", "sleep", "60")
	go func() {
		// XXX: This isn't safe! cmd.Process may not be set yet; the sleep is
		// just a crude way to wait for CombinedOutput below to start GDB.
		time.Sleep(1 * time.Second)
		fmt.Println("Sending SIGWINCH...")
		syscall.Kill(cmd.Process.Pid, syscall.SIGWINCH)
	}()
	got, err := cmd.CombinedOutput()
	fmt.Printf("output: %q, err: %v\n", got, err)
}
```
I'll see if I can report this upstream. My guess is that something has changed recently that is causing SIGWINCH to get sent to processes on the builders. I'm not sure what that would be, though; AFAIK it is usually just desktop environments that send it.

EDIT: CombinedOutput sets stdin to /dev/null and stdout to a pipe, so the bash equivalent of this is:
Upstream bug: https://sourceware.org/bugzilla/show_bug.cgi?id=26056
Thus far, the only true workaround I've found for this issue is to set GDB's stdin to /dev/tty. However, that's pretty ugly w.r.t. possible interactions with actual terminals, not to mention probably annoying to write such that it works on all platforms (no idea whether Darwin, Windows, etc. have /dev/tty).

On the other hand, why are we getting SIGWINCH in the first place? This test sends SIGWINCH to the entire process group, so if it runs in parallel with the GDB tests and ends up in the same process group (not sure about that?), then it could be signaling GDB. It is a pretty old test, though, and this seems to be a new issue.
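For reference, a minimal sketch of that /dev/tty workaround, assuming a Unix-like system where /dev/tty exists and the process has a controlling terminal; the GDB flags and error handling here are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Give GDB a real terminal on stdin instead of the /dev/null that
	// CombinedOutput would otherwise supply.
	tty, err := os.Open("/dev/tty")
	if err != nil {
		fmt.Println("no controlling terminal:", err)
		return
	}
	defer tty.Close()

	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "show version")
	cmd.Stdin = tty
	out, err := cmd.CombinedOutput()
	fmt.Printf("output: %q, err: %v\n", out, err)
}
```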
Aha! I should read kill(2) more closely: "If pid equals 0, then sig is sent to every process in the process group of the calling process." The originally blamed http://golang.org/cl/232862 has a new test with:
So that is presumably the source of the signals. I'll send a workaround.
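A small illustration of the difference (not the test from CL 232862; the signal and target choices are for demonstration only):

```go
package main

import (
	"os"
	"syscall"
)

func main() {
	// With pid == 0, kill(2) signals every process in the caller's process
	// group, so an unrelated child such as a batch-mode GDB that shares the
	// group receives the signal too.
	_ = syscall.Kill(0, syscall.SIGWINCH)

	// Targeting a specific pid (here, ourselves) avoids signaling siblings.
	_ = syscall.Kill(os.Getpid(), syscall.SIGWINCH)
}
```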
Argh, sorry about that, don't know what I was thinking.
Change https://golang.org/cl/235557 mentions this issue:
Could we run tests that send signals to the process group not in parallel with other tests?
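One way that could look, as a hedged sketch only and not how the runtime tests are actually organized: guard process-group signaling with a package-level lock that signal-sensitive tests (such as the GDB-driving ones) also take, so the two never overlap within a test binary.

```go
package signals

import (
	"sync"
	"syscall"
	"testing"
)

// processGroupSignalMu is a hypothetical guard: any test that signals the
// whole process group holds it, and tests that run signal-sensitive children
// (such as GDB) could hold it too.
var processGroupSignalMu sync.Mutex

func TestProcessGroupSignal(t *testing.T) {
	processGroupSignalMu.Lock()
	defer processGroupSignalMu.Unlock()

	// Only signal the process group while holding the lock.
	if err := syscall.Kill(0, syscall.SIGWINCH); err != nil {
		t.Fatal(err)
	}
}
```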
2020-05-12T15:01:56-a0698a6/linux-amd64-longtest
This is the only occurrence I've seen so far. CC @ianlancetaylor in case it's somehow related to CL 232862, which went in one CL before this failure.