runtime: "unexpected result from waitpid" in TestGdbPython #39021
I also got the same problem here: https://storage.googleapis.com/go-build-log/3d4facca/linux-amd64-staticlockranking_9d45d3a8.log It may be more likely when static lock ranking is enabled (because it is a timing issue?). I am also getting it on some local staticlockranking runs.
2020-05-12T19:15:34-cb11c98/linux-mips64le-mengzhuo
This is looking like a recent regression.
That one on
There is also an
Attempting to bisect now, but I'm not hopeful because of the existing flakiness of the test (#24616).
I'm unable to reproduce the failures at all on my local workstation using GDB 8.3.1 and
I'm not sure how to proceed from here.
/cc @ianlancetaylor
I've gotten exactly one local repro so far. I'm not sure what conditions are required to trigger it; system load seems to be a factor. However, the only repro I have gotten was without a test filter, which is even more difficult to use for bisection due to the presence of other flaky tests.
If this is due to EINTR, might pummeling the process with signals help?
An interesting idea, but it doesn't seem to help.
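For context, a rough sketch of what that kind of experiment might look like: pummel a child process with a normally-harmless signal while waiting on it, hoping to make EINTR-sensitive paths fail more often. This is not the actual command tried; the signal choice, timing, and stand-in child are all guesses, and how effective it is depends on the target's signal handling.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	// Start a stand-in child process to pummel with signals.
	cmd := exec.Command("sleep", "5")
	if err := cmd.Start(); err != nil {
		fmt.Println("start:", err)
		return
	}

	stop := make(chan struct{})
	go func() {
		for {
			select {
			case <-stop:
				return
			default:
				// SIGCONT is normally harmless to a running process; it is
				// used here only to generate signal traffic against the child.
				_ = syscall.Kill(cmd.Process.Pid, syscall.SIGCONT)
				time.Sleep(time.Millisecond)
			}
		}
	}()

	err := cmd.Wait()
	close(stop)
	fmt.Println("child exited:", err)
}
```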
@bcmills Just a friendly ping to check on the status of this as we work through our beta blocking issues.
I don't know how to fix this and the tests are so nondeterministic that I can't figure out how to even bisect it. I plan to file a proposal (later today?) to relocate these tests so that they are not run as part of
CL 232862 does seem like a likely candidate based on the CLs that landed around the time this issue started. Perhaps we should temporarily revert it and see if the problem disappears, given that bisecting appears fruitless?
Change https://golang.org/cl/235282 mentions this issue:
After ~10 runs of go test runtime, I managed to repro this with golang.org/cl/235282 on an s390 gomote:
So GDB is segfaulting... I'll see if I can manage to find the core file.
It seems like this must be some kind of GDB bug, though perhaps one we are tickling somehow.
Looking at the GDB source code,
Just to be sure, has the GDB binary on the builder been changed recently, e.g. upgraded from one version to another?
GDB 7.12 was released in 2016. Should we try a newer version of GDB on the builders?
FWIW, the crashes on the s390 builder are on GDB 8.2, which is a bit newer (2018). Digging into the #39021 (comment) crash a bit more:
Thus it seems it should be possible to trigger this crash by sending a SIGWINCH to a batch-mode GDB, though I haven't managed to do so.
@chlunde noted the same
All GDB tests currently ignore non-zero exit statuses. When a test flakes, we don't even know whether GDB exited successfully or not. Add checks for non-zero exits, which are not expected. Furthermore, always log the output from GDB. The tests are currently inconsistent about whether they always log, or only on error.

Updates #39021

Change-Id: I7af1d795fc2fdf58093cb2731d616d4aa44e9996
Reviewed-on: https://go-review.googlesource.com/c/go/+/235282
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
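For illustration, a minimal sketch of the kind of check described above; the test name and GDB invocation here are hypothetical and this is not the actual CL 235282 code:

```go
package main_test

import (
	"os/exec"
	"testing"
)

// TestGdbExitStatus sketches the idea: always log GDB's output, and treat a
// non-zero exit status from GDB itself as a test failure.
func TestGdbExitStatus(t *testing.T) {
	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "show version")
	got, err := cmd.CombinedOutput()
	t.Logf("gdb output:\n%s", got) // always log, not only on failure
	if err != nil {
		t.Fatalf("gdb exited with error: %v", err)
	}
}
```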
I've dumped all of the core files on the s390 builder, and they all have the same stack trace as #39021 (comment)
Though it remains to be seen whether all of the GDB failures are SEGVs with core files. Now that https://golang.org/cl/235282 is in, we should be getting more information on new failures.
Two new segfaults: 2020-05-27T19:54:12-0d20a49/linux-amd64-staticlockranking

This is trivially reproducible with my system GDB with:

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "run", "--args", "sleep", "60")
	go func() {
		// XXX: This isn't safe! cmd.Process may not be set yet; the sleep is
		// just a crude way to wait for CombinedOutput below to start GDB.
		time.Sleep(1 * time.Second)
		fmt.Println("Sending SIGWINCH...")
		syscall.Kill(cmd.Process.Pid, syscall.SIGWINCH)
	}()
	got, err := cmd.CombinedOutput()
	fmt.Printf("output: %q, err: %v\n", got, err)
}
```
I'll see if I can report this upstream. My guess is that something has changed recently that is causing SIGWINCH to get sent to processes on the builders. I'm not sure what that would be, though; AFAIK it is usually just desktop environments that send it.

EDIT: CombinedOutput sets stdin to /dev/null and stdout to a pipe, so the bash equivalent of this is:
Upstream bug: https://sourceware.org/bugzilla/show_bug.cgi?id=26056
Thus far, the only true workaround I've found for this issue is to set GDB's stdin to /dev/tty. However, that's pretty ugly w.r.t. possible interactions with actual terminals, not to mention probably annoying to write such that it works on all platforms (no idea whether Darwin, Windows, etc. have /dev/tty).

On the other hand, why are we getting SIGWINCH in the first place? This test sends SIGWINCH to the entire process group, so if it runs in parallel with the GDB tests and ends up in the same process group (not sure about that?), then it could be signaling GDB. It is a pretty old test, though, and this seems to be a new issue.
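For reference, a minimal sketch of that /dev/tty workaround, assuming a Unix-like system where /dev/tty exists and the process has a controlling terminal; the GDB flags and error handling here are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Give GDB a real terminal on stdin instead of the /dev/null that
	// CombinedOutput would otherwise supply.
	tty, err := os.Open("/dev/tty")
	if err != nil {
		fmt.Println("no controlling terminal:", err)
		return
	}
	defer tty.Close()

	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "show version")
	cmd.Stdin = tty
	out, err := cmd.CombinedOutput()
	fmt.Printf("output: %q, err: %v\n", out, err)
}
```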
Aha! I should read kill(2) more closely: "If pid equals 0, then sig is sent to every process in the process group of the calling process." The originally blamed http://golang.org/cl/232862 has a new test with:
So that is presumably the source of the signals. I'll send a workaround.
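A small illustration of the difference (not the test from CL 232862; the signal and target choices are for demonstration only):

```go
package main

import (
	"os"
	"syscall"
)

func main() {
	// With pid == 0, kill(2) signals every process in the caller's process
	// group, so an unrelated child such as a batch-mode GDB that shares the
	// group receives the signal too.
	_ = syscall.Kill(0, syscall.SIGWINCH)

	// Targeting a specific pid (here, ourselves) avoids signaling siblings.
	_ = syscall.Kill(os.Getpid(), syscall.SIGWINCH)
}
```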
Argh, sorry about that, don't know what I was thinking.
Change https://golang.org/cl/235557 mentions this issue:
Could we run tests that send signals to the process group not in parallel with other tests?
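One way that could look, as a hedged sketch only and not how the runtime tests are actually organized: guard process-group signaling with a package-level lock that signal-sensitive tests (such as the GDB-driving ones) also take, so the two never overlap within a test binary.

```go
package signals

import (
	"sync"
	"syscall"
	"testing"
)

// processGroupSignalMu is a hypothetical guard: any test that signals the
// whole process group holds it, and tests that run signal-sensitive children
// (such as GDB) could hold it too.
var processGroupSignalMu sync.Mutex

func TestProcessGroupSignal(t *testing.T) {
	processGroupSignalMu.Lock()
	defer processGroupSignalMu.Unlock()

	// Only signal the process group while holding the lock.
	if err := syscall.Kill(0, syscall.SIGWINCH); err != nil {
		t.Fatal(err)
	}
}
```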
2020-05-12T15:01:56-a0698a6/linux-amd64-longtest
This is the only occurrence I've seen so far. CC @ianlancetaylor in case it's somehow related to CL 232862, which went in one CL before this failure.