os/signal: TestStop flaky on DragonFly #25092
findflakes says these failures are new.
Presumably the new failures are somehow due to https://golang.org/cl/108376, which added two new tests that run before TestStop.
You can probably make this fail 1 out of 10 times or so by building the test binary with go test -c and running the resulting signal.test repeatedly.
If that works for you--that is, if the test fails--could you attach the DragonFly equivalent of strace output from a failing run?
Possibly related to the race in #20748?
@ianlancetaylor I tried to run the test you suggested, but where is "signal.test"? Does something generate that?
@bcmills I forgot the -c option, thanks. Attached is the ktrace from a TestStop failure.
Interesting, I can't get this failure to occur if I run this test on a single CPU VM, but it readily happens on a dual CPU VM.
Adding a human-readable format of the ktrace.
The builder is running DragonFly release 5.2.0, but I've confirmed that the flakiness in this test also happens on the most recent development branch. I can't recall for sure, but the appearance of this failure might very well coincide with switching the builder from a single-CPU to a dual-CPU VM a month or so ago.
Thanks for the ktrace output. Unfortunately, I don't see anything unusual in it.
Thanks. If you can give me a description of what this test does, I can run it by the dev team.
I'm not sure that description will help, as most of the activity is in the Go signal handler. That said, the failing test is TestStop in os/signal.
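For readers who want a picture of what TestStop does, here is a simplified sketch of the pattern it exercises. This is illustrative only, not the actual code in os/signal's signal_test.go, and the function name and timeout values are assumptions:

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
	"testing"
	"time"
)

// testStopSketch mimics the shape of TestStop: ask for SIGWINCH with Notify,
// raise the signal and wait for it, then Stop and check that delivery stops.
func testStopSketch(t *testing.T) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGWINCH)

	// Raise SIGWINCH and wait for it to arrive on c. This is the step that
	// times out in the reported failure ("timeout waiting for window size changes").
	syscall.Kill(syscall.Getpid(), syscall.SIGWINCH)
	select {
	case <-c:
		// delivered as expected
	case <-time.After(2 * time.Second):
		t.Fatal("timeout waiting for SIGWINCH")
	}

	// Stop delivery and confirm that a later SIGWINCH is not received.
	signal.Stop(c)
	syscall.Kill(syscall.Getpid(), syscall.SIGWINCH)
	select {
	case s := <-c:
		t.Fatalf("unexpected signal %v after Stop", s)
	case <-time.After(100 * time.Millisecond):
		// no delivery after Stop, as expected
	}
}
```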
I played around some with signal_test.go and discovered that if I increase the timeout in waitSig to 1900 * time.Millisecond, then I can no longer get it to fail by repeatedly running signal.test.
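For reference, the waitSig helper mentioned here is essentially a select with a timeout. A minimal sketch of that shape, assuming a 1-second default timeout (the value this experiment raised toward 1.9 seconds):

```go
package sketch

import (
	"os"
	"testing"
	"time"
)

// waitSig waits for the expected signal on c and fails the test on timeout.
// Raising this timeout toward ~1.9s is what masked the flake, as described above.
func waitSig(t *testing.T, c <-chan os.Signal, sig os.Signal) {
	select {
	case s := <-c:
		if s != sig {
			t.Fatalf("signal was %v, want %v", s, sig)
		}
	case <-time.After(1 * time.Second):
		t.Fatalf("timeout waiting for %v", sig)
	}
}
```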
I know what's happening; see this commit. Note that the default cap on the timeout for umtx_sleep() is 2 seconds, and notice the time in the FAIL: TestStop message: 1.99s. If I increase the sysctl kern.umtx_timeout_max to 3 seconds, then the FAIL message time changes to 2.99s. So the question is what's causing umtx_sleep to hit its max timeout.
Thanks for the tip. I can recreate the problem. As far as I can tell, there is sometimes an unreasonably long period of time between a call to umtx_wakeup and the return of the corresponding umtx_sleep, which is consistent with umtx_sleep hitting its maximum timeout rather than being woken promptly. Both of the forks performed by the new tests appear to be involved.
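To make that timing window concrete: on DragonFly the Go runtime parks and wakes threads with the umtx_sleep and umtx_wakeup system calls, in the usual futex style. The following is only a rough sketch of that pattern, with umtxSleep and umtxWakeup as hypothetical wrappers around the raw system calls rather than the runtime's real code. If the kernel misses the wakeup in the window between the value change and the sleep, the sleeper returns only when the timeout cap (kern.umtx_timeout_max, 2 seconds by default) expires, which lines up with the ~1.99s failures:

```go
package sketch

import "sync/atomic"

// umtxSleep and umtxWakeup are hypothetical wrappers around DragonFly's
// umtx_sleep/umtx_wakeup system calls; they are placeholders for illustration.
func umtxSleep(addr *uint32, val uint32, timeoutMicros int) { /* syscall wrapper (assumed) */ }
func umtxWakeup(addr *uint32, count int)                    { /* syscall wrapper (assumed) */ }

// sleeper blocks while *addr still holds val. umtx_sleep returns immediately
// if *addr != val; otherwise it sleeps until a wakeup arrives or until the
// timeout (capped by kern.umtx_timeout_max) expires.
func sleeper(addr *uint32, val uint32) {
	for atomic.LoadUint32(addr) == val {
		// If the wakeup is lost in the kernel's timing window, this call
		// only returns once the ~2s cap is reached.
		umtxSleep(addr, val, 2000000)
	}
}

// waker publishes a new value and wakes any thread sleeping on addr.
func waker(addr *uint32, newVal uint32) {
	atomic.StoreUint32(addr, newVal)
	umtxWakeup(addr, 1)
}
```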
If either fork is executed in TestDetectNohup, then the test is flaky. When I remove both, it passes every time.
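For context on the forks mentioned here: TestDetectNohup re-executes the test binary as a child process to check whether SIGHUP is ignored, and it does so twice, once directly and once under nohup, which are the two forks in question. Below is a rough sketch of that re-exec pattern; the flag name and the child-side check are assumptions for illustration, not the test's exact code:

```go
package sketch

import (
	"flag"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"testing"
)

// checkSighupIgnored switches the child process into "report whether SIGHUP
// is ignored" mode (flag name assumed for illustration).
var checkSighupIgnored = flag.Bool("check_sighup_ignored", false, "child mode: report whether SIGHUP is ignored")

func testDetectNohupSketch(t *testing.T) {
	if *checkSighupIgnored {
		// Child mode: report whether SIGHUP is currently ignored, then return.
		if !signal.Ignored(syscall.SIGHUP) {
			t.Fatal("SIGHUP is not ignored")
		}
		return
	}
	// Parent mode: fork/exec the test binary again. The real test does this
	// twice, with and without nohup; each run is one of the forks that was
	// making TestStop flaky on DragonFly. Run directly (no nohup), the child
	// is expected to fail because SIGHUP is not ignored.
	out, err := exec.Command(os.Args[0], "-test.run=TestDetectNohup", "-check_sighup_ignored").CombinedOutput()
	if err == nil {
		t.Errorf("child unexpectedly saw SIGHUP as ignored:\n%s", out)
	}
}
```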
It sounds like you've demonstrated that this is a bug in the DragonFly kernel, so I guess we should just skip the test on DragonFly.
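For completeness, skipping the test would be the usual GOOS check, something like the hypothetical sketch below; as the following comments show, this ended up being unnecessary because the kernel was fixed:

```go
package sketch

import (
	"runtime"
	"testing"
)

// testStopSkipSketch shows how TestStop could be skipped on DragonFly; not
// what was ultimately done, since the kernel fix made it unnecessary.
func testStopSkipSketch(t *testing.T) {
	if runtime.GOOS == "dragonfly" {
		t.Skip("skipping on dragonfly; see golang.org/issue/25092")
	}
	// ... rest of the test ...
}
```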
I'm awaiting clarification from the dev team.
DragonFly has found the issue and will fix the timing window in the kernel. I'll be testing a patch soon.
Great, thanks!
The fix has been committed to the master branch, and I've cherry-picked it for the builder.
@tdfbsd Thanks!
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
1.10

Does this issue reproduce with the latest release?
Yes.

What operating system and processor architecture are you using (go env)?
DragonFly BSD amd64
The builder sporadically fails with this message:
```
ok net/url 0.021s
ok os 2.430s
ok os/exec 1.919s
--- FAIL: TestStop (1.98s)
    signal_test.go:32: timeout waiting for window size changes
FAIL
FAIL os/signal 10.420s
ok os/user 0.009s
ok path 0.008s
2018/04/25 07:54:09 Failed: exit status 1
```
If this is indicative of an OS bug, I need to know what this test is doing so I can report it to the OS team.