os/signal: TestAtomicStop failing on Illumos #35085

bradfitz · 2019-10-22T20:46:30Z

The newly revived Illumos builder (run by @jclulow now, on different host/OS probably) is now failing with:

https://build.golang.org/log/47c0329f33a7c7bd68dc98c35021160b03a3c6a5

ok  	os	0.809s
ok  	os/exec	0.869s
--- FAIL: TestAtomicStop (2.15s)
    signal_test.go:400: iteration 1: output lost signal on tries: 2
    signal_test.go:408: iteration 1: lost signal
FAIL
FAIL	os/signal	6.493s
ok  	os/user	0.020s
ok  	path	0.025s
FAIL
2019/10/22 14:36:24 Failed: exit status 1

/cc @ianlancetaylor @bcmills

The text was updated successfully, but these errors were encountered:

jclulow · 2019-10-22T21:30:09Z

I was just looking at this. It doesn't, as far as I can tell, always fail -- just most of the time. Having watched the buildlet a bit in the last few days while it was getting going, the load averages in the zone get up pretty high.

I think this is because the -p flag for job parallelism defaults to the number of CPUs, so the test runner starts that many tests in parallel. Because this is a zone (i.e., a kind of container) you can actually see (and schedule on!) every CPU in the host; there are 48 in all. Though all of the CPUs are visible, the zone is capped in the amount of CPU time it is able to use -- in effect, it's able to use 2 CPUs worth of CPU time per second.

If TestAtomicStop is timing sensitive, it's possible that when it fails it's running alongside rather a lot of other tests, and the ~24X over subscription might mean this test isn't on CPU to run soon enough to hit the timing window it's expecting. When I run the test suite on a standalone machine which doesn't have this property (i.e., a small system on my desk where I'm the only user) it seems to pass every time.

bradfitz · 2019-10-22T21:49:43Z

On Linux, runtime.NumCPU() looks at the number of actually usable CPUs.

Perhaps the fix here is just to fix the Illumos implementation of NumCPU to be aware of the zone's CPU limit.

bradfitz · 2019-10-22T22:04:07Z

In fact, I dealt with this just recently in https://go-review.googlesource.com/c/build/+/201637/7/cmd/rundockerbuildlet/rundockerbuildlet.go#162 where I had to restrict a Docker container's --cpuset-cpus value just to make runtime.NumCPU happy, even though it would've been nice to give the containers as much extra CPU as they wanted to fight over.

Perhaps we need a new knob that lets us control test parallelism without hard-coding it to mean NumCPU.

/cc @bcmills

jclulow · 2019-10-22T22:25:27Z

If we could set, say GO_BUILD_MAXJOBS, and have that override the -p default, I could definitely set that in the environment for the buildlet service to a value that makes sense for the specific memory and CPU caps in the particular zone.

bradfitz · 2019-10-22T22:33:43Z

@jclulow, you should probably still fix runtime.NumCPU meanwhile, though, if it's not too much work.

bcmills · 2019-10-23T13:56:19Z

The hard-coded time.After in this test is suspect:

go/src/os/signal/signal_test.go

Lines 449 to 456 in 606019c

    
           // At this point we should either die from SIGINT or 
        
           // get a notification on cs. If neither happens, we 
        
           // dropped the signal. It is given 2 seconds to 
        
           // deliver, as needed for gccgo on some loaded test systems. 
        
           select { 
        
           case <-cs: 
        
           case <-time.After(2 * time.Second):

See previously #29046.

bcmills · 2019-10-23T15:04:59Z

I don't see a compelling need to limit the delay here to 2 seconds, given that in the normal case the test will pass on every attempt.

laboger · 2019-10-23T15:08:06Z

FWIW, we see this failure intermittently on gccgo for both ppc64le & ppc64. The limit was put at 2s to hopefully fix that problem but it still happens. We don't see the error in golang.

bcmills · 2019-10-23T15:08:44Z

If we pull in the fix for #28135, then we could fairly easily plumb the test deadline down to atomicStopTestProgram and remove the hard-coded timeout entirely. I'm not sure whether that would address the problem for gccgo or make the failure mode more frustrating, though.

gopherbot · 2019-10-25T18:15:59Z

Change https://golang.org/cl/203499 mentions this issue: os/signal: derive TestAtomicStop timeout from overall test timeout

Previously, TestAtomicStop used a hard-coded 2-second timeout. That empirically is not long enough on certain builders. Rather than adjusting it to a different arbitrary value, use a slice of the overall timeout for the test binary. If everything is working, we won't block nearly that long anyway. Updates #35085 Change-Id: I7b789388e3152413395088088fc497419976cf5c Reviewed-on: https://go-review.googlesource.com/c/go/+/203499 Run-TryBot: Bryan C. Mills <bcmills@google.com> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

bcmills · 2019-10-25T21:02:24Z

Still failing, but the failures take longer now:

--- FAIL: TestAtomicStop (30.66s)
    signal_test.go:420: iteration 4: output lost signal on tries: 3
    signal_test.go:428: iteration 4: lost signal
    signal_test.go:420: iteration 7: output lost signal on tries: 4
    signal_test.go:428: iteration 7: lost signal
FAIL
FAIL	os/signal	35.166s

Maybe an illumos kernel bug? (Is it dropping signals under load?)

Or is it just overloaded enough that the signals take longer than ~15s each to deliver? (Do we need to set a timeout scale factor on this builder to give it more time?)

ianlancetaylor · 2019-10-26T00:12:55Z

I doubt it is a timeout issue. But if it is, we don't need to be fancy about the timeouts. Just set the timeout to one minute.

jclulow · 2019-10-27T21:25:59Z

@jclulow, you should probably still fix runtime.NumCPU meanwhile, though, if it's not too much work.

I've opened #35199 for this.

jclulow · 2019-10-29T07:10:30Z

It would seem that #35199 hasn't improved things. I tried running the test suite manually on the builder, using make.bash and then run.bash and the tests appear to pass -- including os/signal.

I've set up some basic tracing of signal-related system calls to get a better idea of what happens next time one of these runs and fails on the builder.

jclulow · 2019-11-01T07:26:42Z

This is proving difficult to debug. The only time the test seems to fail is when run under the buildlet at the behest of the coordinator.

I have been trying, but I don't believe I can make it fail when run manually under all.bash or run.bash, or when using go tool dist test to run the os/signal test more directly, or when using curl to attempt to replicate (on an isolated buildlet) what I believe the coordinator is asking us to do. I also tried lifting the atomicStopTestProgram() routine out into a standalone program (trimming out the test harness-related bits) and I don't believe I've been able to observe that failing either, despite running it many thousands of times under different levels of system load.

I've tried using DTrace at various times to monitor the timing of signal-related system calls and delivery events, and it seems that the very slight probe effect is enough to make the situation worse -- but only in the buildlet context where the failures are pretty reliably happening now. Applying the tracing to standalone execution of the test cases doesn't seem to induce the failure there.

I'd appreciate any advice for how folks would normally debug what seems like a pretty tight race in the runtime like this one. Also, as noted in #35261 it seems like the build is currently broken as of the switch to the new timer infrastructure, which is making it a bit difficult to make forward progress.

ianlancetaylor · 2019-11-01T13:59:51Z

Sometimes tests like this can be made to fail locally by putting the local system under heavy load while running the test. I don't have any reliable mechanism, though.

I've looked at this test several times and I've never been able to find anything wrong with the code. It would be really great if you could figure this out. Thanks for looking at it.

bradfitz · 2019-11-01T14:46:51Z

If it matters, the buildlet (the parent process that runs go tool dist test XXXXX) does:

        signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)

Would that affect the test?

ianlancetaylor · 2019-11-01T16:59:15Z

Thanks, but I can't think of any reason why that would matter. The test does its own signal.Notify call and that should be all that matters for it.

jclulow · 2019-11-04T00:12:42Z

I think fixing this is going to take a while. Can I submit a CL that skips it specifically on the current illumos buildlet so that we can at least see the general health of the rest of the test suite more clearly?

jclulow · 2019-11-04T02:15:43Z

OK, I'm pretty sure I know what this is now. The stage0 binary is being run by a simple shell script that just starts it over and over again. That shell script is using bash as the shell. From the SIGNALS section of bash(1):

When job control is not in effect, asynchronous commands ignore SIGINT and SIGQUIT in addition to these inherited handlers.

The buildlet is running under SMF and as with many service supervision systems there is no controlling terminal or job control in that context. The SMF service for the buildlet is operating with the contract lifetime, and expects a command to execute that will terminate once the service has been started. As stage0 does not daemonise itself, we've taken the simple option of running the script that starts stage0 in a loop as buildlet & -- the & makes it an asynchronous command in bash parlance.

Examining the running buildlet, we can see that indeed SIGINT is ignored:

[root@gobuild3 ~]# pgrep -fl buildlet.exe
99253 ./buildlet.exe --halt=false --reverse-type=host-illumos-amd64-jclulow --coordin
[root@gobuild3 ~]# pgrep buildlet.exe | xargs -t psig | grep ignored
psig 99253 
INT     ignored

Looking at exec(2) we see:

Signals set to the default action (SIG_DFL) in the calling process image are set to the default action in the new process image (see signal(3C)). Signals set to be ignored (SIG_IGN) by the calling process image are set to be ignored by the new process image.

I suspect that SIGINT being ignored by default is contrary to the expectations of at least this test, if not the Go signal handling machinery in general. Using a simple C program I can now reproduce the failure:

#include <stdlib.h>
#include <stdint.h>
#include <signal.h>
#include <unistd.h>
#include <err.h>


int
main(int argc, char *argv[])
{
        if (argc < 2) {
                errx(1, "usage: %s PROGRAM [ARGS...]", argv[0]);
        }

        if (argv[1][0] != '/') {
                errx(1, "PROGRAM must be fully qualified");
        }

        /*
         * Ignore SIGINT so that the new process also ignores it.
         */
        if (signal(SIGINT, SIG_IGN) == SIG_ERR) {
                err(1, "signal");
        }

        if (execv(argv[1], argv + 1) != 0) {
                err(1, "execv");
        }
}

$ cd /ws/go/master/src
$ PATH=/ws/go/master/bin:$PATH /ws/go/signals/ignore_sigint $PWD/run.bash -run go_test:os/signal 

##### Building packages and commands.

##### Testing packages.
--- FAIL: TestAtomicStop (16.04s)
    signal_test.go:420: iteration 6: output lost signal on tries: 1
    signal_test.go:428: iteration 6: lost signal
FAIL
FAIL    os/signal       20.574s
FAIL
2019/11/03 18:03:31 Failed: exit status 1

For now, I'm going to try inserting the opposite of this wrapper program into the buildlet script; i.e., resetting the disposition of SIGINT to SIG_DFL prior to calling execv().

jclulow · 2019-11-04T03:48:39Z

OK I've adjusted the SMF service to use the child (aka wait) duration for the service, and we're executing the stage0 binary directly. The wait duration seems a better fit for the way stage0 is intended to work anyway; from svc.startd(1M):

"Wait" model services are restarted whenever the child process associated
with the service exits. A child process that exits is not considered an
error for "wait" model services, and repeated failures do not lead to a
transition to maintenance state. However, a wait service which is
repeatedly exiting with an error that exceeds the default rate (5
failures/second) will be throttled back so that the service only restarts
once per second.

A few test runs have now had a pass from os/signal.

ianlancetaylor · 2019-11-05T00:04:51Z

I don't see why it should matter whether SIGINT is ignored when the process starts. The program run by TestAtomicStop (atomicStopTestProgram) calls Notify(cs, syscall.SIGINT), which should enable the signal even if it were previously ignored.

In general Go is prepared for SIGINT to be ignored when the program starts. This is checked explicitly in sigInstallGoHandler in runtime/signal_unix.go.

ianlancetaylor · 2019-11-05T00:09:08Z

It's fine to send a CL to skip the test on Illumos. Use testenv.SkipFlaky from the internal/testenv package.

jclulow · 2019-11-09T16:58:48Z

I'm pretty sure this problem occurs on Linux too when SIGINT is ignored, though it happens less often. I've lifted atomicStopTestProgram into a separate program:

package main

import (
        "fmt"
        "os"
        "os/signal"
        "sync"
        "syscall"
        "time"
)

// atomicStopTestProgram is run in a subprocess by TestAtomicStop.
// It tries to trigger a signal delivery race. This function should
// either catch a signal or die from it.
func atomicStopTestProgram() {
        status := 0
        const tries = 10

        timeout := 2 * time.Second

        pid := syscall.Getpid()
        printed := false
        for i := 0; i < tries; i++ {
                cs := make(chan os.Signal, 1)
                signal.Notify(cs, syscall.SIGINT)

                var wg sync.WaitGroup
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        signal.Stop(cs)
                }()

                syscall.Kill(pid, syscall.SIGINT)

                // At this point we should either die from SIGINT or
                // get a notification on cs. If neither happens, we
                // dropped the signal. It is given 2 seconds to
                // deliver, as needed for gccgo on some loaded test systems.

                select {
                case <-cs:
                case <-time.After(timeout):
                        status = 1
                        if !printed {
                                fmt.Print("lost signal on tries:")
                                printed = true
                        }
                        fmt.Printf(" %d", i)
                }

                wg.Wait()
        }
        if printed {
                fmt.Print("\n")
        }

        os.Exit(status)
}

func main() {
        atomicStopTestProgram()
}

I created a simple test harness to run the program over and over and report when it emits the error message:

#!/bin/bash

dir=$(cd "$(dirname "$0")" && pwd)
echo $dir/three

fail=0
pass=0
lastprint=$SECONDS

trap 'exit 1' INT

while :; do
        x=$($dir/ignore_sigint $dir/three 2>&1)
        if [[ $x =~ 'lost signal on tries:' ]]; then
                printf '    %s\n' "$x"
                (( fail += 1 ))
        else
                (( pass += 1 ))
        fi

        if (( SECONDS != lastprint )); then
                printf '%s pass %-4d fail %-4d\n' "$(date -u +%FT%TZ)" \
                    "$pass" "$fail"
                pass=0
                fail=0
                lastprint=$SECONDS
        fi
done

The ignore_sigint program is the C program from earlier in the thread which ignores SIGINT before starting the nominated program. The binary three above is the atomicStopTestProgram program.

On an illumos system, this obviously fails pretty readily:

$ uname -a
SunOS newcastle 5.11 illumos-c5c712a873 i86pc i386 i86pc

$ ./harness 
/ws/go/signals/three
    lost signal on tries: 1
2019-11-09T16:52:08Z pass 5    fail 1   
    lost signal on tries: 4
2019-11-09T16:52:10Z pass 0    fail 1   
    lost signal on tries: 1
2019-11-09T16:52:12Z pass 16   fail 1   
    lost signal on tries: 2
2019-11-09T16:52:14Z pass 6    fail 1   
    lost signal on tries: 2
2019-11-09T16:52:16Z pass 14   fail 1   
    lost signal on tries: 2 5
2019-11-09T16:52:20Z pass 1    fail 1   
    lost signal on tries: 2
2019-11-09T16:52:22Z pass 1    fail 1   
^C

But it also fails (though less often) on a Linux system:

$ uname -a
Linux leftstation 5.0.0-31-generic #33-Ubuntu SMP Mon Sep 30 18:51:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

$ ./harness 
/var/tmp/signals/three
2019-11-09T16:49:23Z pass 127  fail 0   
2019-11-09T16:49:24Z pass 138  fail 0   
2019-11-09T16:49:25Z pass 139  fail 0   
2019-11-09T16:49:26Z pass 481  fail 0   
2019-11-09T16:49:27Z pass 665  fail 0   
2019-11-09T16:49:28Z pass 642  fail 0   
2019-11-09T16:49:29Z pass 667  fail 0   
2019-11-09T16:49:30Z pass 664  fail 0   
2019-11-09T16:49:31Z pass 669  fail 0   
2019-11-09T16:49:32Z pass 669  fail 0   
2019-11-09T16:49:33Z pass 645  fail 0   
2019-11-09T16:49:34Z pass 664  fail 0   
2019-11-09T16:49:35Z pass 668  fail 0   
2019-11-09T16:49:36Z pass 663  fail 0   
2019-11-09T16:49:37Z pass 670  fail 0   
2019-11-09T16:49:38Z pass 661  fail 0   
2019-11-09T16:49:39Z pass 667  fail 0   
2019-11-09T16:49:40Z pass 668  fail 0   
2019-11-09T16:49:41Z pass 664  fail 0   
2019-11-09T16:49:42Z pass 671  fail 0   
2019-11-09T16:49:43Z pass 663  fail 0   
2019-11-09T16:49:44Z pass 668  fail 0   
2019-11-09T16:49:45Z pass 669  fail 0   
2019-11-09T16:49:46Z pass 669  fail 0   
2019-11-09T16:49:47Z pass 663  fail 0   
2019-11-09T16:49:48Z pass 670  fail 0   
2019-11-09T16:49:49Z pass 669  fail 0   
2019-11-09T16:49:50Z pass 671  fail 0   
2019-11-09T16:49:51Z pass 650  fail 0   
2019-11-09T16:49:52Z pass 654  fail 0   
2019-11-09T16:49:53Z pass 653  fail 0   
2019-11-09T16:49:54Z pass 640  fail 0   
2019-11-09T16:49:55Z pass 556  fail 0   
2019-11-09T16:49:56Z pass 576  fail 0   
2019-11-09T16:49:57Z pass 535  fail 0   
    lost signal on tries: 2
2019-11-09T16:49:59Z pass 351  fail 1   
2019-11-09T16:50:00Z pass 122  fail 0   
2019-11-09T16:50:01Z pass 500  fail 0   
2019-11-09T16:50:02Z pass 501  fail 0   
2019-11-09T16:50:03Z pass 517  fail 0   
2019-11-09T16:50:04Z pass 626  fail 0   
2019-11-09T16:50:05Z pass 658  fail 0   
2019-11-09T16:50:06Z pass 657  fail 0   
2019-11-09T16:50:07Z pass 659  fail 0   
2019-11-09T16:50:08Z pass 641  fail 0   
2019-11-09T16:50:09Z pass 660  fail 0   
2019-11-09T16:50:10Z pass 623  fail 0   
    lost signal on tries: 1
2019-11-09T16:50:12Z pass 65   fail 1   
2019-11-09T16:50:13Z pass 316  fail 0   
2019-11-09T16:50:14Z pass 480  fail 0   
2019-11-09T16:50:15Z pass 550  fail 0   
2019-11-09T16:50:16Z pass 594  fail 0   
2019-11-09T16:50:17Z pass 578  fail 0   
2019-11-09T16:50:18Z pass 597  fail 0   
2019-11-09T16:50:19Z pass 588  fail 0   
2019-11-09T16:50:20Z pass 626  fail 0   
2019-11-09T16:50:21Z pass 640  fail 0   
2019-11-09T16:50:22Z pass 640  fail 0   
2019-11-09T16:50:23Z pass 603  fail 0   
2019-11-09T16:50:24Z pass 594  fail 0   
2019-11-09T16:50:25Z pass 631  fail 0   
2019-11-09T16:50:26Z pass 638  fail 0   
2019-11-09T16:50:27Z pass 630  fail 0   
2019-11-09T16:50:28Z pass 569  fail 0   
2019-11-09T16:50:29Z pass 634  fail 0   
2019-11-09T16:50:30Z pass 654  fail 0   
^C

ianlancetaylor · 2019-11-14T01:24:56Z

Ah, I see it now. My apologies. When SIGINT is ignored at program start, then when we close the channel we reset the SIGINT handler back to SIG_IGN. If we manage to do that before sending the signal, then it will be ignored.

ianlancetaylor · 2019-11-14T01:25:05Z

Thanks very much for digging into this.

gopherbot · 2019-11-14T01:34:21Z

Change https://golang.org/cl/207081 mentions this issue: os/signal: don't ignore SIGINT in TestAtomicStop child process

bradfitz added help wanted NeedsFix The path to resolution is known, but the work has not been done. OS-illumos labels Oct 22, 2019

bradfitz added this to the Go1.14 milestone Oct 22, 2019

jclulow mentioned this issue Oct 31, 2019

building on illumos gets stuck occasionally #35261

Closed

gopherbot closed this as completed in 3f21c23 Nov 14, 2019

golang locked and limited conversation to collaborators Nov 13, 2020

gopherbot added the FrozenDueToAge label Nov 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

os/signal: TestAtomicStop failing on Illumos #35085

os/signal: TestAtomicStop failing on Illumos #35085

bradfitz commented Oct 22, 2019

jclulow commented Oct 22, 2019

bradfitz commented Oct 22, 2019 •

edited

Loading

bradfitz commented Oct 22, 2019

jclulow commented Oct 22, 2019

bradfitz commented Oct 22, 2019

bcmills commented Oct 23, 2019

bcmills commented Oct 23, 2019

laboger commented Oct 23, 2019

bcmills commented Oct 23, 2019

gopherbot commented Oct 25, 2019

bcmills commented Oct 25, 2019

ianlancetaylor commented Oct 26, 2019

jclulow commented Oct 27, 2019

jclulow commented Oct 29, 2019

jclulow commented Nov 1, 2019

ianlancetaylor commented Nov 1, 2019

bradfitz commented Nov 1, 2019

ianlancetaylor commented Nov 1, 2019

jclulow commented Nov 4, 2019

jclulow commented Nov 4, 2019 •

edited

Loading

jclulow commented Nov 4, 2019

ianlancetaylor commented Nov 5, 2019

ianlancetaylor commented Nov 5, 2019

jclulow commented Nov 9, 2019

ianlancetaylor commented Nov 14, 2019

ianlancetaylor commented Nov 14, 2019

gopherbot commented Nov 14, 2019

os/signal: TestAtomicStop failing on Illumos #35085

os/signal: TestAtomicStop failing on Illumos #35085

Comments

bradfitz commented Oct 22, 2019

jclulow commented Oct 22, 2019

bradfitz commented Oct 22, 2019 • edited Loading

bradfitz commented Oct 22, 2019

jclulow commented Oct 22, 2019

bradfitz commented Oct 22, 2019

bcmills commented Oct 23, 2019

bcmills commented Oct 23, 2019

laboger commented Oct 23, 2019

bcmills commented Oct 23, 2019

gopherbot commented Oct 25, 2019

bcmills commented Oct 25, 2019

ianlancetaylor commented Oct 26, 2019

jclulow commented Oct 27, 2019

jclulow commented Oct 29, 2019

jclulow commented Nov 1, 2019

ianlancetaylor commented Nov 1, 2019

bradfitz commented Nov 1, 2019

ianlancetaylor commented Nov 1, 2019

jclulow commented Nov 4, 2019

jclulow commented Nov 4, 2019 • edited Loading

jclulow commented Nov 4, 2019

ianlancetaylor commented Nov 5, 2019

ianlancetaylor commented Nov 5, 2019

jclulow commented Nov 9, 2019

ianlancetaylor commented Nov 14, 2019

ianlancetaylor commented Nov 14, 2019

gopherbot commented Nov 14, 2019

bradfitz commented Oct 22, 2019 •

edited

Loading

jclulow commented Nov 4, 2019 •

edited

Loading