Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: signal race condition #14571

Closed
Tasssadar opened this issue Feb 29, 2016 · 7 comments
Closed

runtime: signal race condition #14571

Tasssadar opened this issue Feb 29, 2016 · 7 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@Tasssadar
Copy link
Contributor

1. What version of Go are you using (go version)?
2. What operating system and processor architecture are you using (go env)?
go version go1.6 linux/amd64

3. What did you do?
I have summoner process with large number of workers. When disconnected from the server, worker will exit its main loop and call signal.Stop() on the SIGINT it had previously set-up with signal.Notify(). At the same time, summoner will attempt to kill it with proc.Signal(os.Interrupt).

Sometimes, the interrupt signal gets lost - won't arrive to the channel and will not crash the program either. Below is an example I can replicate the race with, run it over and over and you will see some "signal missed" soon:

while true; do go run signaltest.go; done
package main

import (
    "os/signal"
    "os"
    "fmt"
    "time"
)

func connection(disconnected <-chan bool) bool {
    in := make(chan os.Signal, 2)
    signal.Notify(in, os.Interrupt)
    defer signal.Stop(in)

    for {
        select {
        case <-in:
            return true
        case <-disconnected:
            goto check_signal
        }
    }

check_signal:
    signal.Stop(in)

    select {
    case <- in:
        return true
    default:
        return false
    }
}

func killer(disconnected chan<- bool) {
    proc, _ := os.FindProcess(os.Getpid())
    time.Sleep(100*time.Millisecond)
    disconnected <- true
    proc.Signal(os.Interrupt)
}

func main() {
    disconnected := make(chan bool)

    for i := 0; true; i++ {
        fmt.Printf("%d: start\n", i)

        go killer(disconnected)

        if connection(disconnected) {
            fmt.Printf("  %d: got signal\n", i)
        } else {
            fmt.Printf("  %d: missed\n", i)
        }
        time.Sleep(500*time.Millisecond)
    }
}

4. What did you expect to see?
I expected the signal not to get lost. The documentation only states that "When Stop returns, it is guaranteed that c will receive no more signals." , so I'm not sure whether it is the right expectation, if it isn't, then the fix is fairly easy - just keep the signal channel registered the whole time.

5. What did you see instead?
Some of the signals get lost.

@ianlancetaylor ianlancetaylor changed the title Signal race condition runtime: signal race condition Feb 29, 2016
@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Feb 29, 2016
@ianlancetaylor ianlancetaylor self-assigned this Feb 29, 2016
@mdempsky mdempsky modified the milestones: Go1.8, Go1.7 May 19, 2016
@mdempsky
Copy link
Contributor

It seems like the question here is whether os/signal.Stop is guaranteed to be an atomic operation. I.e., if we call Stop(c) and c was the last channel registered for a particular signal, do we guarantee that concurrently receiving that signal will always either send to c or trigger the default signal logic?

My 2c is we should. If we call disableSignal in os/signal.Stop, then we should process any queued signals before deregistering c and returning.

@ianlancetaylor
Copy link
Contributor

After looking at the runtime and os/signal code for a few minutes I can't figure out what we could change. I think we are doing the right thing.

Unix signals are inherently racy. Sending a signal to a process sets a bit. When the process checks that bit, it takes some action. Sending two signals to a process simultaneously just sets that bit twice. If the process doesn't check the bit between the two signal sends, then it will only receive one signal.

I'm not completely sure but I suspect that is what is happening with this program.

@mdempsky
Copy link
Contributor

@ianlancetaylor The race here isn't about receiving two signals and losing one of them. The problem is that if os/signal.Stop grabs handlers.Lock, and then we receive a SIGINT before os/signal.Stop is able to call disableSignal(SIGINT), then we've accepted that signal instead of letting the default OS behavior exuecute. But because we're holding handlers.Lock (and possibly have already deregistered c), we prevent the os/signal.loop goroutine from calling os/signal.process to dispatch the signal to c.

@mdempsky
Copy link
Contributor

Put differently:

  1. The default behavior for SIGINT is for the program to exit.
  2. Calling signal.Notify(c, syscall.SIGINT) changes the program behavior to send a value to c instead of exiting when it receives SIGINT.
  3. Calling signal.Stop(c) causes the behavior to change back to exiting when it receives SIGINT.

(Assuming there are no other relevant signal-impacting calls.)

The problem is the "change back to exiting" is not atomic. In particular, we're logically deregistering c from receiving SIGINT events before we notify the kernel to stop sending them to us, which @tassadar's test program demonstrates via the "missed signal" messages.

Their test program concurrently sends itself SIGINT while calling signal.Stop(c). If SIGINT arrived first, we expect to see "got signal". If SIGINT arrived after, we expect the process to exit (due to SIGINT's default behavior). But occasionally their test program sees neither of these happen (i.e., "missing signal").

@ianlancetaylor
Copy link
Contributor

Thanks for the explanation. Given the current code, it sounds like Stop and process have to coordinate. Also, I suppose, signal_disable needs to call sigdisable before changing sig.wanted.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@rsc rsc modified the milestones: Go1.9, Go1.8 Nov 11, 2016
@aclements
Copy link
Member

Ping @ianlancetaylor

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/46003 mentions this issue.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants