Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: on unix Process.Kill() can kill the wrong process #13987

Open
ibukanov opened this issue Jan 17, 2016 · 23 comments
Open

os: on unix Process.Kill() can kill the wrong process #13987

ibukanov opened this issue Jan 17, 2016 · 23 comments

Comments

@ibukanov
Copy link

This is somewhat similar to the issue #9382. On Unix Process.Kill() just calls Process.Signal(Kill). As https://golang.org/src/os/exec_unix.go#L39 indicates, the Signal function invokes syscall.Kill(p.Pid) outside any lock after checking if the process still runs. Thus at the point when the signal is called the process could finish and Process.Wait() from another thread return. Thus OS is free to reuse the pid for another unrelated process. If this happens, Process.Kill() kills that process.

Due to this race it is impossible to write a correct platform-independent code in Go that kills the process if it does not terminate after a pause. I.e. the code fragment from #9382 is not correct on Unix:

func spawnAndKill(exePath string, counter int) error {
    cmd := exec.Command(exePath, fmt.Sprintf("%d", counter))
    err := cmd.Start()
    if err != nil {
        return err
    }
    go func() {
        time.Sleep(1000 * time.Millisecond)
        cmd.Process.Kill()
    }()
    cmd.Wait()
    return nil
}
@minux
Copy link
Member

minux commented Jan 17, 2016 via email

@ibukanov
Copy link
Author

@minux On Windows the above code is not racy and works correctly due to fixes that went into #9382. On Linux I can also implement the same semantics correctly via waiting for SIGCHLD and using non-blocking wait4 syscall. That allows to synchronize with kill attempts. However, this is non-portable.

What I would like to see is an option to write a portable Go code implementing the kill-if-not-finished-after-timeout pattern. If this requires new API, so be it, but at least document loudly that calling Process.Kill and Process.Wait from different threads at the same time could race and should be avoided in portable code.

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Jan 18, 2016
@minux
Copy link
Member

minux commented Jan 18, 2016 via email

@ianlancetaylor
Copy link
Contributor

I think we can do it without a race, and without an extra goroutine, if we add something to the runtime package to count SIGCHLD signals, and provide a way for a goroutine to retrieve the current SIGCHLD count and a way to sleep until the SIGCHLD counter is greater than N.

@ibukanov
Copy link
Author

@minux The point is precisely not to wait to timeout as typically the process finishes much faster. In my case the timeout is really for runaway cases when something bad happens or the system is too busy. I also cannot just start another go routine that does kill/wait as I need to get the process return status to decide if everything is OK.

@ianlancetaylor Note I do not suggest to change the current implementation. Rather I suggest to document loudly that one should avoid Wait()/Kill() race in portable code. Then add some new API that allows to implement the desired functionality. For example, a version of exec.Cmd.Wait that takes a channel and waits until the child finishes or there is input on the channel is enough to implement all my use cases.

@ianlancetaylor
Copy link
Contributor

If we can change the existing API so that it avoids a race, I think that is clearly better than introducing a new API.

@minux
Copy link
Member

minux commented Jan 20, 2016

I still think the user code should be able to avoid this race
as much as possible.

For example, something like this should be pretty safe
without any complications in runtime and it also have
the benefit that as soon as the process exits, the timeout
goroutine will exit soon, rather than wait until the timer
expires.

func spawnAndKill(exePath string, counter int) error {
cmd := exec.Command(exePath, fmt.Sprintf("%d", counter))
err := cmd.Start()
if err != nil {
return err
}
ch := make(chan bool)
go func() {
select {
case <-time.After(1000 * time.Millisecond):
cmd.Process.Kill()
case <-ch:
return
}
}()
cmd.Wait()
close(ch)
return nil
}

@ibukanov
Copy link
Author

The above code just minimizes the chances of race. It is still there. As it is impossible to make its probability arbitrary small, one should avoid it, say, in production code.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/23967 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 10, 2016
On systems that support the POSIX.1-2008 waitid function, we can use it
to block until a wait will succeed. This avoids a possible race
condition: if a program calls p.Kill/p.Signal and p.Wait from two
different goroutines, then it is possible for the wait to complete just
before the signal is sent. In that case, it is possible that the system
will start a new process using the same PID between the wait and the
signal, causing the signal to be sent to the wrong process. The
Process.isdone field attempts to avoid that race, but there is a small
gap of time between when wait returns and isdone is set when the race
can occur.

This CL avoids that race by using waitid to wait until the process has
exited without actually collecting the PID. Then it sets isdone, then
waits for any active signals to complete, and only then collects the PID.

No test because any plausible test would require starting enough
processes to recycle all the process IDs.

Update #13987.
Update #16028.

Change-Id: Id2939431991d3b355dfb22f08793585fc0568ce8
Reviewed-on: https://go-review.googlesource.com/23967
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@ianlancetaylor
Copy link
Contributor

I believe this race is now fixed on GNU/Linux.

Leaving the issue open as the fix will have to be implemented on other Unix systems. Systems that provide waitid (which should be most of them, since it is in POSIX.1-2008) need to implement that; it's slightly painful since it's not in the syscall package.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/24021 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/24020 mentions this issue.

gopherbot pushed a commit that referenced this issue Jun 13, 2016
This change is a followup to https://go-review.googlesource.com/23967
for Darwin.

Updates #13987.
Updates #16028.

Change-Id: Ib1fb9f957fafd0f91da6fceea56620e29ad82b00
Reviewed-on: https://go-review.googlesource.com/24020
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Jun 13, 2016
This change is a followup to https://go-review.googlesource.com/23967
for FreeBSD.

Updates #13987.
Updates #16028.

Change-Id: I0f0737372fce6df89d090fe9847305749b79eb4c
Reviewed-on: https://go-review.googlesource.com/24021
Reviewed-by: Ian Lance Taylor <iant@golang.org>
tklauser added a commit to tklauser/go that referenced this issue May 1, 2021
Follow CL 23967 and CL 24021 which did the same on linux and freebsd,
respectively.

Updates golang#13987
Updates golang#16028

Change-Id: Ia30ef8b5cffd8f9eb75c29ee5fe350dac2be6d44
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/315279 mentions this issue: os, syscall: use wait6 to avoid wait/kill race on dragonfly

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/315281 mentions this issue: os, syscall: use wait6 to avoid wait/kill race on netbsd

gopherbot pushed a commit that referenced this issue May 2, 2021
Follow CL 23967 and CL 24021 which did the same on linux and freebsd,
respectively.

Updates #13987
Updates #16028

Change-Id: Ia30ef8b5cffd8f9eb75c29ee5fe350dac2be6d44
Reviewed-on: https://go-review.googlesource.com/c/go/+/315279
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@tklauser tklauser removed the OS-NetBSD label May 4, 2021
gopherbot pushed a commit that referenced this issue May 4, 2021
Follow CL 23967 and CL 24021 which did the same on linux and freebsd,
respectively.

Updates #13987
Updates #16028

Change-Id: I95b13d8ddde4cea1ef4fb7d655f1ad1a219d13aa
Reviewed-on: https://go-review.googlesource.com/c/go/+/315281
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/323489 mentions this issue: doc/go1.17: mention SYS_WAIT6/WEXITED on NetBSD

gopherbot pushed a commit that referenced this issue Jun 2, 2021
For #13987
For #16028
For #44513

Change-Id: I7a73446fcc80a01fa6de24eec1e5b993e543be37
Reviewed-on: https://go-review.googlesource.com/c/go/+/323489
Trust: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/354249 mentions this issue: os: don't use wait6 on netbsd

gopherbot pushed a commit that referenced this issue Oct 7, 2021
CL 315281 changed the os package use wait6 on netbsd. This seems to be
causing frequent test failures as reported in #48789. Revert that change
using wait6 on netbsd for now.

Updates #13987
Updates #16028
For #48789

Change-Id: Ieddffc65611c7f449971eaa8ed6f4299a5f742c2
Reviewed-on: https://go-review.googlesource.com/c/go/+/354249
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Trust: Bryan C. Mills <bcmills@google.com>
Trust: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/431855 mentions this issue: os: use wait6 to avoid wait/kill race on netbsd

gopherbot pushed a commit that referenced this issue Sep 19, 2022
Resend of CL 315281 which was partially reverted by CL 354249 after the
original CL was suspected to cause test failures as reported in #48789.
It seems that both wait4 and wait6 lead to that particular deadlock, so
let's use wait6. That way we at least don't hit #13987 on netbsd.

Updates #13987
For #48789
For #50138

Change-Id: Iadc4a771217b7e9e821502e89afa07036e0dcb6f
Reviewed-on: https://go-review.googlesource.com/c/go/+/431855
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/442478 mentions this issue: os: use the correct constant for P_PID on NetBSD

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/442575 mentions this issue: os: split wait6 syscall wrapper into per-platform files

gopherbot pushed a commit that referenced this issue Oct 12, 2022
Dragonfly and FreeBSD both used numerical values for these constants
chosen to be the same as on Solaris. For some reason, NetBSD did not,
and happens to interpret value 0 as P_ALL instead of P_PID
(see https://github.com/NetBSD/src/blob/3323ceb7822f98b3d2693aa26fd55c4ded6d8ba4/sys/sys/idtype.h#L43-L44).

Using the correct value for P_PID should cause wait6 to wait for the
correct process, which may help to avoid the deadlocks reported in

For #50138.
Updates #13987.

Change-Id: I0eacd1faee4a430d431fe48f9ccf837f49c42f39
Reviewed-on: https://go-review.googlesource.com/c/go/+/442478
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
gopherbot pushed a commit that referenced this issue Oct 13, 2022
There are getting to be enough special cases in this wrapper that
the increase in clarity from having a single file is starting to be
outweighed by the complexity from chained conditionals.

Updates #50138.
Updates #13987.

Change-Id: If4f1be19c0344e249aa6092507c28363ca6c8438
Reviewed-on: https://go-review.googlesource.com/c/go/+/442575
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
Dragonfly and FreeBSD both used numerical values for these constants
chosen to be the same as on Solaris. For some reason, NetBSD did not,
and happens to interpret value 0 as P_ALL instead of P_PID
(see https://github.com/NetBSD/src/blob/3323ceb7822f98b3d2693aa26fd55c4ded6d8ba4/sys/sys/idtype.h#L43-L44).

Using the correct value for P_PID should cause wait6 to wait for the
correct process, which may help to avoid the deadlocks reported in

For golang#50138.
Updates golang#13987.

Change-Id: I0eacd1faee4a430d431fe48f9ccf837f49c42f39
Reviewed-on: https://go-review.googlesource.com/c/go/+/442478
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
romaindoumenc pushed a commit to TroutSoftware/go that referenced this issue Nov 3, 2022
There are getting to be enough special cases in this wrapper that
the increase in clarity from having a single file is starting to be
outweighed by the complexity from chained conditionals.

Updates golang#50138.
Updates golang#13987.

Change-Id: If4f1be19c0344e249aa6092507c28363ca6c8438
Reviewed-on: https://go-review.googlesource.com/c/go/+/442575
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528439 mentions this issue: os: use PidfdSendSignal on linux if possible

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528798 mentions this issue: os: make use of pidfd on linux

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/528438 mentions this issue: os: make use of pidfd on linux

gopherbot pushed a commit that referenced this issue Feb 21, 2024
Use Process.handle field to store pidfd, and make use of it. Only use
pidfd functionality if all the needed syscalls are available.

1. StartProcess: obtain the pidfd from the kernel, if available,
   using the functionality added by CL 520266. Note we could not modify
   syscall.StartProcess to return pidfd directly because it is a public
   API and its callers do not expect it, so we have to use ensurePidfd
   and getPidfd.

2. (*Process).Kill: use pidfdSendSignal, if the syscall is available
   and pidfd is known. This is slightly more complicated than it should
   be, since the syscall can be blocked by e.g. seccomp security policy,
   therefore the need for a function to check if it's actually working,
   and a soft fallback to kill. Perhaps this precaution is not really
   needed.

3. (*Process).Wait: use pidfdWait, if available, otherwise fall back to
   using waitid/wait4. This is also more complicated than expected due
   to struct siginfo_t idiosyncrasy.

NOTE pidfdSendSignal and pidfdWait are used without a race workaround
(blockUntilWaitable and sigMu, added by CL 23967) because with pidfd,
PID recycle issue doesn't exist (IOW, pidfd, unlike PID, is guaranteed
to refer to one particular process) and thus the race doesn't exist
either.

For #62654.
Updates #13987.

Change-Id: I22ebcc7142b16a3a94c422d2f32504d1a80e8a8f
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Reviewed-on: https://go-review.googlesource.com/c/go/+/528438
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/570036 mentions this issue: os: make use of pidfd on linux

gopherbot pushed a commit that referenced this issue May 17, 2024
Use Process.handle field to store pidfd, and make use of it. Only use
pidfd functionality if all the needed syscalls are available.

1. Add/use pidfdWorks, which checks that all needed pidfd-related
   functionality works.

2. os.StartProcess: obtain the pidfd from the kernel, if possible, using
   the functionality added by CL 520266. Note we could not modify
   syscall.StartProcess to return pidfd directly because it is a public
   API and its callers do not expect it, so we have to use ensurePidfd
   and getPidfd.

3. (*Process).Kill: use pidfdSendSignal, if available and the pidfd is
   known. Otherwise, fall back to the old implementation.

4. (*Process).Wait: use pidfdWait, if available, otherwise fall back to
   using waitid/wait4. This is more complicated than expected due to
   struct siginfo_t idiosyncrasy.

NOTE pidfdSendSignal and pidfdWait are used without a race workaround
(blockUntilWaitable and sigMu, added by CL 23967) because with pidfd,
PID recycle issue doesn't exist (IOW, pidfd, unlike PID, is guaranteed
to refer to one particular process) and thus the race doesn't exist
either.

Rework of CL 528438 (reverted in CL 566477 because of #65857).

For #62654.
Updates #13987.

Change-Id: If5ef8920bd8619dc428b6282ffe4fb8c258ca224
Reviewed-on: https://go-review.googlesource.com/c/go/+/570036
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Cherry Mui <cherryyz@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants