-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure the child process receives SIGTERM when parent dies on Linux #370
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -15,6 +15,7 @@ | |||||
package moreos | ||||||
|
||||||
import ( | ||||||
"bufio" | ||||||
"bytes" | ||||||
"errors" | ||||||
"fmt" | ||||||
|
@@ -23,6 +24,7 @@ import ( | |||||
"path/filepath" | ||||||
"runtime" | ||||||
"testing" | ||||||
"time" | ||||||
|
||||||
"github.com/shirou/gopsutil/v3/process" | ||||||
"github.com/stretchr/testify/require" | ||||||
|
@@ -120,7 +122,7 @@ func TestSprintf_IdiomaticPerOS(t *testing.T) { | |||||
|
||||||
func TestProcessGroupAttr_Interrupt(t *testing.T) { | ||||||
// Fork a process that hangs | ||||||
cmd := exec.Command("cat" + Exe) | ||||||
cmd := exec.Command("sleep"+Exe, "1000") | ||||||
cmd.SysProcAttr = ProcessGroupAttr() | ||||||
require.NoError(t, cmd.Start()) | ||||||
|
||||||
|
@@ -140,7 +142,7 @@ func TestProcessGroupAttr_Interrupt(t *testing.T) { | |||||
|
||||||
func Test_EnsureProcessDone(t *testing.T) { | ||||||
// Fork a process that hangs | ||||||
cmd := exec.Command("cat" + Exe) | ||||||
cmd := exec.Command("sleep"+Exe, "1000") | ||||||
cmd.SysProcAttr = ProcessGroupAttr() | ||||||
require.NoError(t, cmd.Start()) | ||||||
|
||||||
|
@@ -155,6 +157,65 @@ func Test_EnsureProcessDone(t *testing.T) { | |||||
require.NoError(t, EnsureProcessDone(cmd.Process)) | ||||||
} | ||||||
|
||||||
func Test_EnsureChildProcessDone(t *testing.T) { | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @codefromthecrypt could please help to check if the scenario is correct? Thank you! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The short of this one is I would expect the test function here to directly invoke fake_func-e (not intermediated though a I would not necessarily expect the cmd used to run fake_func-e to use the same process attributes as the We have to use compiled code because we can't really do a kill -9 test any other way at least not that I know of. I have a feeling when all this is done it really won't look complicated. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ps I keep saying fake envoy, but I sometimes wonder if I should be saying We already do this and handle compilation etc in a way the result is a bin we start directly. Concretely, I would expect this change to end up with a |
||||||
// There are 3 processes spawned by this test target, with the following relationship: | ||||||
// | ||||||
// runner (go run testdata/main/main.go) | ||||||
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ps I will make comments like this on any and all commits, so going to do this here. In all naming and organization things, simplest first. If there are no subdirs needed (ex main/) don't add one. If you have to say X is Y, call it Y. This helps make the cognitive associations fast, and ends up with less lines and words per line to convey the same thing make sense?
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Got it. Thanks! |
||||||
// └── child (sleep, this is a fake envoy) | ||||||
cmd := exec.Command("go"+Exe, "run", filepath.Join("testdata", "main", "main.go")) | ||||||
stdout, err := cmd.StdoutPipe() | ||||||
require.NoError(t, err) | ||||||
cmd.Start() | ||||||
|
||||||
buf := bufio.NewReader(stdout) | ||||||
for { | ||||||
line, _, readErr := buf.ReadLine() | ||||||
require.NoError(t, readErr) | ||||||
// As soon as we detect output from the main program, we know the child process in main program | ||||||
// is started. See: testdata/main/main.go#34. | ||||||
if string(line) == "ok" { | ||||||
break | ||||||
} | ||||||
} | ||||||
|
||||||
// This is the "go run" process. | ||||||
runner, err := process.NewProcess(int32(cmd.Process.Pid)) | ||||||
require.NoError(t, err) | ||||||
|
||||||
parents, err := runner.Children() | ||||||
require.NoError(t, err) | ||||||
require.Equal(t, len(parents), 1) | ||||||
|
||||||
// This is the main process executed by "go run". | ||||||
main, err := process.NewProcess(parents[0].Pid) | ||||||
require.NoError(t, err) | ||||||
|
||||||
children, err := main.Children() | ||||||
require.NoError(t, err) | ||||||
require.Equal(t, len(children), 1) | ||||||
|
||||||
// Kill the main process. | ||||||
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(main.Pid)})) | ||||||
|
||||||
// Wait and check the status of child and main processes. | ||||||
// The following check is here to make sure the child and main processes are not killed because | ||||||
// of the runner is died. | ||||||
time.Sleep(time.Millisecond) | ||||||
child := children[0] | ||||||
require.Error(t, findProcess(&os.Process{Pid: int(child.Pid)})) | ||||||
require.Error(t, findProcess(&os.Process{Pid: int(main.Pid)})) | ||||||
|
||||||
// The runner needs to be killed too. | ||||||
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(runner.Pid)})) | ||||||
cmd.Wait() //nolint | ||||||
|
||||||
require.Error(t, findProcess(&os.Process{Pid: int(runner.Pid)})) | ||||||
|
||||||
// Ensure killing it again doesn't error. | ||||||
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(main.Pid)})) | ||||||
} | ||||||
|
||||||
func findProcess(proc *os.Process) error { | ||||||
_, err := process.NewProcess(int32(proc.Pid)) // because os.FindProcess is no-op in Linux! | ||||||
return err | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -17,6 +17,8 @@ package moreos | |
import ( | ||
"os" | ||
"syscall" | ||
|
||
"github.com/shirou/gopsutil/v3/process" | ||
) | ||
|
||
const exe = "" | ||
|
@@ -33,7 +35,24 @@ func interrupt(p *os.Process) error { | |
} | ||
|
||
func ensureProcessDone(p *os.Process) error { | ||
if err := p.Kill(); err != nil && err != os.ErrProcessDone { | ||
proc, err := process.NewProcess(int32(p.Pid)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is exactly why I was concerned with changing the tests here. We are changing code that is not related to the linux pinning thing. If we had to do this we really must do this in a separate change, so we can know if we are actually making things worse over what is a minor process control enhancement request. Notes to follow anyway, though. |
||
if err != nil { | ||
if err == process.ErrorProcessNotRunning { | ||
return nil | ||
} | ||
return err | ||
} | ||
|
||
// on macOS, when a process doesn't have children, pgrep gives "exit status 1", hence we ignore | ||
// the error here. | ||
children, _ := proc.Children() | ||
for _, child := range children { | ||
if err := child.Kill(); err != nil && err != process.ErrorProcessNotRunning { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is not our responsibility to kill the child processes of the process before we kill the process we are killing. We can actually make things worse by doing this. The process we are managing, for example, is envoy.. we definitely don't want to interfere with its process management. The "kill -9" killing children is really an edge case we are asking the OS to manage, and in this case we are asking it to "kill -9" envoy. If envoy doesn't manage downwards that's a separate topic, and not something I would expect us to want to try. Specifically speaking this logic simply won't be called in kill -9, so definitely shouldn't be in this change. It raises the risk of the change by a lot and there's a lot of test burden here also that even if we wanted to do I doubt anyone would want to refactor. So, stepping back... are we trying to kill envoy's children before envoy? nope.. and since we aren't doing that, I don't think this logic should be in this change at all. |
||
return err | ||
} | ||
} | ||
|
||
if err := proc.Kill(); err != nil && err != os.ErrProcessDone { | ||
return err | ||
} | ||
return nil | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
// Copyright 2021 Tetrate | ||
// | ||
// Licensed under the Apache License, Version 2.0 (the "License"); | ||
// you may not use this file except in compliance with the License. | ||
// You may obtain a copy of the License at | ||
// | ||
// http://www.apache.org/licenses/LICENSE-2.0 | ||
// | ||
// Unless required by applicable law or agreed to in writing, software | ||
// distributed under the License is distributed on an "AS IS" BASIS, | ||
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
// See the License for the specific language governing permissions and | ||
// limitations under the License. | ||
|
||
package main | ||
|
||
import ( | ||
"context" | ||
"fmt" | ||
"os" | ||
"os/exec" | ||
"os/signal" | ||
"syscall" | ||
|
||
"github.com/tetratelabs/func-e/internal/moreos" | ||
) | ||
|
||
func main() { | ||
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't understand why we are using notify context here? kill -9 can't be trapped. This line adds confusion due to this.. If we did use a trapping for another reason (ex ad-hoc testing), we might consider commenting that, and then using |
||
defer stop() | ||
cmd := exec.Command("sleep"+moreos.Exe, "1000") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if you choose to use sleep instead of cat, your tests get more complicated because you have to differentiate between the process dying because the magic 1s here vs dying otherwise. All this extra if, etc logic can push this change into the "not worth doing" category. I highly recommend not using sleep (we used to use sleep and switched) Moreover, if this is supposed to be simulating func-e maybe it should call our fakeenvoy instead? |
||
cmd.SysProcAttr = moreos.ProcessGroupAttr() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. when the code doesn't look like func-e we have new questions like.. why is this using ProcessGroupAttr? and if it is, and this change is about the being fake-func-e, how about calling the main that, instead of making a main/main.go do like our other thing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. SOrry that missed the fake_envoy thing since I thought we can make it as narrow as possible for testing Pdeathsig, which I thought (in my mind) is a generalized way for making sure killing a parent kills children. |
||
cmd.Start() | ||
fmt.Fprintln(os.Stdout, "ok") | ||
cmd.Wait() | ||
<-ctx.Done() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using "cat", which seems "blocking"/"hanging", calling
cmd.Wait
does not block the main thread (hence it is not validating the value of "killing" the process).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please roll this back. this is only testing that calling interrupt causes the process to complete and that calling it when a process is no longer there doesn't panic.
The cmd.Wait is commented as having undefined (platform specific) output. That's why it is not checked, and that is ok. The main thing is reduce flakes and this test hasn't flaked on any of our operating systems.
Let's please not change this variable because it confuses the topics, ex is a change here causing a regression elsewhere. ITOW roll back the replacement of cat with sleep and also rollback the timeout override as our tests have been very stable. If something is now not stable we should have an easier time understanding why as it would be a change here.