Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure the child process receives SIGTERM when parent dies on Linux #370

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ build: $(current_binary) ## Build the func-e binary
test_packages := . ./internal/...
test: ## Run all unit tests
@printf "$(ansi_format_dark)" test "running unit tests"
@go test . $(test_packages)
@go test -timeout 30s . $(test_packages)
@printf "$(ansi_format_bright)" test "ok"

coverage: ## Generate test coverage
Expand Down
65 changes: 63 additions & 2 deletions internal/moreos/moreos_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
package moreos

import (
"bufio"
"bytes"
"errors"
"fmt"
Expand All @@ -23,6 +24,7 @@ import (
"path/filepath"
"runtime"
"testing"
"time"

"github.com/shirou/gopsutil/v3/process"
"github.com/stretchr/testify/require"
Expand Down Expand Up @@ -120,7 +122,7 @@ func TestSprintf_IdiomaticPerOS(t *testing.T) {

func TestProcessGroupAttr_Interrupt(t *testing.T) {
// Fork a process that hangs
cmd := exec.Command("cat" + Exe)
cmd := exec.Command("sleep"+Exe, "1000")
cmd.SysProcAttr = ProcessGroupAttr()
require.NoError(t, cmd.Start())

Expand All @@ -140,7 +142,7 @@ func TestProcessGroupAttr_Interrupt(t *testing.T) {

func Test_EnsureProcessDone(t *testing.T) {
// Fork a process that hangs
cmd := exec.Command("cat" + Exe)
cmd := exec.Command("sleep"+Exe, "1000")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using "cat", which seems "blocking"/"hanging", calling cmd.Wait does not block the main thread (hence it is not validating the value of "killing" the process).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please roll this back. this is only testing that calling interrupt causes the process to complete and that calling it when a process is no longer there doesn't panic.

The cmd.Wait is commented as having undefined (platform specific) output. That's why it is not checked, and that is ok. The main thing is reduce flakes and this test hasn't flaked on any of our operating systems.

Let's please not change this variable because it confuses the topics, ex is a change here causing a regression elsewhere. ITOW roll back the replacement of cat with sleep and also rollback the timeout override as our tests have been very stable. If something is now not stable we should have an easier time understanding why as it would be a change here.

cmd.SysProcAttr = ProcessGroupAttr()
require.NoError(t, cmd.Start())

Expand All @@ -155,6 +157,65 @@ func Test_EnsureProcessDone(t *testing.T) {
require.NoError(t, EnsureProcessDone(cmd.Process))
}

func Test_EnsureChildProcessDone(t *testing.T) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codefromthecrypt could please help to check if the scenario is correct? Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short of this one is I would expect the test function here to directly invoke fake_func-e (not intermediated though a go run). and fake_func-e would invoke fake_envoy if it could or worst case cat. Then I would expect this test case to wait for the fake_envoy to be ready (this part is easier than using cat as you can re-use our other code and get to a known position in the process). At that point, the test would grab the process tree of fake_func-e which should only have 2 processes itself and fake_envoy. The test would then kill -9 its child (fake_func-e) and then with some assert eventually stuff ensure the corresponding fake_envoy died, too, by using the pid it had stored earlier.

I would not necessarily expect the cmd used to run fake_func-e to use the same process attributes as the go test process running it shouldn't have to be kill -9'ed, and it if did and it leaked a fake_func-e, that's sortof ok. The main thing is to only use the code under test (moreos.XXX stuff) where we are simulating its use.

We have to use compiled code because we can't really do a kill -9 test any other way at least not that I know of. I have a feeling when all this is done it really won't look complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ps I keep saying fake envoy, but I sometimes wonder if I should be saying RequireFakeEnvoy Have you seen it?

We already do this and handle compilation etc in a way the result is a bin we start directly. Concretely, I would expect this change to end up with a RequireFakeFuncE which re-uses almost everything, just a different source file. That way maintenance is also easier to reason with as how we deal with fake binaries is consistent

// There are 3 processes spawned by this test target, with the following relationship:
//
// runner (go run testdata/main/main.go)
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)
Copy link
Contributor

@codefromthecrypt codefromthecrypt Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps I will make comments like this on any and all commits, so going to do this here. In all naming and organization things, simplest first. If there are no subdirs needed (ex main/) don't add one. If you have to say X is Y, call it Y. This helps make the cognitive associations fast, and ends up with less lines and words per line to convey the same thing

make sense?

Suggested change
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)
// └── fake_func-e (the executable, compiled from testdata/fake_func-e.go)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks!

// └── child (sleep, this is a fake envoy)
cmd := exec.Command("go"+Exe, "run", filepath.Join("testdata", "main", "main.go"))
stdout, err := cmd.StdoutPipe()
require.NoError(t, err)
cmd.Start()

buf := bufio.NewReader(stdout)
for {
line, _, readErr := buf.ReadLine()
require.NoError(t, readErr)
// As soon as we detect output from the main program, we know the child process in main program
// is started. See: testdata/main/main.go#34.
if string(line) == "ok" {
break
}
}

// This is the "go run" process.
runner, err := process.NewProcess(int32(cmd.Process.Pid))
require.NoError(t, err)

parents, err := runner.Children()
require.NoError(t, err)
require.Equal(t, len(parents), 1)

// This is the main process executed by "go run".
main, err := process.NewProcess(parents[0].Pid)
require.NoError(t, err)

children, err := main.Children()
require.NoError(t, err)
require.Equal(t, len(children), 1)

// Kill the main process.
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(main.Pid)}))

// Wait and check the status of child and main processes.
// The following check is here to make sure the child and main processes are not killed because
// of the runner is died.
time.Sleep(time.Millisecond)
child := children[0]
require.Error(t, findProcess(&os.Process{Pid: int(child.Pid)}))
require.Error(t, findProcess(&os.Process{Pid: int(main.Pid)}))

// The runner needs to be killed too.
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(runner.Pid)}))
cmd.Wait() //nolint

require.Error(t, findProcess(&os.Process{Pid: int(runner.Pid)}))

// Ensure killing it again doesn't error.
require.NoError(t, EnsureProcessDone(&os.Process{Pid: int(main.Pid)}))
}

func findProcess(proc *os.Process) error {
_, err := process.NewProcess(int32(proc.Pid)) // because os.FindProcess is no-op in Linux!
return err
Expand Down
21 changes: 20 additions & 1 deletion internal/moreos/proc_darwin.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,8 @@ package moreos
import (
"os"
"syscall"

"github.com/shirou/gopsutil/v3/process"
)

const exe = ""
Expand All @@ -33,7 +35,24 @@ func interrupt(p *os.Process) error {
}

func ensureProcessDone(p *os.Process) error {
if err := p.Kill(); err != nil && err != os.ErrProcessDone {
proc, err := process.NewProcess(int32(p.Pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly why I was concerned with changing the tests here. We are changing code that is not related to the linux pinning thing. If we had to do this we really must do this in a separate change, so we can know if we are actually making things worse over what is a minor process control enhancement request. Notes to follow anyway, though.

if err != nil {
if err == process.ErrorProcessNotRunning {
return nil
}
return err
}

// on macOS, when a process doesn't have children, pgrep gives "exit status 1", hence we ignore
// the error here.
children, _ := proc.Children()
for _, child := range children {
if err := child.Kill(); err != nil && err != process.ErrorProcessNotRunning {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not our responsibility to kill the child processes of the process before we kill the process we are killing. We can actually make things worse by doing this. The process we are managing, for example, is envoy.. we definitely don't want to interfere with its process management. The "kill -9" killing children is really an edge case we are asking the OS to manage, and in this case we are asking it to "kill -9" envoy. If envoy doesn't manage downwards that's a separate topic, and not something I would expect us to want to try.

Specifically speaking this logic simply won't be called in kill -9, so definitely shouldn't be in this change. It raises the risk of the change by a lot and there's a lot of test burden here also that even if we wanted to do I doubt anyone would want to refactor. So, stepping back... are we trying to kill envoy's children before envoy? nope.. and since we aren't doing that, I don't think this logic should be in this change at all.

return err
}
}

if err := proc.Kill(); err != nil && err != os.ErrProcessDone {
return err
}
return nil
Expand Down
37 changes: 37 additions & 0 deletions internal/moreos/testdata/main/main.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
// Copyright 2021 Tetrate
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
"context"
"fmt"
"os"
"os/exec"
"os/signal"
"syscall"

"github.com/tetratelabs/func-e/internal/moreos"
)

func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are using notify context here? kill -9 can't be trapped. This line adds confusion due to this.. If we did use a trapping for another reason (ex ad-hoc testing), we might consider commenting that, and then using CommandContext to start the child, or commenting why we aren't doing that.

defer stop()
cmd := exec.Command("sleep"+moreos.Exe, "1000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you choose to use sleep instead of cat, your tests get more complicated because you have to differentiate between the process dying because the magic 1s here vs dying otherwise. All this extra if, etc logic can push this change into the "not worth doing" category. I highly recommend not using sleep (we used to use sleep and switched)

Moreover, if this is supposed to be simulating func-e maybe it should call our fakeenvoy instead?

cmd.SysProcAttr = moreos.ProcessGroupAttr()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the code doesn't look like func-e we have new questions like.. why is this using ProcessGroupAttr? and if it is, and this change is about the being fake-func-e, how about calling the main that, instead of making a main/main.go do like our other thing fake_envoy.go and call this fake_func-e.go?

Copy link
Collaborator Author

@dio dio Sep 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOrry that missed the fake_envoy thing since I thought we can make it as narrow as possible for testing Pdeathsig, which I thought (in my mind) is a generalized way for making sure killing a parent kills children.

cmd.Start()
fmt.Fprintln(os.Stdout, "ok")
cmd.Wait()
<-ctx.Done()
}