Ensure the child process receives SIGTERM when parent dies on Linux #370

dio · 2021-09-08T07:58:20Z

This ensures the child dies when the parent dies on Linux, by ensuring
the OS thread associated with the child process is not accidentally
killed by Go runtime.

Signed-off-by: Dhi Aurrahman dio@rockybars.com

dio · 2021-09-08T07:58:39Z

I'm still trying a way to actually test it.

dio · 2021-09-09T05:53:00Z

internal/moreos/proc_linux_test.go

+	require.NoError(t, err)
+
+	// Make sure child process receives SIGTERM right after the parent process.
+	require.True(t, c.ModTime().After(p.ModTime()) || c.ModTime().Equal(p.ModTime()))


Requires nanosecond precision to be strict.

codefromthecrypt

Thanks for giving this a go. I feel the technical implementation likely works :D I'd like to see more dents kicked out of the code prior to merge. Also, this needs to be tested on all operating systems, not just linux as the code could easily drift on platforms people aren't on at the moment.

I haven't yet thought of a cleaner way to define the function, but I think the current refactoring mentioned will not be throwaway until then either.. hope the early feedback helps!

internal/moreos/proc_darwin.go

internal/moreos/proc_linux_test.go

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

dio · 2021-09-10T08:50:44Z

internal/moreos/moreos_test.go

@@ -155,6 +157,68 @@ func Test_EnsureProcessDone(t *testing.T) {
 	require.NoError(t, EnsureProcessDone(cmd.Process))
 }

+func Test_EnsureChildProcessDone(t *testing.T) {


@codefromthecrypt could please help to check if the scenario is correct? Thank you!

The short of this one is I would expect the test function here to directly invoke fake_func-e (not intermediated though a go run). and fake_func-e would invoke fake_envoy if it could or worst case cat. Then I would expect this test case to wait for the fake_envoy to be ready (this part is easier than using cat as you can re-use our other code and get to a known position in the process). At that point, the test would grab the process tree of fake_func-e which should only have 2 processes itself and fake_envoy. The test would then kill -9 its child (fake_func-e) and then with some assert eventually stuff ensure the corresponding fake_envoy died, too, by using the pid it had stored earlier.

I would not necessarily expect the cmd used to run fake_func-e to use the same process attributes as the go test process running it shouldn't have to be kill -9'ed, and it if did and it leaked a fake_func-e, that's sortof ok. The main thing is to only use the code under test (moreos.XXX stuff) where we are simulating its use.

We have to use compiled code because we can't really do a kill -9 test any other way at least not that I know of. I have a feeling when all this is done it really won't look complicated.

Ps I keep saying fake envoy, but I sometimes wonder if I should be saying RequireFakeEnvoy Have you seen it?

We already do this and handle compilation etc in a way the result is a bin we start directly. Concretely, I would expect this change to end up with a RequireFakeFuncE which re-uses almost everything, just a different source file. That way maintenance is also easier to reason with as how we deal with fake binaries is consistent

dio · 2021-09-10T08:56:04Z

@codefromthecrypt I'm rebooting this PR. I start with the test first. Please take a look if that's the expected scenario. Apparently, Windows is failing. Will do a check on Windows.

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

dio · 2021-09-10T20:42:50Z

internal/moreos/moreos_test.go

@@ -140,7 +142,7 @@ func TestProcessGroupAttr_Interrupt(t *testing.T) {

 func Test_EnsureProcessDone(t *testing.T) {
 	// Fork a process that hangs
-	cmd := exec.Command("cat" + Exe)
+	cmd := exec.Command("sleep"+Exe, "1000")


When using "cat", which seems "blocking"/"hanging", calling cmd.Wait does not block the main thread (hence it is not validating the value of "killing" the process).

please roll this back. this is only testing that calling interrupt causes the process to complete and that calling it when a process is no longer there doesn't panic.

The cmd.Wait is commented as having undefined (platform specific) output. That's why it is not checked, and that is ok. The main thing is reduce flakes and this test hasn't flaked on any of our operating systems.

Let's please not change this variable because it confuses the topics, ex is a change here causing a regression elsewhere. ITOW roll back the replacement of cat with sleep and also rollback the timeout override as our tests have been very stable. If something is now not stable we should have an easier time understanding why as it would be a change here.

codefromthecrypt · 2021-09-10T21:50:21Z

internal/moreos/proc_darwin.go

@@ -33,7 +35,24 @@ func interrupt(p *os.Process) error {
 }

 func ensureProcessDone(p *os.Process) error {
-	if err := p.Kill(); err != nil && err != os.ErrProcessDone {
+	proc, err := process.NewProcess(int32(p.Pid))


this is exactly why I was concerned with changing the tests here. We are changing code that is not related to the linux pinning thing. If we had to do this we really must do this in a separate change, so we can know if we are actually making things worse over what is a minor process control enhancement request. Notes to follow anyway, though.

codefromthecrypt · 2021-09-10T21:56:16Z

internal/moreos/proc_darwin.go

+	// the error here.
+	children, _ := proc.Children()
+	for _, child := range children {
+		if err := child.Kill(); err != nil && err != process.ErrorProcessNotRunning {


It is not our responsibility to kill the child processes of the process before we kill the process we are killing. We can actually make things worse by doing this. The process we are managing, for example, is envoy.. we definitely don't want to interfere with its process management. The "kill -9" killing children is really an edge case we are asking the OS to manage, and in this case we are asking it to "kill -9" envoy. If envoy doesn't manage downwards that's a separate topic, and not something I would expect us to want to try.

Specifically speaking this logic simply won't be called in kill -9, so definitely shouldn't be in this change. It raises the risk of the change by a lot and there's a lot of test burden here also that even if we wanted to do I doubt anyone would want to refactor. So, stepping back... are we trying to kill envoy's children before envoy? nope.. and since we aren't doing that, I don't think this logic should be in this change at all.

codefromthecrypt

I'm focusing the "review" part on the test itself. I think the key missing here is that unless you remember hard what we are trying to do here, you won't know by looking at the code what is going on.

I'd recommend focusing literally only on what fake func-e would do, which is start fake envoy. That way the test main feels relatable.

Next would be then testing if indeed kill -9 that fake func-e if doing so ends up killing the child due to the process attributes and os thread pinning we setup.

There's no goal for a scenario this feels like, which is manually cleaning up sub-processes of fake-envoy (which there won't be as fake-envoy doesn't launch children). So, in this case actively choose to not solve that.

The main hope is that after the test passes or is known to not pass on particular operating systems, the original desired change could be added, which was to do the os thread pinning.

I hope this helps, my main goal is to ensure the project doesn't spin out of control again, we had so much low quality code and it took us months to get rid of it. Even good code costs us a lot to maintain, so we really owe it to ourselves to not over-solve or solve things that aren't related to the use case or outside our control.

If things feel like they are needed for tests to control the "fake func-e" properly, definitely let's not put those in the main source tree for the same reason. The worst place to add hairy code is the main tree, second worst is test.

Hope these guidelines help and appreciate these areas are somewhat subjective, but we are trying very hard to make func-e prioritize maintainability over all, and have to keep that in mind with every thing we do. Each if statement is burden, each untested loop is also. We can't afford to do things that end up requiring linear work.

codefromthecrypt · 2021-09-10T21:58:00Z

internal/moreos/testdata/main/main.go

+)
+
+func main() {
+	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)


I don't understand why we are using notify context here? kill -9 can't be trapped. This line adds confusion due to this.. If we did use a trapping for another reason (ex ad-hoc testing), we might consider commenting that, and then using CommandContext to start the child, or commenting why we aren't doing that.

codefromthecrypt · 2021-09-10T22:02:38Z

internal/moreos/testdata/main/main.go

+func main() {
+	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
+	defer stop()
+	cmd := exec.Command("sleep"+moreos.Exe, "1000")


if you choose to use sleep instead of cat, your tests get more complicated because you have to differentiate between the process dying because the magic 1s here vs dying otherwise. All this extra if, etc logic can push this change into the "not worth doing" category. I highly recommend not using sleep (we used to use sleep and switched)

Moreover, if this is supposed to be simulating func-e maybe it should call our fakeenvoy instead?

codefromthecrypt · 2021-09-10T22:07:19Z

internal/moreos/testdata/main/main.go

+	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
+	defer stop()
+	cmd := exec.Command("sleep"+moreos.Exe, "1000")
+	cmd.SysProcAttr = moreos.ProcessGroupAttr()


when the code doesn't look like func-e we have new questions like.. why is this using ProcessGroupAttr? and if it is, and this change is about the being fake-func-e, how about calling the main that, instead of making a main/main.go do like our other thing fake_envoy.go and call this fake_func-e.go?

SOrry that missed the fake_envoy thing since I thought we can make it as narrow as possible for testing Pdeathsig, which I thought (in my mind) is a generalized way for making sure killing a parent kills children.

codefromthecrypt · 2021-09-10T22:41:07Z

internal/moreos/moreos_test.go

+	// There are 3 processes spawned by this test target, with the following relationship:
+	//
+	// runner (go run testdata/main/main.go)
+	// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)


ps I will make comments like this on any and all commits, so going to do this here. In all naming and organization things, simplest first. If there are no subdirs needed (ex main/) don't add one. If you have to say X is Y, call it Y. This helps make the cognitive associations fast, and ends up with less lines and words per line to convey the same thing

make sense?

Suggested change

// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)

// └── fake_func-e (the executable, compiled from testdata/fake_func-e.go)

Got it. Thanks!

dio · 2021-09-11T02:55:09Z

Thanks, @codefromthecrypt for your time looking at this. I'll redo this. So, to clarify, do you want me to test specifically for Pdeathsig on Linux (according to the issue) and only that? Sorry I was confused since in our discussion it seems (sorry I caught you wrong) we also want to have no special treatment for the goroutine-os-thread-pinning (LockOSThread) since it is not Linux specific.

So for testing, where do you want me to add it? proc_linux_test.go (or moreos_linux_test.go? by spawning fake_func-e that spawns fake_envoy) or extending run_test with kill() instead of ctrlC()? Thanks!

dio · 2021-09-11T08:55:14Z

I pushed another attempt here #371 that specifically tests sending SIGKIL to a fake func-e process that runs envoy as its child and is initialized using the same SysProcAttr as the real func-e (calling moreos.ProcessGroupAttr()).

dio · 2021-09-11T09:35:48Z

Closing this in favor of #371.

dio changed the title ~~WIP: Ensure the OS thread associated with the child process is not killed~~ Ensure the child process receives SIGTERM when parent dies on Linux Sep 9, 2021

dio marked this pull request as ready for review September 9, 2021 05:51

dio commented Sep 9, 2021

View reviewed changes

dio requested a review from codefromthecrypt September 9, 2021 05:54

codefromthecrypt reviewed Sep 9, 2021

View reviewed changes

Add test for ensuring the child process is dead

07df31d

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

dio force-pushed the cmd-start-linux branch from cef7467 to 07df31d Compare September 10, 2021 06:44

dio added 2 commits September 10, 2021 01:04

Fix macOS

6a21fc8

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

Fix macOS again

e663d28

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

dio commented Sep 10, 2021

View reviewed changes

dio added 2 commits September 10, 2021 02:36

Fix again

5a3db03

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

Simplify pid checking

087f6bf

Signed-off-by: Dhi Aurrahman <dio@rockybars.com>

dio commented Sep 10, 2021

View reviewed changes

codefromthecrypt reviewed Sep 10, 2021

View reviewed changes

dio closed this Sep 11, 2021

dio deleted the cmd-start-linux branch September 14, 2021 01:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure the child process receives SIGTERM when parent dies on Linux #370

Ensure the child process receives SIGTERM when parent dies on Linux #370

dio commented Sep 8, 2021

dio commented Sep 8, 2021 •

edited

Loading

dio Sep 9, 2021

codefromthecrypt left a comment

dio Sep 10, 2021

codefromthecrypt Sep 10, 2021

codefromthecrypt Sep 10, 2021

dio commented Sep 10, 2021 •

edited

Loading

dio Sep 10, 2021

codefromthecrypt Sep 10, 2021

codefromthecrypt Sep 10, 2021

codefromthecrypt Sep 10, 2021

codefromthecrypt left a comment

codefromthecrypt Sep 10, 2021

codefromthecrypt Sep 10, 2021

codefromthecrypt Sep 10, 2021

dio Sep 11, 2021 •

edited

Loading

codefromthecrypt Sep 10, 2021 •

edited

Loading

dio Sep 11, 2021

dio commented Sep 11, 2021 •

edited

Loading

dio commented Sep 11, 2021

dio commented Sep 11, 2021

	// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)
	// └── fake_func-e (the executable, compiled from testdata/fake_func-e.go)

Ensure the child process receives SIGTERM when parent dies on Linux #370

Ensure the child process receives SIGTERM when parent dies on Linux #370

Conversation

dio commented Sep 8, 2021

dio commented Sep 8, 2021 • edited Loading

Choose a reason for hiding this comment

codefromthecrypt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dio commented Sep 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codefromthecrypt left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dio Sep 11, 2021 • edited Loading

Choose a reason for hiding this comment

codefromthecrypt Sep 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dio commented Sep 11, 2021 • edited Loading

dio commented Sep 11, 2021

dio commented Sep 11, 2021

dio commented Sep 8, 2021 •

edited

Loading

dio commented Sep 10, 2021 •

edited

Loading

dio Sep 11, 2021 •

edited

Loading

codefromthecrypt Sep 10, 2021 •

edited

Loading

dio commented Sep 11, 2021 •

edited

Loading