Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure the child process receives SIGTERM when parent dies on Linux #370

Closed
wants to merge 5 commits into from

Conversation

dio
Copy link
Collaborator

@dio dio commented Sep 8, 2021

This ensures the child dies when the parent dies on Linux, by ensuring
the OS thread associated with the child process is not accidentally
killed by Go runtime.

Signed-off-by: Dhi Aurrahman dio@rockybars.com

@dio
Copy link
Collaborator Author

dio commented Sep 8, 2021

I'm still trying a way to actually test it.

@dio dio changed the title WIP: Ensure the OS thread associated with the child process is not killed Ensure the child process receives SIGTERM when parent dies on Linux Sep 9, 2021
@dio dio marked this pull request as ready for review September 9, 2021 05:51
require.NoError(t, err)

// Make sure child process receives SIGTERM right after the parent process.
require.True(t, c.ModTime().After(p.ModTime()) || c.ModTime().Equal(p.ModTime()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requires nanosecond precision to be strict.

@dio dio requested a review from codefromthecrypt September 9, 2021 05:54
Copy link
Contributor

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for giving this a go. I feel the technical implementation likely works :D I'd like to see more dents kicked out of the code prior to merge. Also, this needs to be tested on all operating systems, not just linux as the code could easily drift on platforms people aren't on at the moment.

I haven't yet thought of a cleaner way to define the function, but I think the current refactoring mentioned will not be throwaway until then either.. hope the early feedback helps!

internal/moreos/proc_darwin.go Outdated Show resolved Hide resolved
internal/moreos/proc_linux_test.go Outdated Show resolved Hide resolved
internal/moreos/proc_linux_test.go Outdated Show resolved Hide resolved
internal/moreos/proc_linux_test.go Outdated Show resolved Hide resolved
internal/moreos/proc_linux_test.go Outdated Show resolved Hide resolved
internal/moreos/proc_linux_test.go Outdated Show resolved Hide resolved
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
dio added 2 commits September 10, 2021 01:04
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
@@ -155,6 +157,68 @@ func Test_EnsureProcessDone(t *testing.T) {
require.NoError(t, EnsureProcessDone(cmd.Process))
}

func Test_EnsureChildProcessDone(t *testing.T) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codefromthecrypt could please help to check if the scenario is correct? Thank you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short of this one is I would expect the test function here to directly invoke fake_func-e (not intermediated though a go run). and fake_func-e would invoke fake_envoy if it could or worst case cat. Then I would expect this test case to wait for the fake_envoy to be ready (this part is easier than using cat as you can re-use our other code and get to a known position in the process). At that point, the test would grab the process tree of fake_func-e which should only have 2 processes itself and fake_envoy. The test would then kill -9 its child (fake_func-e) and then with some assert eventually stuff ensure the corresponding fake_envoy died, too, by using the pid it had stored earlier.

I would not necessarily expect the cmd used to run fake_func-e to use the same process attributes as the go test process running it shouldn't have to be kill -9'ed, and it if did and it leaked a fake_func-e, that's sortof ok. The main thing is to only use the code under test (moreos.XXX stuff) where we are simulating its use.

We have to use compiled code because we can't really do a kill -9 test any other way at least not that I know of. I have a feeling when all this is done it really won't look complicated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ps I keep saying fake envoy, but I sometimes wonder if I should be saying RequireFakeEnvoy Have you seen it?

We already do this and handle compilation etc in a way the result is a bin we start directly. Concretely, I would expect this change to end up with a RequireFakeFuncE which re-uses almost everything, just a different source file. That way maintenance is also easier to reason with as how we deal with fake binaries is consistent

@dio
Copy link
Collaborator Author

dio commented Sep 10, 2021

@codefromthecrypt I'm rebooting this PR. I start with the test first. Please take a look if that's the expected scenario. Apparently, Windows is failing. Will do a check on Windows.

dio added 2 commits September 10, 2021 02:36
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
@@ -140,7 +142,7 @@ func TestProcessGroupAttr_Interrupt(t *testing.T) {

func Test_EnsureProcessDone(t *testing.T) {
// Fork a process that hangs
cmd := exec.Command("cat" + Exe)
cmd := exec.Command("sleep"+Exe, "1000")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using "cat", which seems "blocking"/"hanging", calling cmd.Wait does not block the main thread (hence it is not validating the value of "killing" the process).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please roll this back. this is only testing that calling interrupt causes the process to complete and that calling it when a process is no longer there doesn't panic.

The cmd.Wait is commented as having undefined (platform specific) output. That's why it is not checked, and that is ok. The main thing is reduce flakes and this test hasn't flaked on any of our operating systems.

Let's please not change this variable because it confuses the topics, ex is a change here causing a regression elsewhere. ITOW roll back the replacement of cat with sleep and also rollback the timeout override as our tests have been very stable. If something is now not stable we should have an easier time understanding why as it would be a change here.

@@ -33,7 +35,24 @@ func interrupt(p *os.Process) error {
}

func ensureProcessDone(p *os.Process) error {
if err := p.Kill(); err != nil && err != os.ErrProcessDone {
proc, err := process.NewProcess(int32(p.Pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is exactly why I was concerned with changing the tests here. We are changing code that is not related to the linux pinning thing. If we had to do this we really must do this in a separate change, so we can know if we are actually making things worse over what is a minor process control enhancement request. Notes to follow anyway, though.

// the error here.
children, _ := proc.Children()
for _, child := range children {
if err := child.Kill(); err != nil && err != process.ErrorProcessNotRunning {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not our responsibility to kill the child processes of the process before we kill the process we are killing. We can actually make things worse by doing this. The process we are managing, for example, is envoy.. we definitely don't want to interfere with its process management. The "kill -9" killing children is really an edge case we are asking the OS to manage, and in this case we are asking it to "kill -9" envoy. If envoy doesn't manage downwards that's a separate topic, and not something I would expect us to want to try.

Specifically speaking this logic simply won't be called in kill -9, so definitely shouldn't be in this change. It raises the risk of the change by a lot and there's a lot of test burden here also that even if we wanted to do I doubt anyone would want to refactor. So, stepping back... are we trying to kill envoy's children before envoy? nope.. and since we aren't doing that, I don't think this logic should be in this change at all.

Copy link
Contributor

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm focusing the "review" part on the test itself. I think the key missing here is that unless you remember hard what we are trying to do here, you won't know by looking at the code what is going on.

I'd recommend focusing literally only on what fake func-e would do, which is start fake envoy. That way the test main feels relatable.

Next would be then testing if indeed kill -9 that fake func-e if doing so ends up killing the child due to the process attributes and os thread pinning we setup.

There's no goal for a scenario this feels like, which is manually cleaning up sub-processes of fake-envoy (which there won't be as fake-envoy doesn't launch children). So, in this case actively choose to not solve that.

The main hope is that after the test passes or is known to not pass on particular operating systems, the original desired change could be added, which was to do the os thread pinning.

I hope this helps, my main goal is to ensure the project doesn't spin out of control again, we had so much low quality code and it took us months to get rid of it. Even good code costs us a lot to maintain, so we really owe it to ourselves to not over-solve or solve things that aren't related to the use case or outside our control.

If things feel like they are needed for tests to control the "fake func-e" properly, definitely let's not put those in the main source tree for the same reason. The worst place to add hairy code is the main tree, second worst is test.

Hope these guidelines help and appreciate these areas are somewhat subjective, but we are trying very hard to make func-e prioritize maintainability over all, and have to keep that in mind with every thing we do. Each if statement is burden, each untested loop is also. We can't afford to do things that end up requiring linear work.

)

func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we are using notify context here? kill -9 can't be trapped. This line adds confusion due to this.. If we did use a trapping for another reason (ex ad-hoc testing), we might consider commenting that, and then using CommandContext to start the child, or commenting why we aren't doing that.

func main() {
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
defer stop()
cmd := exec.Command("sleep"+moreos.Exe, "1000")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you choose to use sleep instead of cat, your tests get more complicated because you have to differentiate between the process dying because the magic 1s here vs dying otherwise. All this extra if, etc logic can push this change into the "not worth doing" category. I highly recommend not using sleep (we used to use sleep and switched)

Moreover, if this is supposed to be simulating func-e maybe it should call our fakeenvoy instead?

ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL)
defer stop()
cmd := exec.Command("sleep"+moreos.Exe, "1000")
cmd.SysProcAttr = moreos.ProcessGroupAttr()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the code doesn't look like func-e we have new questions like.. why is this using ProcessGroupAttr? and if it is, and this change is about the being fake-func-e, how about calling the main that, instead of making a main/main.go do like our other thing fake_envoy.go and call this fake_func-e.go?

Copy link
Collaborator Author

@dio dio Sep 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SOrry that missed the fake_envoy thing since I thought we can make it as narrow as possible for testing Pdeathsig, which I thought (in my mind) is a generalized way for making sure killing a parent kills children.

// There are 3 processes spawned by this test target, with the following relationship:
//
// runner (go run testdata/main/main.go)
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)
Copy link
Contributor

@codefromthecrypt codefromthecrypt Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps I will make comments like this on any and all commits, so going to do this here. In all naming and organization things, simplest first. If there are no subdirs needed (ex main/) don't add one. If you have to say X is Y, call it Y. This helps make the cognitive associations fast, and ends up with less lines and words per line to convey the same thing

make sense?

Suggested change
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e)
// └── fake_func-e (the executable, compiled from testdata/fake_func-e.go)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks!

@dio
Copy link
Collaborator Author

dio commented Sep 11, 2021

Thanks, @codefromthecrypt for your time looking at this. I'll redo this. So, to clarify, do you want me to test specifically for Pdeathsig on Linux (according to the issue) and only that? Sorry I was confused since in our discussion it seems (sorry I caught you wrong) we also want to have no special treatment for the goroutine-os-thread-pinning (LockOSThread) since it is not Linux specific.

So for testing, where do you want me to add it? proc_linux_test.go (or moreos_linux_test.go? by spawning fake_func-e that spawns fake_envoy) or extending run_test with kill() instead of ctrlC()? Thanks!

@dio
Copy link
Collaborator Author

dio commented Sep 11, 2021

I pushed another attempt here #371 that specifically tests sending SIGKIL to a fake func-e process that runs envoy as its child and is initialized using the same SysProcAttr as the real func-e (calling moreos.ProcessGroupAttr()).

@dio
Copy link
Collaborator Author

dio commented Sep 11, 2021

Closing this in favor of #371.

@dio dio closed this Sep 11, 2021
@dio dio deleted the cmd-start-linux branch September 14, 2021 01:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants