-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure the child process receives SIGTERM when parent dies on Linux #370
Conversation
I'm still trying a way to actually test it. |
internal/moreos/proc_linux_test.go
Outdated
require.NoError(t, err) | ||
|
||
// Make sure child process receives SIGTERM right after the parent process. | ||
require.True(t, c.ModTime().After(p.ModTime()) || c.ModTime().Equal(p.ModTime())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requires nanosecond precision to be strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for giving this a go. I feel the technical implementation likely works :D I'd like to see more dents kicked out of the code prior to merge. Also, this needs to be tested on all operating systems, not just linux as the code could easily drift on platforms people aren't on at the moment.
I haven't yet thought of a cleaner way to define the function, but I think the current refactoring mentioned will not be throwaway until then either.. hope the early feedback helps!
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
cef7467
to
07df31d
Compare
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
@@ -155,6 +157,68 @@ func Test_EnsureProcessDone(t *testing.T) { | |||
require.NoError(t, EnsureProcessDone(cmd.Process)) | |||
} | |||
|
|||
func Test_EnsureChildProcessDone(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codefromthecrypt could please help to check if the scenario is correct? Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short of this one is I would expect the test function here to directly invoke fake_func-e (not intermediated though a go run
). and fake_func-e would invoke fake_envoy if it could or worst case cat
. Then I would expect this test case to wait for the fake_envoy to be ready (this part is easier than using cat as you can re-use our other code and get to a known position in the process). At that point, the test would grab the process tree of fake_func-e which should only have 2 processes itself and fake_envoy. The test would then kill -9 its child (fake_func-e) and then with some assert eventually stuff ensure the corresponding fake_envoy died, too, by using the pid it had stored earlier.
I would not necessarily expect the cmd used to run fake_func-e to use the same process attributes as the go test
process running it shouldn't have to be kill -9'ed, and it if did and it leaked a fake_func-e, that's sortof ok. The main thing is to only use the code under test (moreos.XXX stuff) where we are simulating its use.
We have to use compiled code because we can't really do a kill -9 test any other way at least not that I know of. I have a feeling when all this is done it really won't look complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ps I keep saying fake envoy, but I sometimes wonder if I should be saying RequireFakeEnvoy
Have you seen it?
We already do this and handle compilation etc in a way the result is a bin we start directly. Concretely, I would expect this change to end up with a RequireFakeFuncE
which re-uses almost everything, just a different source file. That way maintenance is also easier to reason with as how we deal with fake binaries is consistent
@codefromthecrypt I'm rebooting this PR. I start with the test first. Please take a look if that's the expected scenario. Apparently, Windows is failing. Will do a check on Windows. |
Signed-off-by: Dhi Aurrahman <dio@rockybars.com>
@@ -140,7 +142,7 @@ func TestProcessGroupAttr_Interrupt(t *testing.T) { | |||
|
|||
func Test_EnsureProcessDone(t *testing.T) { | |||
// Fork a process that hangs | |||
cmd := exec.Command("cat" + Exe) | |||
cmd := exec.Command("sleep"+Exe, "1000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using "cat", which seems "blocking"/"hanging", calling cmd.Wait
does not block the main thread (hence it is not validating the value of "killing" the process).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please roll this back. this is only testing that calling interrupt causes the process to complete and that calling it when a process is no longer there doesn't panic.
The cmd.Wait is commented as having undefined (platform specific) output. That's why it is not checked, and that is ok. The main thing is reduce flakes and this test hasn't flaked on any of our operating systems.
Let's please not change this variable because it confuses the topics, ex is a change here causing a regression elsewhere. ITOW roll back the replacement of cat with sleep and also rollback the timeout override as our tests have been very stable. If something is now not stable we should have an easier time understanding why as it would be a change here.
@@ -33,7 +35,24 @@ func interrupt(p *os.Process) error { | |||
} | |||
|
|||
func ensureProcessDone(p *os.Process) error { | |||
if err := p.Kill(); err != nil && err != os.ErrProcessDone { | |||
proc, err := process.NewProcess(int32(p.Pid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is exactly why I was concerned with changing the tests here. We are changing code that is not related to the linux pinning thing. If we had to do this we really must do this in a separate change, so we can know if we are actually making things worse over what is a minor process control enhancement request. Notes to follow anyway, though.
// the error here. | ||
children, _ := proc.Children() | ||
for _, child := range children { | ||
if err := child.Kill(); err != nil && err != process.ErrorProcessNotRunning { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not our responsibility to kill the child processes of the process before we kill the process we are killing. We can actually make things worse by doing this. The process we are managing, for example, is envoy.. we definitely don't want to interfere with its process management. The "kill -9" killing children is really an edge case we are asking the OS to manage, and in this case we are asking it to "kill -9" envoy. If envoy doesn't manage downwards that's a separate topic, and not something I would expect us to want to try.
Specifically speaking this logic simply won't be called in kill -9, so definitely shouldn't be in this change. It raises the risk of the change by a lot and there's a lot of test burden here also that even if we wanted to do I doubt anyone would want to refactor. So, stepping back... are we trying to kill envoy's children before envoy? nope.. and since we aren't doing that, I don't think this logic should be in this change at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm focusing the "review" part on the test itself. I think the key missing here is that unless you remember hard what we are trying to do here, you won't know by looking at the code what is going on.
I'd recommend focusing literally only on what fake func-e would do, which is start fake envoy. That way the test main feels relatable.
Next would be then testing if indeed kill -9 that fake func-e if doing so ends up killing the child due to the process attributes and os thread pinning we setup.
There's no goal for a scenario this feels like, which is manually cleaning up sub-processes of fake-envoy (which there won't be as fake-envoy doesn't launch children). So, in this case actively choose to not solve that.
The main hope is that after the test passes or is known to not pass on particular operating systems, the original desired change could be added, which was to do the os thread pinning.
I hope this helps, my main goal is to ensure the project doesn't spin out of control again, we had so much low quality code and it took us months to get rid of it. Even good code costs us a lot to maintain, so we really owe it to ourselves to not over-solve or solve things that aren't related to the use case or outside our control.
If things feel like they are needed for tests to control the "fake func-e" properly, definitely let's not put those in the main source tree for the same reason. The worst place to add hairy code is the main tree, second worst is test.
Hope these guidelines help and appreciate these areas are somewhat subjective, but we are trying very hard to make func-e prioritize maintainability over all, and have to keep that in mind with every thing we do. Each if statement is burden, each untested loop is also. We can't afford to do things that end up requiring linear work.
) | ||
|
||
func main() { | ||
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we are using notify context here? kill -9 can't be trapped. This line adds confusion due to this.. If we did use a trapping for another reason (ex ad-hoc testing), we might consider commenting that, and then using CommandContext
to start the child, or commenting why we aren't doing that.
func main() { | ||
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL) | ||
defer stop() | ||
cmd := exec.Command("sleep"+moreos.Exe, "1000") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you choose to use sleep instead of cat, your tests get more complicated because you have to differentiate between the process dying because the magic 1s here vs dying otherwise. All this extra if, etc logic can push this change into the "not worth doing" category. I highly recommend not using sleep (we used to use sleep and switched)
Moreover, if this is supposed to be simulating func-e maybe it should call our fakeenvoy instead?
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGKILL) | ||
defer stop() | ||
cmd := exec.Command("sleep"+moreos.Exe, "1000") | ||
cmd.SysProcAttr = moreos.ProcessGroupAttr() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the code doesn't look like func-e we have new questions like.. why is this using ProcessGroupAttr? and if it is, and this change is about the being fake-func-e, how about calling the main that, instead of making a main/main.go do like our other thing fake_envoy.go
and call this fake_func-e.go
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SOrry that missed the fake_envoy thing since I thought we can make it as narrow as possible for testing Pdeathsig, which I thought (in my mind) is a generalized way for making sure killing a parent kills children.
// There are 3 processes spawned by this test target, with the following relationship: | ||
// | ||
// runner (go run testdata/main/main.go) | ||
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ps I will make comments like this on any and all commits, so going to do this here. In all naming and organization things, simplest first. If there are no subdirs needed (ex main/) don't add one. If you have to say X is Y, call it Y. This helps make the cognitive associations fast, and ends up with less lines and words per line to convey the same thing
make sense?
// └── main (the executable, compiled from testdata/main/main.go. This is a fake func-e) | |
// └── fake_func-e (the executable, compiled from testdata/fake_func-e.go) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks!
Thanks, @codefromthecrypt for your time looking at this. I'll redo this. So, to clarify, do you want me to test specifically for So for testing, where do you want me to add it? proc_linux_test.go (or moreos_linux_test.go? by spawning |
I pushed another attempt here #371 that specifically tests sending SIGKIL to a fake |
Closing this in favor of #371. |
This ensures the child dies when the parent dies on Linux, by ensuring
the OS thread associated with the child process is not accidentally
killed by Go runtime.
Signed-off-by: Dhi Aurrahman dio@rockybars.com