Scrape ps output to try and kill any grandchild processes when stopping a process
#1502
We also had problems with orphaned grandchild processes, and with processes that keep running even after their supervisor child processes have been stopped. Here is a possible solution I want to share, and I am happy to receive any feedback.
Use Case / Context
Stopping a process via the CLI or the supervisor UI should reliably stop the process and all its child processes. If the processes refuse to stop (e.g. they do not react to signals), they need to be forcefully stopped after some timeout. Additionally, only direct children of supervisor are properly stopped. Grandchild processes are not properly stopped and need to be signaled as well.
Current Behavior
Supervisor handles different signals as documented at http://supervisord.org/running.html#signals. When a TERM, INT or QUIT signal is received, supervisor triggers its shutdown behavior. These signals can also be sent to the process group, but not all processes started from a program are necessarily in the same process group, as mentioned here: #199 (comment). Supervisor only knows about the direct child processes that it started, not about any processes that those children spawned.
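To illustrate how grandchildren could be discovered despite this limitation, here is a minimal sketch of the ps-scraping idea from the title. It is not the code in this pull request: the function names parse_ps, descendants and snapshot are hypothetical, and it assumes a ps that accepts `-e -o pid= -o ppid=`.

```python
import subprocess
from collections import defaultdict


def parse_ps(output):
    """Map each parent PID to the list of its direct child PIDs.

    Expects one "PID PPID" pair per line, as printed by
    `ps -e -o pid= -o ppid=` (assumed invocation).
    """
    children = defaultdict(list)
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            pid, ppid = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # skip a header line, if ps printed one
        children[ppid].append(pid)
    return children


def descendants(root, children):
    """Return every PID below `root` (children, grandchildren, ...)."""
    found = []
    seen = {root}
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


def snapshot():
    """Take one snapshot of the process table by running ps."""
    return subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "ppid="], text=True
    )
```

Calling `descendants(child_pid, parse_ps(snapshot()))` would then list the processes to signal in addition to the direct child. The snapshot is inherently racy: a process can fork or exit between the ps call and the signal, so this is best effort only.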
Possible Solution
This stopping behavior still needs some improvements.
Reliably stopping a process and all its child processes also requires checking whether the processes have actually stopped. If they refuse to stop (e.g. they do not react to signals), they need to be forcefully stopped with SIGKILL after some timeout. There are a couple of things to consider:
Think about a good timeout. Systemd uses about 2-3 minutes; for my use cases 60 seconds should be sufficient.
SIGKILL can be very intrusive. For example, killing processes that are in the middle of database operations can cause data loss.
There should be an additional config parameter, such as disable_force_shutdown_behavior, that disables the mechanism of sending SIGKILL to the process and its children.
We should define our usual behavior and allow users to alter it. A sensible default, I would say, is a round of SIGTERM to the children, followed after some time by a SIGKILL to those that are still alive.
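The default described above could look roughly like the following sketch. It is only an illustration under stated assumptions: stop_pids and pid_exists are hypothetical names, the `force` flag stands in for the proposed disable_force_shutdown_behavior option (which does not exist in supervisor today), and the 60-second default is the value suggested above.

```python
import os
import signal
import time


def pid_exists(pid):
    """Best-effort liveness check: signal 0 probes without delivering."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def _signal_all(pids, sig, send):
    for pid in pids:
        try:
            send(pid, sig)
        except ProcessLookupError:
            pass  # already gone, nothing left to signal


def stop_pids(pids, timeout=60.0, force=True, poll=0.5,
              send=os.kill, alive=pid_exists):
    """Send SIGTERM, wait up to `timeout` seconds, then optionally SIGKILL.

    Returns the sorted PIDs that were still alive when the grace period
    ended. `send` and `alive` are injectable so the logic can be
    exercised without real processes.
    """
    remaining = set(pids)
    _signal_all(remaining, signal.SIGTERM, send)
    deadline = time.monotonic() + timeout
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = {pid for pid in remaining if alive(pid)}
    if remaining and force:
        _signal_all(remaining, signal.SIGKILL, send)
    return sorted(remaining)
```

With `force=False` the function only reports the survivors, which matches the idea of letting users opt out of SIGKILL when abrupt termination risks data loss. Note that a terminated process which its parent has not yet reaped (a zombie) still counts as alive for the signal-0 probe.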
Note, however, that a process can sometimes "hang" in the kernel (the so-called D state, see the ps man page). This is rare in a healthy system, but still possible. In those situations, I would say we should report it to the user somehow.
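One possible way to detect such processes, so they can be reported, is to look at the state column of ps. The sketch below is an assumption-laden illustration: uninterruptible_pids and snapshot_states are hypothetical names, and it assumes a ps (such as procps on Linux) that supports the `stat` output field and marks uninterruptible sleep with a leading "D".

```python
import subprocess


def uninterruptible_pids(ps_output):
    """Return PIDs whose state code starts with "D".

    Expects one "PID STAT" pair per line, as printed by
    `ps -e -o pid= -o stat=` (assumed invocation).
    """
    stuck = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].startswith("D"):
            stuck.append(int(fields[0]))
    return stuck


def snapshot_states():
    """Take one snapshot of process states by running ps."""
    return subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "stat="], text=True
    )
```

Any PID that is still alive after SIGKILL and shows up in `uninterruptible_pids(snapshot_states())` could be logged as stuck in the kernel rather than retried indefinitely.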
Related to #1101