agent: Make the agent the subreaper of all processes #178
sboeuf
commented
Dec 5, 2017
Fixes #177
386b651
to
52a8fa9
Compare
Cannot be merged because of a libcontainer issue: opencontainers/runc#1677
52a8fa9
to
336e6e1
Compare
opencontainers/runc#1677 has been fixed by opencontainers/runc#1678, which has been merged!
f1dd79c
to
c36a44a
Compare
@jodh-intel @grahamwhaley @jcvenegas @sameo @amshinde
I wonder if runC has the same issue
Yes, but basically once the workload ends, all processes are killed. I guess you want to handle the case where the workload never ends and its children spawn and leave processes running.
lgtm
@devimc, |
reaper.go
Outdated
	sync.RWMutex

	chansLock     sync.RWMutex
	exitCodeChans map[int]chan int
Can you add a comment explaining this structure?
Sure I can.
reaper.go
Outdated
	agentLog.Infof("SIGCHLD pid %d, status %d", pid, status)

	exitCodeCh, err := r.getExitCodeCh(pid)
	if err != nil {
Can you make it clear in the comments that we are only interested in the exit code of the container and exec processes, and that we ignore the exit code from children and grandchildren (just reaping them)?
Yes, good point, I'll try to add more comments!
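To illustrate the point raised in this review thread, here is a minimal sketch of how a reaper could track exit code channels only for container and exec processes, while any other reaped pid is simply dropped. Only the field names `chansLock` and `exitCodeChans` come from the diff above; `newReaper`, `register` and `deliver` are hypothetical helpers written for this illustration and are not the agent's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// reaper is a simplified stand-in for the structure under review.
// exitCodeChans maps the pid of a container or exec process to the
// channel on which its exit code will be delivered.
type reaper struct {
	chansLock     sync.RWMutex
	exitCodeChans map[int]chan int
}

// newReaper is a hypothetical constructor.
func newReaper() *reaper {
	return &reaper{exitCodeChans: make(map[int]chan int)}
}

// register (hypothetical) creates the exit code channel for a container
// or exec process. The channel is buffered so that deliver never blocks.
func (r *reaper) register(pid int) chan int {
	r.chansLock.Lock()
	defer r.chansLock.Unlock()

	ch := make(chan int, 1)
	r.exitCodeChans[pid] = ch
	return ch
}

// deliver (hypothetical) is meant to be called once a pid has been reaped.
// It reports whether somebody was interested in that pid's exit code.
func (r *reaper) deliver(pid, status int) bool {
	r.chansLock.Lock()
	ch, ok := r.exitCodeChans[pid]
	delete(r.exitCodeChans, pid)
	r.chansLock.Unlock()

	if !ok {
		// Child or grandchild of a container process: it has been
		// reaped, and its exit code is deliberately ignored.
		return false
	}
	ch <- status
	return true
}

func main() {
	r := newReaper()
	ch := r.register(100) // pretend pid 100 is a container process

	fmt.Println("descendant delivered:", r.deliver(200, 1))
	fmt.Println("container delivered:", r.deliver(100, 0))
	fmt.Println("container exit code:", <-ch)
}
```

The channel entry is removed on delivery so that a later process reusing the same pid is not mistaken for the original one.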
agent.go
Outdated
@@ -856,12 +895,36 @@ func (p *pod) runContainerProcess(cid, pid string, terminal bool, started chan e

	fieldLogger := agentLog.WithField("container-pid", pid)

	// This lock is very important to avoid any race with reaper.reap().
	// Indeed, if we don't lock this here, we could potentially get the
	// SIGCHLD signal before the channel has been created, meaning we will
Can you mention the exit code channel to make this clearer?
Ok.
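The race described in the code comment under review can be sketched as follows. This is an illustrative simulation under stated assumptions, not the agent's implementation: `startAndRegister` and `onExit` are hypothetical names, and the process start is faked with a closure. The idea shown is that the lock is held from before the process starts until its exit code channel exists, so the SIGCHLD path cannot look the channel up too early.

```go
package main

import (
	"fmt"
	"sync"
)

type reaper struct {
	chansLock     sync.RWMutex
	exitCodeChans map[int]chan int
}

func newReaper() *reaper {
	return &reaper{exitCodeChans: make(map[int]chan int)}
}

// startAndRegister (hypothetical) holds the write lock across both the
// process start and the creation of its exit code channel.
func (r *reaper) startAndRegister(start func() (int, error)) (<-chan int, error) {
	r.chansLock.Lock()
	defer r.chansLock.Unlock()

	pid, err := start()
	if err != nil {
		return nil, err
	}

	ch := make(chan int, 1)
	r.exitCodeChans[pid] = ch
	return ch, nil
}

// onExit (hypothetical) stands in for the SIGCHLD handling path. It blocks
// on the lock until any registration in progress has completed.
func (r *reaper) onExit(pid, status int) bool {
	r.chansLock.RLock()
	ch, ok := r.exitCodeChans[pid]
	r.chansLock.RUnlock()

	if ok {
		ch <- status
	}
	return ok
}

func main() {
	r := newReaper()
	const fakePid = 1234

	ch, err := r.startAndRegister(func() (int, error) {
		// Simulate a process that exits immediately: the exit
		// notification is raised before the channel is created.
		go r.onExit(fakePid, 7)
		return fakePid, nil
	})
	if err != nil {
		fmt.Println("start failed:", err)
		return
	}

	fmt.Println("exit code:", <-ch)
}
```

Without the lock in `startAndRegister`, `onExit` could run before the map entry exists and the exit code would be lost.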
	}
	// Close pipes to terminate routeOutput() go routines.
	ctr.closeProcessPipes(pid)
Is this not required anymore?
We need the libcontainer vendoring to be updated so that we can rely on a recent fix preventing libcontainer code from waiting on signalled processes when a subreaper is already set on the system.

Fixes #177

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Our agent had a flaw: depending on what the container user does, it can be left with zombie processes. The explanation is that we only wait for container and exec processes, but we don't handle the children and grandchildren spawned by those initial container and exec processes.

Such a case can be demonstrated when the container process spawns a child process, which in turn spawns another child process. If the child terminates before its own children, those children expect to be reaped by the init process of the PID namespace, i.e. the container process. But this container process is not written to reap descendant processes, which leaves a zombie process behind.

This commit solves those cases by setting the agent process, which will be the parent of all container processes, as a subreaper. This way, it will receive the SIGCHLD signal for any child left behind, and will reap it.

Notice that this patch makes sure we don't check the error from the libcontainer call to process.Wait(), since it won't be able to reap the container or exec processes as expected. The libcontainer Wait() function is only used here to properly clean up the pipes and internal structures created by libcontainer.

Fixes #177

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
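The mechanism described in the commit message can be sketched with the Go standard library alone. This is a Linux-only illustration, not the code from this PR: `setSubreaper`, `isSubreaper` and `reapAll` are hypothetical names, the `prctl` option values are copied from `<linux/prctl.h>`, and the demo in `main` assumes that `sh` and `sleep` are available.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
	"unsafe"
)

// prctl(2) option values from <linux/prctl.h>.
const (
	prSetChildSubreaper = 36
	prGetChildSubreaper = 37
)

// setSubreaper marks the calling process as a child subreaper, so that
// orphaned descendants are reparented to it instead of to init.
func setSubreaper() error {
	_, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prSetChildSubreaper, 1, 0, 0, 0, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

// isSubreaper reads the subreaper flag back from the kernel.
func isSubreaper() (bool, error) {
	var flag int32
	_, _, errno := syscall.Syscall6(syscall.SYS_PRCTL, prGetChildSubreaper,
		uintptr(unsafe.Pointer(&flag)), 0, 0, 0, 0)
	if errno != 0 {
		return false, errno
	}
	return flag != 0, nil
}

// reapAll collects, without blocking, every child that has already exited.
// It returns a map from pid to exit status (-1 if the child did not exit
// normally, for example when it was killed by a signal).
func reapAll() (map[int]int, error) {
	reaped := make(map[int]int)
	for {
		var ws syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
		switch {
		case err == syscall.EINTR:
			continue
		case err == syscall.ECHILD:
			return reaped, nil // no children at all
		case err != nil:
			return reaped, err
		case pid <= 0:
			return reaped, nil // children exist, none has exited yet
		}
		reaped[pid] = ws.ExitStatus()
	}
}

func main() {
	if err := setSubreaper(); err != nil {
		fmt.Println("prctl failed:", err)
		return
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)

	// The shell exits right away and leaves "sleep" orphaned; the
	// orphan is then reparented to this process.
	if err := exec.Command("sh", "-c", "sleep 0.1 &").Run(); err != nil {
		fmt.Println("could not run the demo command:", err)
		return
	}

	deadline := time.After(2 * time.Second)
	for {
		select {
		case <-sigs:
			reaped, err := reapAll()
			if err != nil {
				fmt.Println("wait4 failed:", err)
				return
			}
			if len(reaped) > 0 {
				fmt.Println("reaped orphaned processes:", len(reaped))
				return
			}
		case <-deadline:
			fmt.Println("no orphan reaped before the deadline")
			return
		}
	}
}
```

Because `wait4(-1, ...)` reaps any child, it competes with code that waits on a specific pid, which is why the commit message says the error from the libcontainer `process.Wait()` call is no longer checked.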
c36a44a
to
bd925f1
Compare
@amshinde I have updated the PR with more comments.
lgtm
@amshinde Could you merge if you're fine with this?