Skip to content
This repository has been archived by the owner on May 6, 2020. It is now read-only.

agent: Make the agent the subreaper of all processes #178

Merged
merged 2 commits into from
Dec 23, 2017

Conversation

sboeuf
Copy link
Contributor

@sboeuf sboeuf commented Dec 5, 2017

agent: Make the agent the subreaper of all processes

Our agent had a flaw, it potentially lives with some zombie processes
according to what has been done by the container user. The explanation
is that we only wait for container and exec processes but we don't
handle the children and grandchildren processes coming from those
initial container and exec processes. The example of such a case can
be shown demonstrated in case the container process spawn a child
process, and this one will also spawn another child process. In case
the child terminates before its own children, those ones expect to be
reaped by the init process of the PID namespace, i.e the container
process. But this container process is not written to reap descendant
processes, which will lead to a zombie process left behind.

This commit solves those cases by setting the agent process, who will
be the father of all container processes, as a subreaper. This way, it
will receive all the SIGCHLD signals from any child left behind, and
will reap them. Notice this patch makes sure we don't check the error
from the libcontainer call to process.Wait() since it won't be able to
reap the container or exec processes as expected. This libcontainer
Wait() function is only used here to properly cleanup the pipes and
internal structures created by libcontainer.

Fixes #177

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>

@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 5, 2017

Fixes #177

@sboeuf sboeuf force-pushed the sboeuf/implement_subreaper branch 3 times, most recently from 386b651 to 52a8fa9 Compare December 11, 2017 22:58
@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 11, 2017

Cannot be merged because of libcontainer issue: opencontainers/runc#1677

@sboeuf sboeuf force-pushed the sboeuf/implement_subreaper branch from 52a8fa9 to 336e6e1 Compare December 15, 2017 20:43
@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 15, 2017

opencontainers/runc#1677 has been fixed with opencontainers/runc#1678 which has been merged !
This PR can be unblocked now.

@sboeuf sboeuf force-pushed the sboeuf/implement_subreaper branch 2 times, most recently from f1dd79c to c36a44a Compare December 18, 2017 20:00
@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 18, 2017

@sameo @amshinde please take a look.

@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 20, 2017

@jodh-intel @grahamwhaley @jcvenegas @sameo @amshinde
Please take a look :)

@devimc
Copy link

devimc commented Dec 20, 2017

I wonder if runC has the same issue

In case the child terminates before its own children, those ones expect to be reaped by the init process of the PID namespace

yee, but

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal (pid_namespaces(7))

basically once the workload ends, all processes are killed, but I guess you want to handle the case where the workload never ends and its children spawn and leave processes running

lgtm

Approved with PullApprove Approved with PullApprove

@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 20, 2017

@devimc, runc does not have the same issue because it actually set itself as a subreaper (there a flag to disable the subreaper if needed).
And yes you're right, I am trying to handle the case where the container process does not end, but the workload has started a bunch of children and grandchildren who might terminate after their parent.

reaper.go Outdated
sync.RWMutex

chansLock sync.RWMutex
exitCodeChans map[int]chan int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining this structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can.

reaper.go Outdated
agentLog.Infof("SIGCHLD pid %d, status %d", pid, status)

exitCodeCh, err := r.getExitCodeCh(pid)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make it clear in the comments that we are only interested in the exit code of the Container and Exec processes, and ignoring the exit code from children and grandchildren(just reaping them)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point, I'll try to add more comments !

agent.go Outdated
@@ -856,12 +895,36 @@ func (p *pod) runContainerProcess(cid, pid string, terminal bool, started chan e

fieldLogger := agentLog.WithField("container-pid", pid)

// This lock is very important to avoid any race with reaper.reap().
// Indeed, if we don't lock this here, we could potentially get the
// SIGCHLD signal before the channel has been created, meaning we will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention exit code channel to be more clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

}
// Close pipes to terminate routeOutput() go routines.
ctr.closeProcessPipes(pid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not required anymore?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sebastien Boeuf added 2 commits December 21, 2017 09:49
We need the libcontainer vendoring to be updated so that we can rely
on a recent fix preventing libcontainer code to wait on signalled
processes in case a subreaper is already set on the system.

Fixes #177

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Our agent had a flaw, it potentially lives with some zombie processes
according to what has been done by the container user. The explanation
is that we only wait for container and exec processes but we don't
handle the children and grandchildren processes coming from those
initial container and exec processes. The example of such a case can
be shown demonstrated in case the container process spawn a child
process, and this one will also spawn another child process. In case
the child terminates before its own children, those ones expect to be
reaped by the init process of the PID namespace, i.e the container
process. But this container process is not written to reap descendant
processes, which will lead to a zombie process left behind.

This commit solves those cases by setting the agent process, who will
be the father of all container processes, as a subreaper. This way, it
will receive all the SIGCHLD signals from any child left behind, and
will reap them. Notice this patch makes sure we don't check the error
from the libcontainer call to process.Wait() since it won't be able to
reap the container or exec processes as expected. This libcontainer
Wait() function is only used here to properly cleanup the pipes and
internal structures created by libcontainer.

Fixes #177

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
@sboeuf sboeuf force-pushed the sboeuf/implement_subreaper branch from c36a44a to bd925f1 Compare December 21, 2017 18:18
@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 21, 2017

@amshinde I have updated the PR with more comments.

Copy link
Contributor

@amshinde amshinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@sboeuf
Copy link
Contributor Author

sboeuf commented Dec 23, 2017

@amshinde Could you merge if you're fine with this ?

@amshinde amshinde merged commit e5fb70c into master Dec 23, 2017
@sboeuf sboeuf deleted the sboeuf/implement_subreaper branch January 10, 2018 21:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants