HostPID Pod Container Cgroup path was residual after container restarts #4040

Closed
Burning1020 opened this issue Sep 28, 2023 · 16 comments · Fixed by #3825 or #4102

Comments

@Burning1020
Contributor

Burning1020 commented Sep 28, 2023

Description

We created a HostPID pod, i.e. one that shares the PID namespace with the host. The container process was killed and then restarted again and again. We found that the container cgroup path under /sys/fs/cgroup/<subsystem>/kubepods/podxxx-xxx/<containerID>/ was left behind.

The reason is that runc kill or runc delete does not actually wait for the container's child processes to exit: p.wait() receives ECHILD immediately, see https://github.com/opencontainers/runc/blob/v1.1.9/libcontainer/init_linux.go#L585C18-L585C18. If any child process is still running, the cgroup path cannot be removed.
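
To illustrate the limitation (a minimal standalone sketch, not runc code): wait(2) can only reap a process's own children, so waiting on any other PID fails immediately with ECHILD, which is what p.wait() runs into for the container's leftover processes.

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var ws syscall.WaitStatus
	// PID 1 is never a child of this process, so Wait4 fails right away with
	// ECHILD ("no child processes") instead of blocking until it exits.
	_, err := syscall.Wait4(1, &ws, 0, nil)
	fmt.Println(err)
}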

Steps to reproduce the issue

  1. Create a HostPID pod whose container has many child processes that keep dying and being spawned.
  2. Kill the main container process multiple times.
  3. The container will be restarted by kubelet.

Describe the results you received and expected

Expected: The container cgroup path is deleted.
Received: The cgroup path still exists.

What version of runc are you using?

runc version 1.1.9
commit: v1.1.9-0-gccaecfcb
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4

Host OS information

No response

Host kernel information

No response

@Burning1020 Burning1020 changed the title Container Cgroup path was residual when HostPID Pod Container Cgroup path was residual after container restarts Sep 28, 2023
@kolyshkin
Contributor

The code you referred to was recently changed in #3825.

In fact, runc kill or runc delete can't use wait(2) because those processes are not children of (this) runc invocation.

I think what happens is kubelet calls runc delete -f, and (in runc 1.1.x) it does not autodetect that this is a host pidns process and thus all the processes should be killed, not just init (this is solved in #3825 and will be available in runc 1.2).

The workaround is to call runc kill -a (for host pidns container) followed by runc delete, or wait till runc 1.2. One other option would be to backport #3825 to runc 1.1.x, but it's pretty large so I'd rather not do that.

@kolyshkin
Contributor

@Burning1020 If you can try with runc from main branch and check if it fixes your issue, that would be great!

@lifubang
Member

lifubang commented Oct 2, 2023

2. Kill the main container process multiple times.

If we create a container with a shared or host PID namespace, then after the init process has died the container's state becomes Stopped, which leads to step 3: 'The container will be restarted by kubelet.'

Maybe we should transition this type of container's state to Stopped not only by checking whether the init process has died, but also by checking whether any pids are left in the cgroup. Or use only the latter condition.
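
A rough sketch of that idea (illustrative only, not existing runc code; the helper name and cgroup path are made up): treat the container as stopped only when cgroup.procs under its cgroup directory is empty.

package main

import (
	"fmt"
	"os"
	"strings"
)

// cgroupIsEmpty is a hypothetical helper: it reads cgroup.procs in the given
// cgroup directory and reports whether any pid is still listed there.
func cgroupIsEmpty(cgroupDir string) (bool, error) {
	data, err := os.ReadFile(cgroupDir + "/cgroup.procs")
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(data)) == "", nil
}

func main() {
	empty, err := cgroupIsEmpty("/sys/fs/cgroup/pids/kubepods/podxxx-xxx/containerID")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("no pids left in cgroup:", empty)
}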

@lifubang
Member

lifubang commented Oct 2, 2023

But I think Kubelet should delete the stopped container first and then start a new one. Did you see some error logs in k8s? Such as: ERRO[0000] Failed to remove paths: map[:/sys/fs/cgroup/unified/test blkio:/sys/fs/cgroup/blkio/user.slice/test cpu:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuacct:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuset:/sys/fs/cgroup/cpuset/test devices:/sys/fs/cgroup/devices/user.slice/test freezer:/sys/fs/cgroup/freezer/test hugetlb:/sys/fs/cgroup/hugetlb/test memory:/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-8.scope/test misc:/sys/fs/cgroup/misc/test name=systemd:/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-8.scope/test net_cls:/sys/fs/cgroup/net_cls,net_prio/test net_prio:/sys/fs/cgroup/net_cls,net_prio/test perf_event:/sys/fs/cgroup/perf_event/test pids:/sys/fs/cgroup/pids/user.slice/user-1000.slice/session-8.scope/test rdma:/sys/fs/cgroup/rdma/test]

@lifubang
Member

lifubang commented Oct 2, 2023

and (in runc 1.1.x) it does not autodetect that this is a host pidns process and thus all the processes should be killed, not just init

I think it does detect this:
https://github.com/opencontainers/runc/blob/v1.1.9/libcontainer/state_linux.go#L38-L44

@kolyshkin
Contributor

Maybe the source of the problem here is that we treat a host-pidns container with a dead init as stopped, which is not quite true. So, a way to fix this is to modify runc to say the container is in the running state as long as some processes are left in its cgroup, and to use runc kill -a (-a is not needed since runc 1.2) to actually kill those leftovers.

@Burning1020
Contributor Author

Burning1020 commented Oct 8, 2023

@Burning1020 If you can try with runc from main branch and check if it fixes your issue, that would be great!

@kolyshkin I have tried the main branch,

# runc --version
runc version 1.1.0+dev
commit: v1.1.0-791-g90cbd11
spec: 1.1.0+dev
go: go1.18.5
libseccomp: 2.5.0

The bug is not resolved!

@lifubang
Member

lifubang commented Oct 8, 2023

The bug is not resolved!

Sorry, there is a bug for shared pid namespaces in the main branch. Please see #4047.
Do you have detailed steps to reproduce your issue? If you can reproduce it with Docker, that would be even better.

@lifubang
Member

lifubang commented Oct 8, 2023

Do you have detailed steps to reproduce your issue?

For example, what image does the pod's container use? What is the content of the pod's YAML description file?

@Burning1020
Contributor Author

Do you have detailed steps to reproduce your issue?

For example, what image does the pod's container use? What is the content of the pod's YAML description file?

Reproduction is very simple.

  1. Run a Pod in a Kubernetes cluster with HostPID=true; the container process should continuously fork child processes. You can use the Kubernetes node-problem-detector, for example.
  2. Set the memory limit to a very small value, like 50MB.
  3. The npd container is killed by the OOM killer after a few seconds and restarted by kubelet, but the pod is not rebuilt.
  4. Check the cgroup path.

@lifubang
Member

lifubang commented Oct 8, 2023

I can't reproduce it in my test env. You mean the OOM-killed container was deleted by kubelet, but the cgroup path still existed?

@Burning1020
Contributor Author

Burning1020 commented Oct 9, 2023

@lifubang Yes. The purpose of creating an OOM-killed container is to make the container's main process die immediately (killed by SIGKILL), so that its child processes are still alive when they are killed by signalAllProcesses. Because they cannot truly be waited for through p.Wait(), they may still be alive even after c.cgroupManager.Destroy(). That leaves the cgroup path behind.

I think one of the key points of the reproduction is to make the container's main process continuously fork child processes.
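
For anyone trying to reproduce this without node-problem-detector, a toy workload along these lines (my own sketch, not from the report) keeps short-lived children appearing and dying, so some of them are usually still alive at the moment the main process is SIGKILLed:

package main

import (
	"os/exec"
	"time"
)

func main() {
	for {
		// Each child outlives several loop iterations, so at any point in time
		// there are children in the cgroup that do not exit with the parent.
		cmd := exec.Command("sleep", "5")
		if err := cmd.Start(); err != nil {
			continue
		}
		go cmd.Wait() // reap in the background to avoid zombies
		time.Sleep(200 * time.Millisecond)
	}
}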

@Burning1020
Contributor Author

One more thing: when I changed the cgroup removal retries from 5 to 7, the bug was gone.

@lifubang
Member

lifubang commented Oct 9, 2023

I have reproduced this issue with runc alone.

WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpuset/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/hugetlb/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/rdma/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/misc/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/blkio/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/unified/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/devices/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/perf_event/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/freezer/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpuset/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/hugetlb/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/rdma/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/misc/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/unified/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/blkio/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/freezer/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/devices/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/perf_event/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
ERRO[0079] Failed to remove paths: map[:/sys/fs/cgroup/unified/test blkio:/sys/fs/cgroup/blkio/user.slice/test cpu:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuacct:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuset:/sys/fs/cgroup/cpuset/test devices:/sys/fs/cgroup/devices/user.slice/test freezer:/sys/fs/cgroup/freezer/test hugetlb:/sys/fs/cgroup/hugetlb/test memory:/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test misc:/sys/fs/cgroup/misc/test name=systemd:/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test net_cls:/sys/fs/cgroup/net_cls,net_prio/test net_prio:/sys/fs/cgroup/net_cls,net_prio/test perf_event:/sys/fs/cgroup/perf_event/test pids:/sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test rdma:/sys/fs/cgroup/rdma/test]

But it's very difficult to fix, because with a shared pid namespace we have no efficient way to know whether all container processes have exited, except by reading the pids from the cgroup path. Do you have any ideas for fixing this issue?
I think just increasing the retry count isn't the best way to fix it. We may need several steps (see the sketch after this list):

  1. Before removing the cgroup path, check that there are no pids left in the cgroup;
  2. Increase the retry count;
  3. If there is an error when removing the cgroup dirs, do not destroy the container resources; return the error immediately.
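
Here is a rough sketch of what steps 1 and 3 could look like together (illustrative only, not the actual runc code path): refuse to remove the cgroup directory while pids remain, and return the error to the caller instead of swallowing it.

package main

import (
	"fmt"
	"os"
	"strings"
)

// removeCgroup is a hypothetical helper: it only rmdirs the cgroup directory
// when cgroup.procs is empty, and otherwise returns an error for the caller
// to handle instead of silently retrying or ignoring the failure.
func removeCgroup(cgroupDir string) error {
	data, err := os.ReadFile(cgroupDir + "/cgroup.procs")
	if err != nil {
		return err
	}
	if strings.TrimSpace(string(data)) != "" {
		return fmt.Errorf("cgroup %s still has running processes", cgroupDir)
	}
	return os.Remove(cgroupDir)
}

func main() {
	if err := removeCgroup("/sys/fs/cgroup/pids/test"); err != nil {
		fmt.Println("destroy failed:", err)
	}
}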

@Burning1020
Contributor Author

Yes, it is difficult to fix. That's why I opened this issue for discussion.

  1. Before removing the cgroup path, check that there are no pids left in the cgroup;
  2. Increase the retry count;
  3. If there is an error when removing the cgroup dirs, do not destroy the container resources; return the error immediately.

  1. We should not keep waiting while any process is still alive, as that is uncontrollable.
  2. As you say, increasing the retry count isn't an efficient way.
  3. Maybe returning an error is a solution; let the upper caller decide how to handle this error. We should figure out why this error has been ignored from the very beginning: https://github.com/opencontainers/runc/blob/main/utils_linux.go#L86

Besides this, how about adding the removal of the cgroup to the shim?

@lifubang
Member

3. Maybe returning an error is a solution

Yes, I also think so.

Now in the main branch, because the code in this area has been refactored, there are still some bugs around killing and deleting a container with a shared pid ns, so maybe this issue can't be fixed in release-1.1.
If you know how to fix it in release-1.1, please tell us or feel free to open a PR.
