v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

bcressey · 2023-04-27T05:06:28Z

Description

The most recent Bottlerocket release included an update to runc 1.1.6. Shortly after the release, we received reports of a regression where nodes would fall over after kubelet, systemd, and dbus-broker consumed excessive CPU and memory resources.

In bottlerocket-os/bottlerocket#3057 I narrowed this down via git bisect to e4ce94e which was meant to fix this issue, but instead now causes it to happen consistently.

I've confirmed that reverting that specific patch fixes the regression.

Steps to reproduce the issue

On an EKS 1.26 cluster with a single worker node, apply a consistent load via this spec:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster; sleep $(( ( RANDOM % 10 )  + 1 ));
          restartPolicy: OnFailure
          hostNetwork: true
      parallelism: 100

(repro credit to @yeazelm)

After a short time, the "Path does not exist" and "Failed to delete cgroup paths" errors appear and continue even after the spec is deleted and the load is removed.

Describe the results you received and expected

systemd, kubelet, and dbus-broker all showed high CPU usage.

journalctl -f and busctl monitor showed these messages repeatedly:

Apr 27 00:20:09 ip-10-0-83-192.us-west-2.compute.internal kubelet[1206]: I0427 00:20:09.181437    1206 kubelet_getters.go:306] "Path does not exist" path="/var/lib/kubelet/pods/b08478e4-6c1b-461e-9fb5-e6c6411cf3ef/volumes"

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5417700  Timestamp="Thu 2023-04-27 00:21:08.887527 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=UnitNew
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5417701  Timestamp="Thu 2023-04-27 00:21:08.887570 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=JobNew
  UniqueName=:1.0
  MESSAGE "uos" {
          UINT32 454294;
          OBJECT_PATH "/org/freedesktop/systemd1/job/454294";
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
  };

‣ Type=method_call  Endian=l  Flags=0  Version=1 Cookie=449180  Timestamp="Thu 2023-04-27 00:21:10.622122 UTC"
  Sender=:1.11  Destination=org.freedesktop.systemd1  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=StopUnit
  UniqueName=:1.11
  MESSAGE "ss" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          STRING "replace";
  };

‣ Type=signal  Endian=l  Flags=1  Version=1 Cookie=5418558  Timestamp="Thu 2023-04-27 00:21:09.106241 UTC"
  Sender=:1.0  Path=/org/freedesktop/systemd1  Interface=org.freedesktop.systemd1.Manager  Member=UnitRemoved
  UniqueName=:1.0
  MESSAGE "so" {
          STRING "kubepods-besteffort-podb08478e4_6c1b_461e_9fb5_e6c6411cf3ef.slice";
          OBJECT_PATH "/org/freedesktop/systemd1/unit/kubepods_2dbesteffort_2dpodb08478e4_5f6c1b_5f461e_5f9fb5_5fe6c6411cf3ef_2eslice";
  };

What version of runc are you using?

# runc -v
runc version 1.1.6+bottlerocket
commit: 0f48801a0e21e3f0bc4e74643ead2a502df4818d
spec: 1.0.2-dev
go: go1.19.6
libseccomp: 2.5.4

Host OS information

Bottlerocket

Host kernel information

Bottlerocket releases cover a variety of kernels, so to break it down a bit:

Kubernetes 1.23, Linux 5.10, cgroup v1 - not affected
Kubernetes 1.24, Linux 5.15, cgroup v1 - affected
Kubernetes 1.25, Linux 5.15, cgroup v1 - affected
Kubernetes 1.26, Linux 5.15, cgroup v2 - not affected

The text was updated successfully, but these errors were encountered:

AkihiroSuda · 2023-04-27T05:28:42Z

Kubernetes 1.23, Linux 5.10, cgroup v1 - not affected
Kubernetes 1.24, Linux 5.15, cgroup v1 - affected

Could this be rather a regression on the kernel side?

bcressey · 2023-04-27T05:59:13Z

Could this be rather a regression on the kernel side?

Conceivably, although:

The runc 1.1.5 to 1.1.6 change was the only runtime (vs. first boot time) change in the affected Bottlerocket release
no kernel changes were included
the "misc" controller is not present in the 5.10 kernel, so only images using the 5.15 kernel would be affected

I am happy to help provide additional logs or dig deeper, since I can reproduce this on-demand easily.

What's puzzling to me is why this apparently fixed the Kubernetes issue for some users, but triggers it reliably here where we never saw the problem before.

It's possible it comes down to kernel config differences or systemd versions or something like that.

kolyshkin · 2023-04-28T21:53:21Z

I suspect that runc 1.1.6 binary creates misc cgroup, and then kubelet uses runc's libcontainer (of an older version) to remove it. That older libcontainer version doesn't know about misc (and systemd doesn't know about it either), so it's not removed.

Thus, bumping the runc/libcontainer dependency to 1.1.6 should fix this.

bcressey · 2023-04-29T05:05:12Z

@dims pointed me at #3028 which was helpful background for me.

Let me try to step through the timeline as I understand it:

kubernetes/kubernetes/issues/112124 is opened and narrowed down to the "misc" cgroup being added to the pod, preventing its cleanup.
Although runc is mentioned, prior to e4ce94e in 1.1.6, runc did not always join containers to the misc controller.
kubelet depends on runc's cgroup libraries, so in order to clean up pods using the "misc" controller, runc's cgroup library needed to be updated to be aware of it, so that kubelet's dependency could be updated.
Meanwhile, any host with a 5.13+ kernel that updates to runc 1.1.6 under the assumption that it's a stable bugfix release now experiences the original failure from (1).
kubelet can now include the updated cgroup library, so new releases of kubelet can address the problems in (1) and (4).

One way to frame it is:

kubelet uses runc's cgroup library, which is a fully supported use case, and since kubelet needed a library update in order to handle the unexpected "misc" controller, and since that change did not break the library API, it was appropriate for a stable release.

Another way might be:

End users install the latest binary release of runc alongside the latest binary release of their desired Kubernetes version. However, the latest runc binary breaks at least some of the latest Kubernetes binaries, given the inherent time lag involved in coordinating multiple project releases. Therefore the change was not appropriate for a stable release.

I don't intend to criticize, I'm just trying to understand the path forward. If the only safe path is to update runc and kubelet's runc dependency in lockstep, I can work with that. If they're expected to be independent, I can file bug reports if it ever turns out that they're not. Right now I'm not sure.

(Would it have been possible to add awareness of the "misc" controller to the cgroup library, so kubelet could handle its existence, without also changing the runc binary's default behavior to join that controller?)

dims · 2023-04-29T13:23:43Z

@bcressey right now the sequence of events for the record are:

runc makes a release
containerd switches to this new runc release ( runc binary used in tests .. https://github.com/containerd/containerd/blob/main/script/setup/runc-version )
k8s updates to runc release once containerd stabilizes ( both the vendor/ and the runc binary used in some CI jobs )

Also once a release of k8s is made, don't go updating the vendor-ed dependency usually (this current snafu could be an exception), but we may update the binary we test.

So currently what we tell folks is that a distro should probably follow the lead from k8s (and not jump ahead of the signals we get from the dozens of CI jobs across containerd and k8s).

bcressey · 2023-04-29T15:34:50Z

So currently what we tell folks is that a distro should probably follow the lead from k8s (and not jump ahead of the signals we get from the dozens of CI jobs across containerd and k8s).

@dims how does this work in a situation where the runc update also contains security fixes?

That's not the case with runc 1.1.6, but it did fix at least one concerning bug around adding containers to the proper cgroup, which is why we pulled it in to Bottlerocket.

So this is not that situation, exactly, but it's very much at the top of my mind, and not just a theoretical exercise.

dims · 2023-04-29T18:44:35Z

@dims how does this work in a situation where the runc update also contains security fixes?

@bcressey yes, this is a problem currently for sure.

kolyshkin · 2023-05-01T18:10:22Z

I don't intend to criticize, I'm just trying to understand the path forward. If the only safe path is to update runc and kubelet's runc dependency in lockstep, I can work with that. If they're expected to be independent, I can file bug reports if it ever turns out that they're not. Right now I'm not sure.

(Would it have been possible to add awareness of the "misc" controller to the cgroup library, so kubelet could handle its existence, without also changing the runc binary's default behavior to join that controller?)

To clarify:

a newer runc binary which is aware of misc controller + kubernetes with old runc/libcontainer which is not aware of misc controller is a problem.
an older runc binary + newer kubernetes is not a problem

In other words, there is no need for runc binary and runc/libcontainer to be totally in sync. However, if runc binary knows about misc but k8s don't, it's a problem (provided we have cgroup v1 and a sufficiently new kernel).

One way to fix this would be for runc/libcontainer's Destroy method to remove all controllers, known and unknown. The problem here is cgroup v1 hierarchy is a forest, not a tree, i.e. we have a bunch of paths like /sys/fs/cgroup/<controller>/<some/path>, and the list of known controllers is hardcoded.

There may be a way to fix this issue. Let me see...

kolyshkin · 2023-05-01T18:54:31Z

There may be a way to fix this issue. Let me see...

Such a way will still require updating runc/libcontainer in kubernetes, so it will not help the current problem, only the future ones that are similar. I am trying to justify if we should try to fix it or not.

Arguments for:

the issue is pretty bad, high CPU usage and tons of logs is no good;
it is likely that the future kernels will add more controllers, so a similar issue may happen again.

Arguments against:

cgroup v1 is going to be deprecated in a few years (fingers crossed);
the implementation does not look trivial.

kolyshkin · 2023-05-11T18:54:09Z

I spent more than half a day last week working on this, and it's not very easy. So, let's hope the kernel will not add more controllers any time soon.

In the meantime, here are the fixes for kubernetes:

dims · 2023-05-12T00:32:03Z

let's hope the kernel will not add more controllers any time soon.

🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾 🤞🏾

hakman · 2023-05-13T02:07:02Z

@kolyshkin To help others bumping into this issue, would be helpful to update the release notes for v1.1.6 to make it clear that it may / is a breaking change. Having "cgroup v1 drivers are now aware of misc controller" as a quick mention mention doesn't fully express the "sadness" it brings. 🥲

kolyshkin · 2023-05-22T17:41:18Z

@hakman thanks, I've added the "Known issues" section into https://github.com/opencontainers/runc/releases/tag/v1.1.6

kolyshkin · 2023-08-25T02:37:27Z

I believe this issue is addressed (as much as we can), so closing.

This improves daemon.log retention from a few hours to week (i.e. without log-rotation). This enables debugability support if customer stays on Kubernetes 1.24 or 1.25 for any period of time. This is a manual backport of PRs 110496 and 117241 by Kir Kolyshkin. Cherry-pick of: kubernetes/kubernetes#117892 ([1.24] vendor: bump runc to 1.1.6) Cherry-pick of: kubernetes/kubernetes#117682 ([1.25] vendor: bump runc to 1.1.6) This change is not required in Kubernetes 1.26 upwards. This issue comes from the integration of the Misc controller in Kernel 5.13 . Problem comes from a discrepancy between the code creating the "Misc" Cgroup and the code cleaning the Cgroup that doesn't handle the "Misc" Cgroup, leaving it behind. runc integrated a fix in 1.1.6, HOWEVER kubelet depends on runc's cgroup libraries. In order to clean up pods using the new "Misc" controller, runc cgroup library need to be updated to be aware of it. So even if our system run 1.1.6 but kubelet is not build with these library, the problem occurs. See more details at: https://www.suse.com/support/kb/doc/?id=000021270 opencontainers/runc#3849 (comment) Test Plan PASS: Install iso as AIO-SX PASS: Build-pkg 1.24 and 1.25 successfully. PASS: AIO-SX USM upgrade with K8S 1.24. Verify no log flood. PASS: AIO-SX: K8S Upgrade from 1.24 to 1.25, verify no log flood. PASS: make WHAT="cmd/kube-proxy cmd/kube-apiserver \ cmd/kube-controller-manager cmd/kubelet cmd/kubeadm \ cmd/kube-scheduler cmd/kubectl cmd/genman" Closes-Bug: 2098478 Change-Id: Iea84aceef74cad45f514bb987010e12880645582 Signed-off-by: Saba Touheed Mujawar <sabatouheed.mujawar@windriver.com>

AkihiroSuda added kind/performance regression area/cgroupv1 labels Apr 27, 2023

thaJeztah added this to the 1.1.7 milestone Apr 27, 2023

AkihiroSuda removed this from the 1.1.7 milestone Apr 27, 2023

dims mentioned this issue Apr 27, 2023

[runc][1.1.6] Adding misc controller to cgroup v1 makes kubelet sad kubernetes/kubernetes#117647

Closed

BenTheElder mentioned this issue May 12, 2023

[WIP] testing fixes ... kubernetes-sigs/kind#3219

Closed

BenTheElder mentioned this issue May 12, 2023

roll back runc to 1.1.5 kubernetes-sigs/kind#3220

Merged

hakman mentioned this issue May 12, 2023

Update runc to v1.1.7 kubernetes/kops#15375

Merged

BenTheElder mentioned this issue May 25, 2023

upgrading to runc 1.1.6 / 1.1.7 breaks kubernetes-sigs/kind#3223

Closed

eero-t mentioned this issue Aug 15, 2023

libct/cg/stats: support misc for cgroup v2 #3972

Merged

2 tasks

kolyshkin closed this as completed Aug 25, 2023

jaxesn mentioned this issue Jan 16, 2024

DO NOT MERGE: patches runc for al2023 kind image runing on al2 host aws/eks-anywhere-build-tooling#2821

Closed

zagg-bot bot mentioned this issue May 13, 2024

Link Checker Report zaggash/rancherkb-fuzz#2

Open

This was referenced Oct 21, 2024

🐘 Switch toopencontainers/runc as a library (Was: Explore replacing opencontainers/runc with containerd/cgroups) kubernetes/kubernetes#128157

Open

Support for exposing PSI metrics google/cadvisor#3083

Open

dims mentioned this issue Oct 22, 2024

vendor: bump runc/libcontainer to v1.2.1 kubernetes/kubernetes#128276

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

bcressey commented Apr 27, 2023 •

edited

Loading

AkihiroSuda commented Apr 27, 2023

bcressey commented Apr 27, 2023

kolyshkin commented Apr 28, 2023

bcressey commented Apr 29, 2023

dims commented Apr 29, 2023

bcressey commented Apr 29, 2023

dims commented Apr 29, 2023

kolyshkin commented May 1, 2023

kolyshkin commented May 1, 2023

kolyshkin commented May 11, 2023

dims commented May 12, 2023

hakman commented May 13, 2023

kolyshkin commented May 22, 2023

kolyshkin commented Aug 25, 2023

v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

v1.1.6 regression: adding misc controller to cgroup v1 makes kubelet sad #3849

Comments

bcressey commented Apr 27, 2023 • edited Loading

Description

Steps to reproduce the issue

Describe the results you received and expected

What version of runc are you using?

Host OS information

Host kernel information

AkihiroSuda commented Apr 27, 2023

bcressey commented Apr 27, 2023

kolyshkin commented Apr 28, 2023

bcressey commented Apr 29, 2023

dims commented Apr 29, 2023

bcressey commented Apr 29, 2023

dims commented Apr 29, 2023

kolyshkin commented May 1, 2023

kolyshkin commented May 1, 2023

kolyshkin commented May 11, 2023

dims commented May 12, 2023

hakman commented May 13, 2023

kolyshkin commented May 22, 2023

kolyshkin commented Aug 25, 2023

bcressey commented Apr 27, 2023 •

edited

Loading