
Kubelet: invalid configuration: cgroup-root ["kubelet"] doesn't exist #1969

Closed
aojea opened this issue Dec 8, 2020 · 18 comments · Fixed by #1995
Labels: kind/bug · lifecycle/active · priority/critical-urgent

Comments

@aojea
Contributor

aojea commented Dec 8, 2020

What happened:

Running a node image built with kind at commit 92f54895 in GitHub Actions, using kind 0.9.0 to create the cluster, makes kubelet fail to start with the following error:

Dec 07 19:16:44 ovn-control-plane kubelet[994]: F1207 19:16:44.170173 994 server.go:269] failed to run Kubelet: invalid configuration: cgroup-root ["kubelet"] doesn't exist
Dec 07 19:16:44 ovn-control-plane kubelet[994]: goroutine 1 [running]:

How to reproduce it (as minimally and precisely as possible):

I've tried to reproduce it locally without success:

./kind.v0.9.0 create cluster --image aojea/kindnode:kind92f54895_k8se1c617a88ec

The failing job with the errors and logs is here:
https://github.com/aojea/ovn-kubernetes/runs/1512853588?check_suite_focus=true

Bear in mind that GitHub Actions has its own Docker installation, and we have had issues running KIND there before. The docker info output from the runner is:

Client:
 Debug Mode: false
 Plugins:
  buildx: Build with BuildKit (Docker Inc., 0.4.2+azure)

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 22
 Server Version: 19.03.13+azure
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: 
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-1031-azure
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 6.791GiB
 Name: fv-az183-477
 ID: ZDHP:ZO2E:H6PK:BNIM:UC36:GA3L:TTM6:4HUU:GWDU:GGRT:AMDI:I2E7
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: githubactions
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
aojea added the kind/bug label Dec 8, 2020
BenTheElder added the priority/critical-urgent label Dec 9, 2020
@aojea
Contributor Author

aojea commented Dec 10, 2020

We have a repro and a CI job to test that we don't regress: #1965 (comment)

@BenTheElder
Member

This really doesn't make sense. git diff HEAD..v0.9.0 -- ./pkg/cluster/internal/providers/docker/ shows only network changes.

The image changed, yes. But changing kind binary versions shouldn't affect it; we basically haven't changed how we run nodes (except network setup) 🤔

@BenTheElder
Member

I've been comparing the container inspect.json files from the various CI results. Nothing shows up.

BenTheElder self-assigned this Dec 12, 2020
BenTheElder added the lifecycle/active label Dec 12, 2020
@BenTheElder
Member

I can't repro locally:

[bentheelder@bentheelder:~/go/src/sigs.k8s.io/kind·2020-12-11T22:23:32-0800·@0ff469a7]
$ kind --version
kind version 0.9.0
[bentheelder@bentheelder:~/go/src/sigs.k8s.io/kind·2020-12-11T22:25:03-0800·@0ff469a7]
$ docker exec kind-control-plane ps aux | grep cgroup-root
root         688  3.7  0.1 1874900 89188 ?       Ssl  06:22   0:05 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --fail-swap-on=false --node-ip=172.17.0.3 --provider-id=kind://docker/kind/kind-control-plane --fail-swap-on=false --cgroup-root=/kubelet
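For anyone else debugging the same failure, a quick way to check whether the cgroup hierarchy that --cgroup-root=/kubelet points at actually exists inside the node container is something like the following sketch (kind-control-plane is the default control-plane container name for a cluster named "kind"; adjust for your cluster):

# show the kubelet flags in use on the node
docker exec kind-control-plane ps aux | grep cgroup-root
# cgroup v1 host: kubelet expects a "kubelet" directory under each controller mount
docker exec kind-control-plane sh -c 'ls -d /sys/fs/cgroup/*/kubelet'
# cgroup v2 host (unified hierarchy): a single /sys/fs/cgroup/kubelet directory
docker exec kind-control-plane ls -d /sys/fs/cgroup/kubelet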

@aojea
Contributor Author

aojea commented Dec 12, 2020

There are similar issues in Kubernetes; it has to be related to cgroups:

kubernetes/kubernetes#75832
kubernetes/kubernetes#49281
kubernetes/kubernetes#92175

@BenTheElder
Member

Those issues reference this flag, but I'm not sure they're really related.

@BenTheElder
Member

BenTheElder commented Dec 14, 2020

so I wasn't testing with 1.20 though (kubernetes HEAD is a bit ahead of that), probably not related but something to confirm.

EDIT: still works fine locally

@aojea
Contributor Author

aojea commented Dec 15, 2020

so I wasn't testing with 1.20 though (kubernetes HEAD is a bit ahead of that), probably not related but something to confirm.

EDIT: still works fine locally

I've added more verbosity to the CI job and we have more info:

Dec 15 11:07:14 kind-control-plane kubelet[978]: I1215 11:07:14.725431 978 server.go:470] Sending events to api server.
Dec 15 11:07:14 kind-control-plane kubelet[978]: I1215 11:07:14.725896 978 cgroup_manager_linux.go:294] The Cgroup [kubelet] has some missing paths: [/sys/fs/cgroup/hugetlb/kubelet /sys/fs/cgroup/cpu,cpuacct/kubelet /sys/fs/cgroup/memory/kubelet /sys/fs/cgroup/pids/kubelet /sys/fs/cgroup/cpuset/kubelet /sys/fs/cgroup/cpu,cpuacct/kubelet /sys/fs/cgroup/systemd/kubelet]
Dec 15 11:07:14 kind-control-plane kubelet[978]: F1215 11:07:14.725994 978 server.go:269] failed to run Kubelet: invalid configuration: cgroup-root ["kubelet"] doesn't exist
Dec 15 11:07:14 kind-control-plane kubelet[978]: goroutine 1 [running]:
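Those missing paths line up with --cgroup-root=/kubelet on a cgroup v1 host: kubelet wants a kubelet/ cgroup under every mounted controller. As a rough manual workaround sketch, only for diagnosis (creating these directories is what the node image entrypoint is supposed to do; this assumes cgroup v1 and a writable cgroup filesystem inside the node):

docker exec kind-control-plane sh -c '
  # create the /kubelet cgroup under every v1 controller mount, then retry kubelet
  for d in /sys/fs/cgroup/*/; do mkdir -p "${d}kubelet"; done
  systemctl restart kubelet
'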

@aojea
Contributor Author

aojea commented Dec 15, 2020

OK, we are progressing: it turns out it is not an issue with the kind version that creates the cluster, it fails with all of them 😄

https://github.com/kubernetes-sigs/kind/runs/1557595731?check_suite_focus=true

@BenTheElder
Member

fix inbound :-)

@BenTheElder
Member

@aojea can you confirm this is resolved in your downstream environment?

@AkihiroSuda
Member

Which k/k commit is related to this?

AkihiroSuda added a commit to AkihiroSuda/kind that referenced this issue Jan 20, 2021
v2 support for `fix_cgroup()` was broken: kubernetes-sigs#2013

As CgroupNS is enabled by default on v2 hosts, we do not need to
mess around the cgroup mounts.

However, at least we need to create "/kubelet" cgroup (kubernetes-sigs#1969).

Fix kubernetes-sigs#2013

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
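For reference, the idea behind the fix is simply to make sure the /kubelet cgroup exists before kubelet starts. A minimal sketch of that logic (illustrative only, not the actual entrypoint code from #1995):

# run inside the node container before kubelet starts
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
  # cgroup v2 unified hierarchy: cgroupns hides the host layout,
  # so only the /kubelet cgroup itself needs to be created
  mkdir -p /sys/fs/cgroup/kubelet
else
  # cgroup v1: create /kubelet under every controller mount
  for d in /sys/fs/cgroup/*/; do
    mkdir -p "${d}kubelet"
  done
fi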
@BenTheElder
Member

@AkihiroSuda this was a regression in a different iteration of the entrypoint / cgroups changes in this repo IIRC, not the Kubernetes version or the kind binary.

@BenTheElder
Member

Aside: GitHub adds a lot of visual noise when an issue is referenced in a commit; it will add an event to the issue every time any form of that commit is pushed or merged in any fork, including people merging in their own forks, etc.

In kubernetes/kubernetes, commits doing this are actually banned by automation:
https://github.com/kubernetes/test-infra/blob/55dab213bff9ed663013072c14f17c19dc9c625f/prow/plugins/invalidcommitmsg/invalidcommitmsg_test.go#L55

@AkihiroSuda
Member

Sorry for the noise 🙇

@BenTheElder
Member

No worries, more of an FYI than anything. I think kind is small enough not to have major issues with it anyhow, but I've seen it get pretty out of hand on Kubernetes issues 🙃

That comment / code covers some other fun ones; the most ridiculous one on GitHub's part, IMHO, is that @user mentions in commits actually cause an email to the user under the same circumstances (~push/merge in any fork) for users with typical notification settings 🔥

maelvls pushed a commit to maelvls/kind that referenced this issue Jul 1, 2021
@luhongzhen317

Has the problem been solved? I have a similar problem.

@YanzhaoLi

Hit the same issue, but I checked my entrypoint, which already includes the fix from https://github.com/kubernetes-sigs/kind/pull/1995/files.

I have two kind clusters; one works well, but the other one hits this issue.
