
Kubelet Failed to Start After Node Restart (cri_dockerd_enabled: true) #8734

Closed · sashokbg opened this issue Apr 20, 2022 · 8 comments

Labels: kind/bug, lifecycle/rotten

sashokbg commented Apr 20, 2022

Environment:

  • Bare metal, amdx64

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.13.0-39-generic x86_64
    NAME="Ubuntu"
    VERSION="20.04.4 LTS (Focal Fossa)"

  • Version of Ansible (ansible --version):
    ansible [core 2.12.3]
    config file = None
    configured module search path = ['/home/alexander/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /home/alexander/.local/lib/python3.10/site-packages/ansible
    ansible collection location = /home/alexander/.ansible/collections:/usr/share/ansible/collections
    executable location = /home/alexander/.local/bin/ansible
    python version = 3.10.4 (main, Mar 23 2022, 23:05:40) [GCC 11.2.0]
    jinja version = 2.11.3
    libyaml = True

  • Version of Python (python --version):
    Python 3.10.4

Kubespray version (commit) (git rev-parse --short HEAD):
dc0dfad4

Network plugin used:
flannel

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/sashokbg/cafb0c6b1a264d72febbc797b865afd8

Command used to invoke ansible:
ansible-playbook -i inventory/home_cloud_cluster/hosts.yaml --become --become-user=root cluster.yml --private-key ~/.ssh/id_rsa --become-user=root --user home-cloud-user

Output of ansible run:
Unfortunately I don't have it, but the playbook finished with no errors.
Should I reinstall to obtain more info?

Anything else we need to know:
I have cri_dockerd_enabled: true in my inventory, and after restarting my control plane node the kubelet service fails to start.

Journalctl logs of kubelet

avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961179    5811 container_manager_linux.go:286] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName:/systemd/system.slice SystemCgroupsName: KubeletCgroupsName:/systemd/system.slice ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:true NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[cpu:{i:{value:200 scale:-3} d:{Dec:<nil>} s:200m Format:DecimalSI} memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>} {Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961210    5811 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961223    5811 container_manager_linux.go:321] "Creating device plugin manager" devicePluginEnabled=true
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961237    5811 manager.go:141] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961265    5811 state_mem.go:36] "Initialized new in-memory state store"
avril 21 00:28:04 node1 kubelet[5811]: W0421 00:28:04.961518    5811 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/run/cri-dockerd.sock /var/run/cri-dockerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory". Reconnecting...
avril 21 00:28:04 node1 kubelet[5811]: E0421 00:28:04.961616    5811 server.go:302] "Failed to run kubelet" err="failed to run Kubelet: unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory\""
avril 21 00:28:04 node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
avril 21 00:28:04 node1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
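
The key line above is the dial error against /var/run/cri-dockerd.sock: kubelet cannot reach the CRI socket because cri-dockerd is not running after the reboot. A quick way to confirm this on the node (a diagnostic sketch using standard systemd/shell tooling, not output captured from this cluster):

# Is the cri-dockerd unit running, or was it left disabled/masked?
systemctl status cri-dockerd.service
systemctl is-enabled cri-dockerd.service

# Does the CRI socket that kubelet is dialing actually exist?
ls -l /var/run/cri-dockerd.sock
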
@sashokbg sashokbg added the kind/bug label Apr 20, 2022
@sashokbg sashokbg changed the title from "RestartPolicy set to "No" with Docker Container Manager / Bare Metal" to "Kubelet Failed to Start After Node Restart" Apr 20, 2022
@sashokbg sashokbg changed the title from "Kubelet Failed to Start After Node Restart" to "Kubelet Failed to Start After Node Restart (cri_dockerd_enabled: true)" Apr 20, 2022
sashokbg (Author) commented:

I managed to manually get my cluster running again by enabling and starting the cri-dockerd service in systemd.

sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
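
To verify the workaround took effect, a check along these lines should show the socket back and kubelet recovering (a sketch, not output captured from this cluster):

systemctl is-active cri-dockerd.service     # expect "active"
ls -l /var/run/cri-dockerd.sock             # socket should exist again
sudo systemctl restart kubelet.service
journalctl -u kubelet -n 20 --no-pager      # no more cri-dockerd.sock dial errors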

I will take a look at the Ansible roles and see why it wasn't enabled by default.

cristicalin (Contributor) commented Apr 27, 2022

This is an issue with the reset play in general: when resetting services, it masks them. There was a fix proposed for containerd a few days ago, but I'm guessing the issue is more widespread and should be addressed generically.
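
For reference, the masked state and the manual fix for the behaviour described above look roughly like this (a sketch using plain systemctl, not the Kubespray tasks themselves):

systemctl is-enabled cri-dockerd.service    # prints "masked" if the reset play masked it
sudo systemctl unmask cri-dockerd.service
sudo systemctl enable --now cri-dockerd.service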

The PR in question: #8726

rickerc (Contributor) commented Apr 27, 2022

The issue is, I think, specific to apt-based systems and to runtimes where the Ansible for the container runtime does some variant of:

  • stop apt-installed runtime
  • uninstall apt-installed runtime
  • download known good binary
  • register known good as systemd service
  • start the new systemd service, which then fails because step 2 above left the service masked

I added an unmask for cri-docker and docker to #8726, which will probably fix this. The container runtime install plays aren't all that cookie-cutter, so there may be additional tweaks needed.
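
Expressed as an Ansible task, an unmask of that kind might look roughly like this (an illustrative sketch, not the exact change in #8726; the task name and placement are assumptions):

- name: cri-dockerd | ensure service is not masked
  systemd:
    name: cri-dockerd
    daemon_reload: yes
    masked: no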

CharudathGopal commented:

Just FYI, #8726 did not help.

Locally I made the changes below, which seem to be helping; please take a look and suggest whether this is fine.

  • Added a task to ensure cri-dockerd is started and enabled [roles/container-engine/cri-dockerd/tasks/main.yml]
- name: Ensure cri-dockerd is started and enabled
  systemd:
    name: cri-dockerd
    daemon_reload: yes
    enabled: yes
    state: started
  • Updated the cri-dockerd handler to the following [roles/container-engine/cri-dockerd/handlers/main.yml]
- name: cri-dockerd | reload systemd
  systemd:
    name: cri-dockerd
    daemon_reload: true
    masked: no
    enabled: yes
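
With both changes, the role clears any mask and explicitly enables and starts cri-dockerd, so the unit should come back up on reboot before kubelet dials the socket. A quick post-playbook check might be (a sketch):

systemctl is-enabled cri-dockerd.service    # expect "enabled", not "masked"
systemctl is-active cri-dockerd.service     # expect "active"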

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Oct 12, 2022
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 11, 2022
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Dec 11, 2022