
Kubelet Failed to Start After Node Restart (cri_dockerd_enabled: true) #8734

Closed · sashokbg opened this issue Apr 20, 2022 · 8 comments

Labels: kind/bug, lifecycle/rotten

sashokbg commented Apr 20, 2022

Environment:

  • Bare metal, amdx64

  • OS (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 5.13.0-39-generic x86_64
    NAME="Ubuntu"
    VERSION="20.04.4 LTS (Focal Fossa)"

  • Version of Ansible (ansible --version):
    ansible [core 2.12.3]
    config file = None
    configured module search path = ['/home/alexander/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /home/alexander/.local/lib/python3.10/site-packages/ansible
    ansible collection location = /home/alexander/.ansible/collections:/usr/share/ansible/collections
    executable location = /home/alexander/.local/bin/ansible
    python version = 3.10.4 (main, Mar 23 2022, 23:05:40) [GCC 11.2.0]
    jinja version = 2.11.3
    libyaml = True

  • Version of Python (python --version):
    Python 3.10.4

Kubespray version (commit) (git rev-parse --short HEAD):
dc0dfad4

Network plugin used:
flannel

Full inventory with variables (ansible -i inventory/sample/inventory.ini all -m debug -a "var=hostvars[inventory_hostname]"):
https://gist.github.com/sashokbg/cafb0c6b1a264d72febbc797b865afd8

Command used to invoke ansible:
ansible-playbook -i inventory/home_cloud_cluster/hosts.yaml --become --become-user=root cluster.yml --private-key ~/.ssh/id_rsa --become-user=root --user home-cloud-user

Output of ansible run:
Unfortunately I don't have it, but the playbook finished with no errors.
Should I reinstall to obtain more info?

Anything else we need to know:
I have cri_dockerd_enabled: true in my inventory, and after restarting my control plane node the kubelet service fails to start.

Journalctl logs of kubelet

avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961179    5811 container_manager_linux.go:286] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName:/systemd/system.slice SystemCgroupsName: KubeletCgroupsName:/systemd/system.slice ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:systemd KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:true NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: ReservedSystemCPUs: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[cpu:{i:{value:200 scale:-3} d:{Dec:<nil>} s:200m Format:DecimalSI} memory:{i:{value:536870912 scale:0} d:{Dec:<nil>} s: Format:BinarySI}] SystemReserved:map[] HardEvictionThresholds:[{Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>} {Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerPolicyOptions:map[] ExperimentalTopologyManagerScope:container ExperimentalCPUManagerReconcilePeriod:10s ExperimentalMemoryManagerPolicy:None ExperimentalMemoryManagerReservedMemory:[] ExperimentalPodPidsLimit:-1 EnforceCPULimits:true CPUCFSQuotaPeriod:100ms ExperimentalTopologyManagerPolicy:none}
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961210    5811 topology_manager.go:133] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961223    5811 container_manager_linux.go:321] "Creating device plugin manager" devicePluginEnabled=true
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961237    5811 manager.go:141] "Creating Device Plugin manager" path="/var/lib/kubelet/device-plugins/kubelet.sock"
avril 21 00:28:04 node1 kubelet[5811]: I0421 00:28:04.961265    5811 state_mem.go:36] "Initialized new in-memory state store"
avril 21 00:28:04 node1 kubelet[5811]: W0421 00:28:04.961518    5811 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/run/cri-dockerd.sock /var/run/cri-dockerd.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory". Reconnecting...
avril 21 00:28:04 node1 kubelet[5811]: E0421 00:28:04.961616    5811 server.go:302] "Failed to run kubelet" err="failed to run Kubelet: unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory\""
avril 21 00:28:04 node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
avril 21 00:28:04 node1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
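
The key line above is the dial error against /var/run/cri-dockerd.sock: kubelet cannot reach the CRI socket because cri-dockerd is not running after the reboot. A quick way to confirm this on the node (a diagnostic sketch using standard systemd/shell tooling, not output captured from this cluster):

# Is the cri-dockerd unit running, or was it left disabled/masked?
systemctl status cri-dockerd.service
systemctl is-enabled cri-dockerd.service

# Does the CRI socket that kubelet is dialing actually exist?
ls -l /var/run/cri-dockerd.sock
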
@sashokbg sashokbg added the kind/bug label Apr 20, 2022
@sashokbg sashokbg changed the title from "RestartPolicy set to "No" with Docker Container Manager / Bare Metal" to "Kubelet Failed to Start After Node Restart" Apr 20, 2022
@sashokbg sashokbg changed the title from "Kubelet Failed to Start After Node Restart" to "Kubelet Failed to Start After Node Restart (cri_dockerd_enabled: true)" Apr 20, 2022
sashokbg (Author) commented:

I managed to manually get my cluster running again by enabling and starting the cri-dockerd service in systemd.

sudo systemctl enable cri-dockerd.service
sudo systemctl restart cri-dockerd.service
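
To verify the workaround took effect, a check along these lines should show the socket back and kubelet recovering (a sketch, not output captured from this cluster):

systemctl is-active cri-dockerd.service     # expect "active"
ls -l /var/run/cri-dockerd.sock             # socket should exist again
sudo systemctl restart kubelet.service
journalctl -u kubelet -n 20 --no-pager      # no more cri-dockerd.sock dial errors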

I will take a look at the Ansible roles and see why it wasn't enabled by default.

cristicalin (Contributor) commented Apr 27, 2022

This is an issue with the reset play in general: when resetting services, it masks them. There was a fix proposed for containerd a few days ago, but I'm guessing the issue is more widespread and should be addressed generically.
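
For reference, the masked state and the manual fix for the behaviour described above look roughly like this (a sketch using plain systemctl, not the Kubespray tasks themselves):

systemctl is-enabled cri-dockerd.service    # prints "masked" if the reset play masked it
sudo systemctl unmask cri-dockerd.service
sudo systemctl enable --now cri-dockerd.service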

The PR in question: #8726

rickerc (Contributor) commented Apr 27, 2022

The issue is, I think, specific to apt-based systems and to runtimes where the Ansible for the container runtime does some variant of:

  • stop apt-installed runtime
  • uninstall apt-installed runtime
  • download known good binary
  • register known good as systemd service
  • start the new systemd service, which then fails because step 2 above left the service masked

I added an unmask for cri-docker and docker to #8726, which will probably fix this. The container runtime install plays aren't all that cookie-cutter, so there may be additional tweaks needed.
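
Expressed as an Ansible task, an unmask of that kind might look roughly like this (an illustrative sketch, not the exact change in #8726; the task name and placement are assumptions):

- name: cri-dockerd | ensure service is not masked
  systemd:
    name: cri-dockerd
    daemon_reload: yes
    masked: no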

CharudathGopal commented:

Just FYI, #8726 did not help.

Locally I made the changes below, which seem to be helping; please take a look and suggest whether this is fine.

  • Added a task to ensure cri-dockerd is started and enabled [roles/container-engine/cri-dockerd/tasks/main.yml]
- name: Ensure cri-dockerd is started and enabled
  systemd:
    name: cri-dockerd
    daemon_reload: yes
    enabled: yes
    state: started
  • Updated the cri-dockerd handler to the following [roles/container-engine/cri-dockerd/handlers/main.yml]
- name: cri-dockerd | reload systemd
  systemd:
    name: cri-dockerd
    daemon_reload: true
    masked: no
    enabled: yes
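
With both changes, the role clears any mask and explicitly enables and starts cri-dockerd, so the unit should come back up on reboot before kubelet dials the socket. A quick post-playbook check might be (a sketch):

systemctl is-enabled cri-dockerd.service    # expect "enabled", not "masked"
systemctl is-active cri-dockerd.service     # expect "active"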

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Oct 12, 2022
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 11, 2022
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Dec 11, 2022