
Single node cluster does not recover from SIGHUP to containerd #9271

Closed
jfroy opened this issue Sep 4, 2024 · 9 comments · Fixed by #9273

@jfroy
Contributor

jfroy commented Sep 4, 2024

Bug Report

Description

On a single-node test cluster, sending SIGHUP to containerd causes the cluster to remain in a bad state until reboot.
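For context (this is not part of the original report), a minimal, hypothetical sketch of the trigger, assuming root access in the host PID namespace of the node: it locates the CRI containerd process by the /etc/cri/containerd.toml config path visible in the logs below and sends it SIGHUP.

```go
// repro_sighup.go: hypothetical helper to trigger the issue.
// Assumes it runs as root in the host PID namespace of the Talos node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

func main() {
	procs, err := filepath.Glob("/proc/[0-9]*/cmdline")
	if err != nil {
		panic(err)
	}
	for _, p := range procs {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // process exited between glob and read
		}
		// /proc/<pid>/cmdline is NUL-separated.
		args := strings.Split(string(data), "\x00")
		if len(args) == 0 || !strings.HasSuffix(args[0], "containerd") {
			continue
		}
		// Pick the CRI containerd instance by its config path (seen in the logs).
		if !strings.Contains(string(data), "/etc/cri/containerd.toml") {
			continue
		}
		pid, _ := strconv.Atoi(strings.Split(p, "/")[2])
		fmt.Printf("sending SIGHUP to containerd (pid %d)\n", pid)
		_ = syscall.Kill(pid, syscall.SIGHUP)
	}
}
```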

Reproduction

Logs

192.168.1.13: user: warning: [2024-09-04T15:33:02.169557586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
192.168.1.13: user: warning: [2024-09-04T15:33:02.324784586Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
192.168.1.13: user: warning: [2024-09-04T15:33:02.324811586Z]: [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "bond0", "ip": "192.168.1.8"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324884586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324920586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324929586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324987586Z]: [talos] removed address 192.168.1.8/32 from "bond0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
192.168.1.13: user: warning: [2024-09-04T15:33:07.170148586Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:07.461611586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:07.762811586Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 114795
192.168.1.13: user: warning: [2024-09-04T15:33:12.462382586Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:12.763485586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:13.169984586Z]: [talos] service[cri](Running): Health check successful
192.168.1.13: user: warning: [2024-09-04T15:33:17.962409586Z]: [talos] service[etcd](Running): Started task etcd (PID 117967) for container etcd
192.168.1.13: user: warning: [2024-09-04T15:33:19.016708586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191&timeout=6m37s&timeoutSeconds=397&watch=true\": http2: client connection lost", "error_count": 0}
192.168.1.13: user: warning: [2024-09-04T15:33:20.105594586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=45227191&timeout=9m22s&timeoutSeconds=562&watch=true\": http2: client connection lost"}
192.168.1.13: user: warning: [2024-09-04T15:33:20.616566586Z]: [talos] service[kubelet](Running): Started task kubelet (PID 118131) for container kubelet
192.168.1.13: user: warning: [2024-09-04T15:33:25.906839586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:30.113315586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:33:30.490083586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 1}
192.168.1.13: user: warning: [2024-09-04T15:33:31.696739586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:33:31.965262586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:39.186865586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:42.619515586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 2}
192.168.1.13: user: warning: [2024-09-04T15:33:44.957584586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:33:45.966067586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:33:51.986865586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:58.355483586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 3}
192.168.1.13: user: warning: [2024-09-04T15:34:00.991490586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:01.875312586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:14.514763586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:34:17.752296586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:18.582256586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:19.548487586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 4}
192.168.1.13: user: warning: [2024-09-04T15:34:33.515184586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:43.009688586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:49.476116586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:53.106714586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:35:05.204704586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:35:21.277033586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}

The controller-runtime logs repeat from then on.

Environment

  • Talos version:
    Client:
    Tag: v1.7.6
    SHA: ae67123
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
    Server:
    NODE: 192.168.1.13
    Tag: v1.7.6
    SHA: ae67123-dirty
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
    Enabled: RBAC

  • Kubernetes version: v1.30.4

  • Platform: baremetal x86-64

@smira
Member

smira commented Sep 4, 2024

Can you please send a talosctl support bundle for this?

The core issue, I guess, is that Talos lost track of the fact that it's ready to run a control plane, but I'm not sure why.

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

Here you go. Happy to provide more info if needed.

support.zip

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

That bundle was generated with the cluster in a good steady state, not after SIGHUP-ing containerd. I can generate another one in the bad state, though I am not sure that will work since the apiserver won't be reachable.

@smira
Member

smira commented Sep 4, 2024

The bad state is the interesting one; the apiserver is not needed, as the Talos API should work either way (it's not affected).

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

Here you go.

support-badstate.zip

@smira
Member

smira commented Sep 4, 2024

Thank you, I'll take a look!

@smira
Member

smira commented Sep 4, 2024

The root cause of the bug is:

metadata:
    namespace: runtime
    type: Services.v1alpha1.talos.dev
    id: etcd
    version: 1
    owner: v1alpha1.ServiceController
    phase: running
    created: 2024-09-04T17:52:50Z
    updated: 2024-09-04T17:52:50Z
spec:
    running: true
    healthy: false
    unknown: true

Due to a missing internal event, Talos considers etcd to be unhealthy, and doesn't run the Kubernetes control plane pods.
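To make the failure mode concrete, here is an illustrative Go sketch. It is not the actual Talos ServiceController, and the names (notifier, report, reset) are invented for illustration: a transition-only health notifier can get stuck if its remembered state is not cleared when the service's task is torn down.

```go
// Illustrative only: NOT the actual Talos ServiceController, just a sketch of
// the class of bug. The notifier emits events only on state transitions. If
// its remembered state is not cleared when the service's task is torn down
// (here: containerd receiving SIGHUP), the first report after the restart
// looks like "no change", no event is published, and the resource view stays
// at healthy: false / unknown: true indefinitely.
package main

import "fmt"

type notifier struct {
	lastHealthy map[string]bool // remembered health per service
	events      chan string     // health change events
}

func (n *notifier) report(service string, healthy bool) {
	if prev, ok := n.lastHealthy[service]; ok && prev == healthy {
		return // assumed unchanged: no event issued (the bug when state is stale)
	}
	n.lastHealthy[service] = healthy
	n.events <- fmt.Sprintf("%s healthy=%v", service, healthy)
}

// reset models the missing step: forget the health state when the task is
// torn down, so that the next report is always treated as a change.
func (n *notifier) reset(service string) {
	delete(n.lastHealthy, service)
}

func main() {
	n := &notifier{lastHealthy: map[string]bool{}, events: make(chan string, 4)}

	n.report("etcd", true) // steady state: one event, resource becomes healthy

	// containerd gets SIGHUP, the etcd task dies, and the resource view falls
	// back to unknown; without n.reset("etcd") the next report is suppressed:
	n.report("etcd", true) // no event, so the resource stays unknown

	close(n.events)
	for e := range n.events {
		fmt.Println(e) // prints only the first event
	}
}
```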

@dejoebad

dejoebad commented Sep 5, 2024

I'm experiencing the same issue. In my case, the way to reproduce it is:

  1. using 2 control planes (192.168.11.100, 192.168.11.101) with 1 VIP, 192.168.11.88
  2. both running healthy and ready
  3. deploy php apache
  4. testing with 192.168.11.100:30004, works
  5. testing with 192.168.11.101:30004, works
  6. testing with 192.168.11.88:30004, works
  7. shut down node 192.168.11.101
  8. testing with 192.168.11.100:30004, works
  9. testing with 192.168.11.88:30004, works
  10. testing with 192.168.11.101:30004, NOT working (due to shutdown)
  11. at this point, node 192.168.11.100 is missing its VIP (the VIP has been removed), the controller-manager is unhealthy, and the node never recovers

I suspect the VIP is removed after doing step 10.

@smira
Member

smira commented Sep 5, 2024

I'm experiencing the same issue. In my case, the way to reproduce it is:

Your case is not the same (if you have a problem, please open a separate issue with the relevant support logs attached), and the VIP is not supposed to work for workloads, only for the Kubernetes API server.

jfroy pushed a commit to jfroy/siderolabs-talos that referenced this issue Sep 6, 2024
Otherwise the internal code might assume that the service is still
running and healthy, never issuing a health change event.

Fixes siderolabs#9271

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
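Conceptually, and only as a reading of the commit message above, this corresponds to the reset step in the illustrative sketch earlier: when the task is torn down, the remembered health state has to be invalidated so that a fresh health change event is emitted and the etcd service does not stay stuck in the unknown state shown in the resource dump above.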
smira added a commit to smira/talos that referenced this issue Sep 25, 2024
Otherwise the internal code might assume that the service is still
running and healthy, never issuing a health change event.

Fixes siderolabs#9271

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 07b9179)
github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2024