
Single node cluster does not recover from SIGHUP to containerd #9271

Closed
jfroy opened this issue Sep 4, 2024 · 9 comments · Fixed by #9273

@jfroy
Contributor

jfroy commented Sep 4, 2024

Bug Report

Description

On a single-node test cluster, sending SIGHUP to containerd causes the cluster to remain in a bad state until reboot.
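For context (this is not part of the original report), a minimal, hypothetical sketch of the trigger, assuming root access in the host PID namespace of the node: it locates the CRI containerd process by the /etc/cri/containerd.toml config path visible in the logs below and sends it SIGHUP.

```go
// repro_sighup.go: hypothetical helper to trigger the issue.
// Assumes it runs as root in the host PID namespace of the Talos node.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

func main() {
	procs, err := filepath.Glob("/proc/[0-9]*/cmdline")
	if err != nil {
		panic(err)
	}
	for _, p := range procs {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // process exited between glob and read
		}
		// /proc/<pid>/cmdline is NUL-separated.
		args := strings.Split(string(data), "\x00")
		if len(args) == 0 || !strings.HasSuffix(args[0], "containerd") {
			continue
		}
		// Pick the CRI containerd instance by its config path (seen in the logs).
		if !strings.Contains(string(data), "/etc/cri/containerd.toml") {
			continue
		}
		pid, _ := strconv.Atoi(strings.Split(p, "/")[2])
		fmt.Printf("sending SIGHUP to containerd (pid %d)\n", pid)
		_ = syscall.Kill(pid, syscall.SIGHUP)
	}
}
```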

Reproduction

Logs

192.168.1.13: user: warning: [2024-09-04T15:33:02.169557586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: task "kubelet" failed: exit code 255
192.168.1.13: user: warning: [2024-09-04T15:33:02.324784586Z]: [talos] service[cri](Waiting): Error running Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]), going to restart forever: signal: hangup
192.168.1.13: user: warning: [2024-09-04T15:33:02.324811586Z]: [talos] removing shared IP {"component": "controller-runtime", "controller": "network.OperatorSpecController", "operator": "vip", "link": "bond0", "ip": "192.168.1.8"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324884586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324920586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324929586Z]: [talos] removed static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
192.168.1.13: user: warning: [2024-09-04T15:33:02.324987586Z]: [talos] removed address 192.168.1.8/32 from "bond0" {"component": "controller-runtime", "controller": "network.AddressSpecController"}
192.168.1.13: user: warning: [2024-09-04T15:33:07.170148586Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:07.461611586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:07.762811586Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 114795
192.168.1.13: user: warning: [2024-09-04T15:33:12.462382586Z]: [talos] service[etcd](Waiting): Error running Containerd(etcd), going to restart forever: failed to create task: "etcd": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:12.763485586Z]: [talos] service[kubelet](Waiting): Error running Containerd(kubelet), going to restart forever: failed to create task: "kubelet": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused": unavailable
192.168.1.13: user: warning: [2024-09-04T15:33:13.169984586Z]: [talos] service[cri](Running): Health check successful
192.168.1.13: user: warning: [2024-09-04T15:33:17.962409586Z]: [talos] service[etcd](Running): Started task etcd (PID 117967) for container etcd
192.168.1.13: user: warning: [2024-09-04T15:33:19.016708586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191&timeout=6m37s&timeoutSeconds=397&watch=true\": http2: client connection lost", "error_count": 0}
192.168.1.13: user: warning: [2024-09-04T15:33:20.105594586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "Get \"https://127.0.0.1:7445/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=45227191&timeout=9m22s&timeoutSeconds=562&watch=true\": http2: client connection lost"}
192.168.1.13: user: warning: [2024-09-04T15:33:20.616566586Z]: [talos] service[kubelet](Running): Started task kubelet (PID 118131) for container kubelet
192.168.1.13: user: warning: [2024-09-04T15:33:25.906839586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:30.113315586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:33:30.490083586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 1}
192.168.1.13: user: warning: [2024-09-04T15:33:31.696739586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:33:31.965262586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:39.186865586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:42.619515586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 2}
192.168.1.13: user: warning: [2024-09-04T15:33:44.957584586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:33:45.966067586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:33:51.986865586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:33:58.355483586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 3}
192.168.1.13: user: warning: [2024-09-04T15:34:00.991490586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:01.875312586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:14.514763586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:34:17.752296586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:18.582256586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:19.548487586Z]: [talos] node watch error {"component": "controller-runtime", "controller": "k8s.NodeStatusController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?fieldSelector=metadata.name%3Dkantai1&resourceVersion=45227191\": EOF", "error_count": 4}
192.168.1.13: user: warning: [2024-09-04T15:34:33.515184586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:43.009688586Z]: [talos] kubernetes registry node watch error {"component": "controller-runtime", "controller": "cluster.KubernetesPullController", "error": "failed to list *v1.Node: Get \"https://127.0.0.1:7445/api/v1/nodes?resourceVersion=45227191\": EOF"}
192.168.1.13: user: warning: [2024-09-04T15:34:49.476116586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:34:53.106714586Z]: [talos] kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://192.168.1.8:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=45228871\": dial tcp 192.168.1.8:6443: connect: no route to host"}
192.168.1.13: user: warning: [2024-09-04T15:35:05.204704586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.1.13: user: warning: [2024-09-04T15:35:21.277033586Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}

The controller-runtime logs repeat from then on.

Environment

  • Talos version:
    Client:
    Tag: v1.7.6
    SHA: ae67123
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
    Server:
    NODE: 192.168.1.13
    Tag: v1.7.6
    SHA: ae67123-dirty
    Built:
    Go version: go1.22.5
    OS/Arch: linux/amd64
    Enabled: RBAC

  • Kubernetes version: v1.30.4

  • Platform: baremetal x86-64

@smira
Member

smira commented Sep 4, 2024

Can you please send a talosctl support bundle for this?

The core issue, I guess, is that Talos lost track of the fact that it's ready to run a control plane, but I'm not sure why.

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

Here you go. Happy to provide more info if needed.

support.zip

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

That bundle was generated with the cluster in a good steady state, not after SIGHUP-ing containerd. I can generate another one in the bad state, though I am not sure that will work since the apiserver won't be reachable.

@smira
Member

smira commented Sep 4, 2024

The bad state is the interesting one; the apiserver is not needed, as the Talos API should work either way (it's not affected).

@jfroy
Contributor Author

jfroy commented Sep 4, 2024

Here you go.

support-badstate.zip

@smira
Member

smira commented Sep 4, 2024

Thank you, I'll take a look!

@smira
Member

smira commented Sep 4, 2024

The root cause of the bug is:

metadata:
    namespace: runtime
    type: Services.v1alpha1.talos.dev
    id: etcd
    version: 1
    owner: v1alpha1.ServiceController
    phase: running
    created: 2024-09-04T17:52:50Z
    updated: 2024-09-04T17:52:50Z
spec:
    running: true
    healthy: false
    unknown: true

Due to a missing internal event, Talos considers etcd to be unhealthy, and doesn't run the Kubernetes control plane pods.
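To make the failure mode concrete, here is an illustrative Go sketch. It is not the actual Talos ServiceController, and the names (notifier, report, reset) are invented for illustration: a transition-only health notifier can get stuck if its remembered state is not cleared when the service's task is torn down.

```go
// Illustrative only: NOT the actual Talos ServiceController, just a sketch of
// the class of bug. The notifier emits events only on state transitions. If
// its remembered state is not cleared when the service's task is torn down
// (here: containerd receiving SIGHUP), the first report after the restart
// looks like "no change", no event is published, and the resource view stays
// at healthy: false / unknown: true indefinitely.
package main

import "fmt"

type notifier struct {
	lastHealthy map[string]bool // remembered health per service
	events      chan string     // health change events
}

func (n *notifier) report(service string, healthy bool) {
	if prev, ok := n.lastHealthy[service]; ok && prev == healthy {
		return // assumed unchanged: no event issued (the bug when state is stale)
	}
	n.lastHealthy[service] = healthy
	n.events <- fmt.Sprintf("%s healthy=%v", service, healthy)
}

// reset models the missing step: forget the health state when the task is
// torn down, so that the next report is always treated as a change.
func (n *notifier) reset(service string) {
	delete(n.lastHealthy, service)
}

func main() {
	n := &notifier{lastHealthy: map[string]bool{}, events: make(chan string, 4)}

	n.report("etcd", true) // steady state: one event, resource becomes healthy

	// containerd gets SIGHUP, the etcd task dies, and the resource view falls
	// back to unknown; without n.reset("etcd") the next report is suppressed:
	n.report("etcd", true) // no event, so the resource stays unknown

	close(n.events)
	for e := range n.events {
		fmt.Println(e) // prints only the first event
	}
}
```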

@dejoebad

dejoebad commented Sep 5, 2024

I'm experiencing the same issue. In my case, the way to reproduce it is:

  1. using 2 control planes (192.168.11.100, 192.168.11.101) with 1 VIP, 192.168.11.88
  2. both running healthy and ready
  3. deploy php apache
  4. testing with 192.168.11.100:30004, works
  5. testing with 192.168.11.101:30004, works
  6. testing with 192.168.11.88:30004, works
  7. shut down node 192.168.11.101
  8. testing with 192.168.11.100:30004, works
  9. testing with 192.168.11.88:30004, works
  10. testing with 192.168.11.101:30004, NOT working (due to shutdown)
  11. at this point, node 192.168.11.100 is missing its VIP (the VIP has been removed), the controller-manager is unhealthy, and the node never recovers

I suspect the VIP is removed after doing step 10.

@smira
Member

smira commented Sep 5, 2024

I'm experiencing the same issue. In my case, the way to reproduce it is:

Your case is not the same (if you have a problem, please open a separate issue with the relevant support logs attached), and the VIP is not supposed to work for workloads, only for the Kubernetes API server.

jfroy pushed a commit to jfroy/siderolabs-talos that referenced this issue Sep 6, 2024
Otherwise the internal code might assume that the service is still
running and healthy, never issuing a health change event.

Fixes siderolabs#9271

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
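Conceptually, and only as a reading of the commit message above, this corresponds to the reset step in the illustrative sketch earlier: when the task is torn down, the remembered health state has to be invalidated so that a fresh health change event is emitted and the etcd service does not stay stuck in the unknown state shown in the resource dump above.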
smira added a commit to smira/talos that referenced this issue Sep 25, 2024
Otherwise the internal code might assume that the service is still
running and healthy, never issuing a health change event.

Fixes siderolabs#9271

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit 07b9179)
github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2024