Power outage while on vacation for 3 weeks, upon returning one node won't join cluster #9516

evanrich · 2024-10-17T05:26:48Z

I had a power outage in my area while away on vacation for 3 weeks. I had UPSes powering things, but the outage lasted longer than their runtime. Upon returning from my trip, 3 of my 4 nodes were working and one was not. The working nodes were 2 of the 3 control planes and the 1 worker node.

I found out the boot SSD had a number of UREs, so I did a machine reset, swapped the SSD, and let Talos reinstall on the node. After the install finished, it claims that all services are started, but the node will not switch to "READY" because the CNI is not starting. In the Talos dashboard logs, I see this:

 user: warning: [2024-10-17T04:39:24.67523314Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
 user: warning: [2024-10-17T04:39:25.51923214Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Namespace/kubelet-serving-cert-approver: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:27.23112814Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Namespace/kubelet-serving-cert-approver: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:29.55689914Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Namespace/kubelet-serving-cert-approver: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:33.14382214Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Namespace/kubelet-serving-cert-approver: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:37.72383814Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.NodeApplyController", "error": "1 error(s) occurred:\n\terror getting node: Get \"https://localhost:6443/api/v1/nodes/node3?timeout=30s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:38.02260514Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "error": "error creating mapping for object /v1/Namespace/kubelet-serving-cert-approver: Get \"https://localhost:6443/api?timeout=32s\": dial tcp [::1]:6443: connect: connection refused"}
 user: warning: [2024-10-17T04:39:39.42656514Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:39:40.29938914Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
 user: warning: [2024-10-17T04:39:54.42724914Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:39:56.27560114Z]: [talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
 user: warning: [2024-10-17T04:40:09.42657414Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:40:24.42571914Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:40:39.42633414Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:40:54.42547914Z]: [talos] task startAllServices (1/1): service "ext-nut-client" to be "up"
 user: warning: [2024-10-17T04:41:01.54514214Z]: [talos] apply config request: mode auto(no_reboot)
 user: warning: [2024-10-17T04:41:01.61720814Z]: [talos] service[ext-nut-client](Preparing): Running pre state
 user: warning: [2024-10-17T04:41:01.69903614Z]: [talos] service[ext-nut-client](Preparing): Creating service runner
 user: warning: [2024-10-17T04:41:01.98927514Z]: [talos] service[ext-nut-client](Running): Started task ext-nut-client (PID 4481) for container ext-nut-client
 user: warning: [2024-10-17T04:41:02.12069114Z]: [talos] task startAllServices (1/1): done, 2m37.696251735s
 user: warning: [2024-10-17T04:41:02.19900314Z]: [talos] phase startEverything (16/16): done, 2m37.774623591s
 user: warning: [2024-10-17T04:41:02.27928514Z]: [talos] boot sequence: done: 2m55.848878807s

and the CNI (Cilium) shows this:

time="2024-10-17T05:16:19Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:24Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:29Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:34Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:39Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:44Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:49Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:54Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:16:59Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:17:04Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:17:09Z" level=info msg="Establishing connection to apiserver" host="https://localhost:7445" subsys=k8s-client
time="2024-10-17T05:17:09Z" level=error msg="Unable to contact k8s api-server" error="Get \"https://localhost:7445/api/v1/namespaces/kube-system\": dial tcp [::1]:7445: connect: connection refused" ipAddr="https://localhost:7445" subsys=k8s-client
2024/10/17 05:17:09 ERROR Start hook failed function="client.(*compositeClientset).onStart (k8s-client)" error="Get \"https://localhost:7445/api/v1/namespaces/kube-system\": dial tcp [::1]:7445: connect: connection refused"
2024/10/17 05:17:09 ERROR Start failed error="Get \"https://localhost:7445/api/v1/namespaces/kube-system\": dial tcp [::1]:7445: connect: connection refused" duration=1m0.044401122s
2024/10/17 05:17:09 INFO Stopping
Error: Build config failed: failed to start: Get "https://localhost:7445/api/v1/namespaces/kube-system": dial tcp [::1]:7445: connect: connection refused
2024-10-17T05:17:09.188288460Z

I have tried resetting the machine multiple times, but nothing gets it working. I followed the troubleshooting docs at https://www.talos.dev/v1.8/introduction/troubleshooting/, but all they say is to ensure the CSR for the node is approved, which it is, since I use an auto-approver for new nodes.

Some other things I've tried:

removing the node completely from k8s, resetting it, and then letting it join again
fully wiping the SSD and reimaging from scratch
killing cilium pods and letting it boot

I can't find any other settings to check for why Cilium won't start on this machine, but I believe it's something related to the API server. Is it possible that something is out of sync, since the other 3 nodes have been running the entire time since they came back but this node wasn't?

FWIW, the first time I tried resetting the machine, it failed while trying to leave the etcd cluster, but after reimaging it seemed to be OK.

Additional info:

 talosctl get etcdspecs -n 192.168.5.10,192.168.5.11,192.168.5.12
NODE           NAMESPACE   TYPE       ID     VERSION   NAME    ADVERTISEDADDRESSES   LISTENPEERADDRESSES   LISTENCLIENTADDRESSES
192.168.5.10   etcd        EtcdSpec   etcd   1         node1   ["192.168.5.10"]      ["0.0.0.0"]           ["0.0.0.0"]
192.168.5.11   etcd        EtcdSpec   etcd   1         node2   ["192.168.5.11"]      ["0.0.0.0"]           ["0.0.0.0"]
192.168.5.12   etcd        EtcdSpec   etcd   1         node3   ["192.168.5.12"]      ["0.0.0.0"]           ["0.0.0.0"]

evan@DESKTOP-MCOE11O:~$ talosctl get etcdconfigs -n 192.168.5.10,192.168.5.11,192.168.5.12
NODE           NAMESPACE   TYPE         ID     VERSION   IMAGE
192.168.5.10   etcd        EtcdConfig   etcd   1         gcr.io/etcd-development/etcd:v3.5.13
192.168.5.11   etcd        EtcdConfig   etcd   1         gcr.io/etcd-development/etcd:v3.5.13
192.168.5.12   etcd        EtcdConfig   etcd   1         gcr.io/etcd-development/etcd:v3.5.13

Talos version: 1.7.3
K8s version: 1.30.1

Thanks!

evanrich · 2024-10-17T05:31:11Z

Edit: FWIW, I was missing this in the machine config:

        kubePrism:
            enabled: true
            port: 7445
        hostDNS:
            enabled: true
            forwardKubeDNSToHost: true
            resolveMemberNames: true

Adding this fixed it.
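
For anyone landing here with the same symptoms, this is a minimal sketch of where those keys sit in the machine config, assuming the standard Talos v1alpha1 layout in which both blocks live under machine.features. It likely explains the Cilium log above: Cilium was dialing https://localhost:7445, which is the KubePrism port, and with KubePrism not enabled on the reinstalled node nothing was listening there. The file name kubeprism-patch.yaml is hypothetical, and the values are just the ones from the snippet above.

```yaml
# kubeprism-patch.yaml (hypothetical file name)
# Assumes the standard layout: both features live under machine.features.
machine:
  features:
    kubePrism:
      enabled: true   # local load-balanced API server endpoint on each node
      port: 7445      # must match the port the CNI is configured to dial
    hostDNS:
      enabled: true
      forwardKubeDNSToHost: true
      resolveMemberNames: true
```

A fragment like this can be merged into the node's config as a machine config patch (for example via talosctl patch machineconfig; check talosctl patch --help for the exact flags on your version). It is also worth comparing the machine config of a working node against the reinstalled one, since a missing block like this is easy to overlook.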
