Replies: 1 comment
-
Edit: fwiw, I was missing in the Machine config:
adding this fixed it
-
I had a power outage in my area while away on vacation for 3 weeks. I had UPS backup to power things, but the outage lasted longer than the runtime of the UPSs. Upon returning from my trip, 3 of my 4 nodes were working and one was not. The working nodes were 2 of the 3 control planes and the 1 worker node.
I found out the boot SSD had a number of UREs (unrecoverable read errors), so I did a machine reset, swapped the SSD, and let Talos reinstall on the node. After finishing, it claims that all services are started, but the node will not switch to "READY" because the CNI is not starting. In the Talos dashboard logs, I see this:
and the CNI (Cilium) shows this:
I have tried resetting the machine multiple times, but nothing seems to get it to work. I tried following the troubleshooting docs at https://www.talos.dev/v1.8/introduction/troubleshooting/, but all they say is to ensure the CSR for the node is approved, which it is, since I use an auto-approver for new nodes.
Some other things I've tried:
I can't find any other settings to check as to why Cilium won't start on this machine, but I believe it's something related to the API server. Is it possible that something is out of sync, since the other 3 nodes were running the entire time after they came back but this node wasn't?
FWIW, when I tried resetting the machine the first time, it failed while trying to leave the etcd cluster, but after reimaging it seemed to be OK.
Additional info:
Talos version: 1.7.3
K8s version: 1.30.1
Thanks!