Unable to add node to cluster after cluster-reset #6186
For the sake of completeness... I create a config file on all the nodes: /etc/rancher/k3s/config.yaml

To install the first node: … All good at this point. Shutdown both nodes with … On the second node I do a cluster reset. On the first server I do a … I wait for a minute until the cluster looks happy on the second node... Pods are restarted. Everything looks happy. I remove the second node with … Then I try to bootstrap the second node back into the cluster (making sure the config file is recreated) and then: … Host names are redacted; DNS works for the host names.
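Not from the report itself, but a hedged sketch of what such a config file might contain for a two-server embedded-etcd setup; the token, SAN, and server URL below are placeholders, not values from this issue:

```sh
# Hypothetical /etc/rancher/k3s/config.yaml for the first server; all values are placeholders.
cat > /etc/rancher/k3s/config.yaml <<'EOF'
token: my-shared-secret            # assumed shared cluster token
cluster-init: true                 # first server bootstraps embedded etcd
tls-san:
  - k3s.example.internal           # hypothetical extra SAN for the API certificate
EOF

# On the joining server the same file would point at the first one instead of cluster-init:
#   token: my-shared-secret
#   server: https://firstnode:6443
```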
Output of …
Okay, I worked around the issue... But wondering if it's still something that should be looked at... Even though I deleted the node using kubectl, clearly there is still some history of the node in the cluster somewhere... I was able to add the first node back into the cluster by manually setting a different …
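If the setting redacted above was the node name (an assumption on my part, it is not stated in the comment), the override would look roughly like this:

```sh
# Hypothetical: give the rejoining server a fresh identity by overriding node-name.
# The name itself is a placeholder; it only has to differ from the old one.
cat >> /etc/rancher/k3s/config.yaml <<'EOF'
node-name: firstnode-rejoined
EOF
```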
When you delete the first node from the cluster, does it actually finish deleting - are both the node and the node password secret gone from the cluster?
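A quick way to check both, assuming the usual k3s convention of storing node passwords as kube-system secrets named `<node>.node-password.k3s` (the node name below is a placeholder):

```sh
# "firstnode" is a placeholder for the deleted node's name
kubectl get node firstnode
kubectl get secret -n kube-system firstnode.node-password.k3s
# if the secret lingers after the node is gone, it can be removed manually
kubectl delete secret -n kube-system firstnode.node-password.k3s
```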
I am not totally sure how to tell if it fully finished deleting... But the node and the secret are indeed no longer there. The only reference I see is:
There must be some sort of delay in clearing things out when you delete the node, but only sometimes. I've waited 5-10 minutes and it still has issues with the same name.
I think I'm good here, I will close for now unless you want me to do any more digging on it.
I tried from scratch again and I'm still having the same issue. I don't know if it's a timing thing.
Okay, this is interesting... After doing the cluster reset on one of the nodes, bringing it back up, and then trying to add a node back (one that had been completely uninstalled and then re-installed), the first node (the one that had the reset done) shows this in …

And then on the node I added back (totally uninstalled and re-installed), when I do a get nodes: …
Is etcd getting confused? Shouldn't etcd have a single master after the cluster-reset?
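One way to sanity-check the membership is to query etcd directly; this sketch assumes the standard k3s certificate locations under /var/lib/rancher/k3s/server/tls/etcd and a separately installed etcdctl (k3s does not ship one):

```sh
# etcdctl must be installed separately; paths are the default k3s ones
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  member list
# immediately after a cluster-reset this should list exactly one member
```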
Tried with v1.22.15+k3s1 and v1.24.6+k3s1 and they both seem to do the same thing. It's like it performs the initial sync of the new node with an old copy of the etcd database and they never come back into sync.
This sounds very similar to etcd-io/etcd#14009 - unfortunately I haven't been able to reproduce it without involving Kubernetes, so upstream has had a hard time addressing it.
Man, I can reproduce it nearly every time. I've done it no less than 30 times and it happens in 28 of them. I will try restoring a snapshot. I guess it doesn't matter if it's not etcd by itself.
Okay, I verified I can cluster-reset, start, snapshot, cluster-reset with snapshot, and get my cluster going again.
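For anyone following along, a rough sketch of that sequence with the k3s flags involved (the snapshot name and restore path are illustrative; the real snapshot file name includes the node name and a timestamp):

```sh
# on the surviving server: collapse etcd back to a single member
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s

# once the cluster looks healthy, take an on-demand snapshot
k3s etcd-snapshot save --name pre-rejoin

# if a later reset is needed, restore from that snapshot
systemctl stop k3s
k3s server --cluster-reset \
  --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-rejoin-<node>-<timestamp>
systemctl start k3s
```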
Environmental Info:
K3s Version:
k3s version v1.24.5-rc1+k3s1 (fb823c8)
go version go1.18.6
Node(s) CPU architecture, OS, and Version:
Ubuntu 18.04
Linux firstnode 4.15.0-193-generic #204-Ubuntu SMP Fri Aug 26 19:20:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
For now, just 2 servers (which I realize isn't ideal with etcd but I think it might be unrelated)
Describe the bug:
Steps To Reproduce:
- `--cluster-reset` on second node
- `kubectl` to delete the failed node (which sits in NotReady state)
- `k3s-uninstall.sh` on first node to start clean
- `k3s server --server <second node>`
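Roughly, as commands (host names are placeholders; the exact install/join invocation on the first node is not shown in the report):

```sh
# on the second (surviving) server: reset etcd membership
k3s server --cluster-reset

# from a working kubeconfig: remove the failed first node
kubectl delete node firstnode                 # placeholder name

# on the first node: wipe the old install, reinstall k3s, then rejoin
k3s-uninstall.sh
k3s server --server https://secondnode:6443   # placeholder URL; the cluster token is also required
```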
It never seems to be able to bring the first node back into the cluster. I am trialing k3s, simulating a server failure, and trying to rebuild the cluster. The node just sits in `NotReady`.
The logs just keep saying over and over:
And eventually this error also starts to appear with the previous ones:
A couple more that showed up:
Expected behavior:
I should be able to add the node back into the cluster.
Actual behavior:
Errors printed above