k3s node fails to rejoin HA cluster when IP changes #11778
Your nodes, and ESPECIALLY your server nodes, must have fixed IP addresses. Whether you use static IPs, or DHCP reservations, or something else, doesn't matter. They just can't have IPs that change randomly or whenever the node restarts. If the IPs do change, you need to delete the node from the cluster and rejoin it. If all members have new IPs, you'd need to do a cluster-reset on the first node, and then rejoin the others.
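[A minimal sketch of that recovery flow, assuming systemd-managed installs and the node names from the logs further down. The db-directory wipe before rejoining is an assumption based on how embedded-etcd members usually rejoin, not something stated in the comment above.]

# Case 1: one node came back with a new IP. On a healthy server,
# drop the stale registration:
kubectl delete node r01-k3s02

# Then on the affected node (assumes its --server setting from install
# is still in its config):
systemctl stop k3s
rm -rf /var/lib/rancher/k3s/server/db   # assumed: clear stale local etcd data
systemctl start k3s

# Case 2: every member's IP changed. On the first server:
systemctl stop k3s
k3s server --cluster-reset
systemctl start k3s
# ...then repeat the Case 1 steps on each remaining server so they rejoin.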
Is this documented anywhere? The only documentation I've found for k8s or k3s says that the server address given to the nodes during init has to be fixed, and that a load balancer is recommended; fixed addresses for the nodes themselves don't seem to be mentioned anywhere. Online discussions all over the place have people saying opposite things, some saying as you did that they must be unique, some saying that they don't need to be. Mentioning this in the setup/config/requirements for the project would really be helpful.
I guess not, but given that it's only come up a handful of times, I guess most folks just tend to have suitable environments? I am not sure why someone would want a server whose address changes randomly. Node names and IPs definitely need to be unique. They should also be static.
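[A hedged example of what "static" can look like on these Ubuntu 22.04 nodes, using netplan. The interface name, address, and gateway below are made-up placeholders, not values from this thread.]

# Hypothetical /etc/netplan/01-k3s-static.yaml; apply with `netplan apply`.
network:
  version: 2
  ethernets:
    eth0:                            # assumed interface name
      dhcp4: false
      addresses: [192.168.115.7/24]
      routes:
        - to: default
          via: 192.168.115.1         # assumed gateway
      nameservers:
        addresses: [192.168.115.1]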
I'll file a doc request then.
Maybe in the issue tracker but seems common enough elsewhere.
If you don't mind me getting a little flippant: Because we've had DNS for over 40 years now, DHCP for over 30 years, and I don't care (or understand why I should) what addresses backend servers have when I only connect to them via their names or the name of the load balancer in front of them. The environment I described in the original post is a good enough reason for me. Every week I spin up clusters from 3 to 9 nodes and usually have 2 or 3 of them going at any one time. Not having DNS "working" here and requiring static IPs means I need to:
That's a fair bit of tedious busy work that would be entirely mitigated if the cluster could just be configured to use names instead of addresses. My DNS is reliable, and if it's not, that seems like my problem to handle.
Of course. Thanks for hearing out the rant at least.
Environmental Info:
K3s Version:
k3s version v1.30.6+k3s1 (1829eaa)
go version go1.22.8
Node(s) CPU architecture, OS, and Version:
Linux r01-k3s01 5.15.0-92-generic #102-Ubuntu SMP Wed Jan 10 09:33:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 nodes running on Ubuntu 22.04.2 LTS. Nodes get addresses via DHCP. During install, the first node is initialized with

curl -sfL https://get.k3s.io | sh -s - server --cluster-init --tls-san fqdn.of.haproxy.lb

and the remaining nodes are added via

curl -sfL https://get.k3s.io | sh -s server --server https://fqdn.of.haproxy.lb

fqdn.of.haproxy.lb is an haproxy instance running in a separate VM doing round-robin balancing to all nodes on 6443, using their hostnames.
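[A minimal sketch of the kind of haproxy config described above. The node hostnames are taken from the logs below; the health-check options are assumptions.]

# Hypothetical haproxy.cfg fragment on the fqdn.of.haproxy.lb VM.
frontend k3s-api
    bind *:6443
    mode tcp
    default_backend k3s-servers

backend k3s-servers
    mode tcp
    balance roundrobin
    server r01-k3s01 r01-k3s01:6443 check
    server r01-k3s02 r01-k3s02:6443 check
    server r01-k3s03 r01-k3s03:6443 check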
Describe the bug:
If one of the nodes is shut down and has a different IP when it comes back up, it fails to rejoin the cluster. The journalctl log on the failing node is filled with repeated messages stating:

Failed to test data store connection: this server is a not a member of the etcd cluster. Found [r01-k3s03-6d718e3e=https://192.168.115.8:2380 r01-k3s02-c5f4597e=https://192.168.115.7:2380 r01-k3s01-94ae9d73=https://192.168.115.129:2380], expect: r01-k3s02-c5f4597e=https://192.168.115.9:2380

The node in question here is k3s02, and as can be seen, its IP has changed from 192.168.115.7 to 192.168.115.9.
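[For reference, the member list that k3s is comparing against can be inspected with etcdctl pointed at the embedded etcd. This is a sketch: etcdctl is not shipped with k3s and must be installed separately, and the cert paths below are the usual k3s defaults, not values taken from this issue.]

# Run on any healthy server node; prints etcd's view of member names and peer URLs.
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key

A stale peer URL in this output corresponds to the "Found [...], expect: ..." mismatch in the log above.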
Steps To Reproduce:
Install a cluster as described, shut down one node, and bring it up with a different address, for example by changing its DHCP reservation.
Expected behavior:
The node should successfully rejoin the cluster. My understanding is that node IPs can be freely changed in this way and static IPs / DHCP reservations are not needed.
Actual behavior:
The node did not rejoin the cluster.
Additional context / logs:
I routinely bring up and shut down clusters using a set of in-house IaaS scripts built on Terraform, Ansible, etc. This has been working fine for a few years, but I never had occasion to test shutting down and bringing any nodes back up before now.
The number of nodes in each cluster and the number of clusters in total is variable. If this is user error and not an actual bug, I would really like to understand how to make this work without having to set up worst-case-number-of-clusters x worst-case-number-of-nodes-per-cluster DHCP reservations and statically assigned MACs for all of them.
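[For a sense of scale of the workaround being complained about: one DHCP reservation, here in dnsmasq syntax as a hedged example. The MAC address is made up, and one such line would be needed per node, per cluster; other DHCP servers have their own equivalents.]

# Hypothetical dnsmasq reservation: MAC -> hostname -> fixed address.
dhcp-host=52:54:00:aa:bb:07,r01-k3s02,192.168.115.7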