Unable to change cluster.endpoint without downtime #9609

NikolaiBessonov · 2024-10-31T14:35:25Z

Bug Report

Unable to authenticate after changing cluster.controlPlane.endpoint in machineconfig

Description

After updating cluster.controlPlane.endpoint to point to controlplane-3, kube-apiserver fails to authenticate requests. The issue seems to be related to the --service-account-issuer argument, which would need to contain the new VIP address. It is likely occurring because the nodes are still referencing the old load balancer endpoint, resulting in a mismatch. However, there is no way to set an additional value for this parameter.

Steps to Reproduce

1.	Set up an external load balancer balancing traffic between three control plane nodes on port 6443.
2.	Point cluster.controlPlane.endpoint to this load balancer.
3.	Add a VIP and assign it to an interface via **network.interfaces.interface[0].vip.ip**.
4.	Change **cluster.controlPlane.endpoint** on one of the nodes.
5.	Check the kube-apiserver logs on the node where you changed it.

Logs

kube-apiserver on the node where cluster.endpoint was changed:
authentication.go:73] "Unable to authenticate the request" err="invalid bearer token"

Environment

Talos version: v1.7.5
Kubernetes version: 1.30.4
Platform: linux/amd64

Question

Until you fix it, is there any way to change cluster.endpoint?


smira · 2024-10-31T14:58:51Z

I'm not quite sure what the question is: service-account-issuer is equal to the controlplane endpoint in Talos.

By downtime I guess you mean that service account tokens stop working? (these are used only by workload pods)

As no communication in the cluster between components will be broken if you change the endpoint, a simple pod restart for those using service accounts will be sufficient.
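To make the token side of this concrete, here is a minimal Python sketch of how the `iss` claim of a service account token can be inspected. It builds a fake, unsigned token purely for demonstration. The two endpoint URLs and the helper names (`jwt_issuer`, `make_unsigned_token`) are assumptions for illustration and do not come from this issue.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the `iss` claim of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["iss"]


def make_unsigned_token(claims: dict) -> str:
    """Build a fake, unsigned token for demonstration only (hypothetical helper)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    # header.payload.signature, with an empty signature segment
    return ".".join([b64({"alg": "none"}), b64(claims), ""])


# Hypothetical endpoints, not taken from the issue.
OLD_ENDPOINT = "https://old-lb.example.internal:6443"
NEW_ENDPOINT = "https://vip.example.internal:6443"

# A token minted while the old endpoint was the issuer.
token = make_unsigned_token(
    {"iss": OLD_ENDPOINT, "sub": "system:serviceaccount:kube-system:cilium"}
)

# After the endpoint change, the issuer in the token no longer matches.
issuer_matches = jwt_issuer(token) == NEW_ENDPOINT
print(jwt_issuer(token), issuer_matches)
```

Inside a running pod, the real token is normally mounted at `/var/run/secrets/kubernetes.io/serviceaccount/token`, so the same `jwt_issuer` function could be pointed at that file's contents to check which issuer a workload is still presenting.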

NikolaiBessonov · 2024-11-01T07:21:08Z

@smira not quite that.

For example, if I put the new address (VIP) in the endpoint on one of the control plane nodes, the API server stops working because it can't authenticate. I think that if I change the endpoint on all my control plane nodes, some components will require a restart (such as the Cilium CNI components, which also can't authenticate against the control plane with the new endpoint), and this leads to downtime until all components are restarted.

But if an additional "service-account-issuer" parameter were supported, where I could specify the additional (old) load balancer on the nodes, the change would go through without any errors or downtime. This is similar to point 8 of "Migration from kubeadm. Step-by-Step guide" in the docs.

smira · 2024-11-01T10:16:40Z

Yes, service-account-issuer might be done better, but I guess it has nothing to do with Cilium.

First of all, Cilium should be configured to use Talos KubePrism endpoint - that's way better than using actual cluster endpoint.

I guess what happens is that changing the endpoint re-rolls the kube-apiserver certificate, and the old/new certificates don't match for you, which can be solved by updating certSANs. Once again, service-account-issuer should be made configurable, but your issue is something else.

NikolaiBessonov · 2024-11-08T09:12:26Z

@smira Sorry for the delayed response. We tested it and applied it to the production cluster. Changing the endpoint (on all three control planes) involves restarting all pods that are connected in any way to the Kubernetes API server, such as Cilium, cert-manager, the ingress controller, etc.
If you have a simple script that finds all serviceAccounts and their pods, it will be faster, but it still takes at least 1 minute of downtime.
Unfortunately, this is the fastest way I found.
I have no more questions. We can close the issue, but it would be better if you added the ability to do this without downtime and without restarting all pods.
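The script mentioned above was not shared, so the following is only a rough Python sketch of one possible approach, not the script that was actually used. It assumes `kubectl` is on the PATH with access to the cluster, and it treats a pod as affected when it mounts a projected `serviceAccountToken` volume. The function names are hypothetical.

```python
import json
import subprocess


def pods_with_sa_tokens(pod_list: dict) -> list[tuple[str, str]]:
    """Return (namespace, name) for pods that mount a projected service account token."""
    result = []
    for pod in pod_list.get("items", []):
        for volume in pod["spec"].get("volumes", []):
            sources = (volume.get("projected") or {}).get("sources", [])
            if any("serviceAccountToken" in source for source in sources):
                meta = pod["metadata"]
                result.append((meta["namespace"], meta["name"]))
                break  # one matching volume is enough for this pod
    return result


def fetch_pods() -> dict:
    """Fetch all pods as JSON via kubectl (assumes kubectl is installed and configured)."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


def delete_commands(pods: list[tuple[str, str]]) -> list[list[str]]:
    """Build one `kubectl delete pod` command per pod without running it."""
    return [
        ["kubectl", "delete", "pod", "--namespace", ns, name, "--wait=false"]
        for ns, name in pods
    ]


# Hand-written sample in the shape of `kubectl get pods -o json` output.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"namespace": "kube-system", "name": "cilium-abcde"},
            "spec": {
                "volumes": [
                    {
                        "name": "kube-api-access-x",
                        "projected": {
                            "sources": [
                                {"serviceAccountToken": {"path": "token"}},
                                {"configMap": {"name": "kube-root-ca.crt"}},
                            ]
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"namespace": "default", "name": "static-web"},
            "spec": {"volumes": [{"name": "data", "emptyDir": {}}]},
        },
    ]
}

affected = pods_with_sa_tokens(SAMPLE_PODS)
print(affected)
```

Against a live cluster, `delete_commands(pods_with_sa_tokens(fetch_pods()))` would yield the commands to review and then run, for example with `subprocess.run(cmd, check=True)`. Deleting pods only restarts workloads that are managed by a controller such as a Deployment or DaemonSet; bare pods would not come back on their own.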

smira · 2024-11-08T10:28:51Z

I think the issue itself makes sense, and what you would like to see is additional values for service-account-issuer, namely the previous controlplane endpoint, so that existing tokens are still considered valid.
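The idea can be sketched in a few lines of Python. kube-apiserver accepts the `--service-account-issuer` flag more than once: the first value is used to issue new tokens and all values are accepted during validation. The helper names and endpoint URLs below are hypothetical, and the sketch does not show any Talos configuration option, since none existed for this at the time of the thread.

```python
def issuer_flags(current_endpoint: str, previous_endpoints: list[str]) -> list[str]:
    """Build kube-apiserver flags with the current endpoint first.

    The first issuer is the one used for newly issued tokens; the
    remaining ones are only accepted when validating existing tokens.
    """
    issuers = [current_endpoint, *previous_endpoints]
    return [f"--service-account-issuer={issuer}" for issuer in issuers]


def token_issuer_accepted(token_issuer: str, accepted_issuers: list[str]) -> bool:
    """Simplified model of the issuer check: the token's `iss` must be in the list."""
    return token_issuer in accepted_issuers


# Hypothetical endpoints, not taken from the issue.
OLD_ENDPOINT = "https://old-lb.example.internal:6443"
NEW_ENDPOINT = "https://vip.example.internal:6443"

flags = issuer_flags(NEW_ENDPOINT, [OLD_ENDPOINT])
print(flags)

# With only the new issuer configured, a token minted under the old one is rejected.
print(token_issuer_accepted(OLD_ENDPOINT, [NEW_ENDPOINT]))
# With the previous endpoint kept as an extra issuer, it stays valid.
print(token_issuer_accepted(OLD_ENDPOINT, [NEW_ENDPOINT, OLD_ENDPOINT]))
```

Keeping the previous endpoint in the accepted list would let workloads continue with their existing tokens until those tokens are rotated, which is the no-restart behavior requested in this issue.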

smira mentioned this issue Nov 8, 2024

Talos 1.9 Release Checklist ✅ #9249

Closed

smira mentioned this issue Dec 9, 2024

Talos 1.10 Release Checklist #9899

Open
