Unable to change cluster.endpoint without downtime #9609

Open
Tracked by #9899
NikolaiBessonov opened this issue Oct 31, 2024 · 5 comments
@NikolaiBessonov

Bug Report

Unable to authenticate after changing cluster.controlPlane.endpoint in machineconfig

Description

After updating cluster.controlPlane.endpoint to point to controlplane-3, authentication against kube-apiserver fails. The issue seems to be related to the absence of a --service-account-issuer argument containing the new VIP address. This likely happens because the nodes are still referencing the old load balancer endpoint, resulting in a mismatch. There is currently no way to set an additional value for this parameter.

Steps to Reproduce

1. Set up an external load balancer that balances traffic between three control plane nodes on port 6443.
2. Point cluster.controlPlane.endpoint to this load balancer.
3. Add a VIP and assign it to an interface via **network.interfaces.interface[0].vip.ip**.
4. Change **cluster.controlPlane.endpoint** on one of the nodes.
5. Check the kube-apiserver logs on the node where you changed it.

Logs

kube-apiserver on the node where cluster.endpoint was changed:
authentication.go:73] "Unable to authenticate the request" err="invalid bearer token"

Environment

  • Talos version: v1.7.5
  • Kubernetes version: 1.30.4
  • Platform: linux/amd64

Question

Until you fix it, is there any way to change cluster.endpoint?

@smira
Member

smira commented Oct 31, 2024

I'm not quite sure what the question is: service-account-issuer is equal to the controlplane endpoint in Talos.

By downtime I guess you mean that service account tokens stop working? (these are used only by workload pods)

Since changing the endpoint does not break any communication between cluster components, a simple restart of the pods that use service accounts will be sufficient.
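To illustrate why previously issued service account tokens stop working after the endpoint changes, here is a minimal Python sketch. It assumes the token is a standard JWT and only models the issuer comparison; a real kube-apiserver also verifies the signature, audience, and expiry. The endpoint URLs and the helper `make_unsigned_token` are hypothetical and exist only for the demonstration.

```python
import base64
import json


def token_issuer(jwt: str) -> str:
    """Return the `iss` claim of a JWT without verifying its signature."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it first.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))["iss"]


def issuer_accepted(jwt: str, accepted_issuers: list[str]) -> bool:
    """Model only the issuer check: the token's `iss` must be a known issuer."""
    return token_issuer(jwt) in accepted_issuers


def make_unsigned_token(issuer: str) -> str:
    """Build a fake, unsigned token (hypothetical helper, illustration only)."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc({'iss': issuer})}."


# Hypothetical endpoints: the old load balancer and the new VIP.
OLD_ENDPOINT = "https://lb.example.internal:6443"
NEW_ENDPOINT = "https://10.0.0.100:6443"

# A token issued before the change still carries the old issuer.
old_token = make_unsigned_token(OLD_ENDPOINT)
print(issuer_accepted(old_token, [NEW_ENDPOINT]))  # False
```

Running `token_issuer` on a token copied from a pod is one way to see which endpoint that token was issued under.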

@NikolaiBessonov
Author

@smira Not quite.

For example, if I set the new address (VIP) as the endpoint on one of the controlplane nodes, the API server stops working because it can't authenticate. I think that if I change the endpoint on all my controlplane nodes, some components will require a restart (such as the Cilium CNI components, which also can't authenticate against the controlplane with the new endpoint), and that leads to downtime until all components are restarted.

But if there were support for an additional "service-account-issuer" parameter, where I could specify the additional (old) load balancer on the nodes, the change would go through without any errors or downtime. This is similar to point 8 of "Migration from kubeadm. Step-by-Step guide" in the docs.

@smira
Member

smira commented Nov 1, 2024

Yes, service-account-issuer might be done better, but I guess it has nothing to do with Cilium.

First of all, Cilium should be configured to use the Talos KubePrism endpoint; that's way better than using the actual cluster endpoint.

I guess what happens is that changing the endpoint re-rolls the kube-apiserver certificate, and the old and new certificates don't match for you, which can be solved by updating certSANs. Once again, service-account-issuer should be made configurable, but your issue is something else.
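As a rough way to check the certSANs hypothesis, the sketch below tests whether an endpoint's host appears in a list of subject alternative names. It is a simplification: wildcard SANs are ignored, and the SAN list has to be obtained separately, for example by inspecting the served certificate. The function names and sample values are hypothetical, and the thread does not confirm that a SAN mismatch is the actual cause.

```python
import ipaddress
from urllib.parse import urlsplit


def endpoint_host(endpoint: str) -> str:
    """Extract the host part of an endpoint URL such as https://host:6443."""
    host = urlsplit(endpoint).hostname
    if host is None:
        raise ValueError(f"no host found in endpoint: {endpoint!r}")
    return host


def covered_by_sans(endpoint: str, sans: list[str]) -> bool:
    """Return True if the endpoint host matches one of the given SAN entries."""
    host = endpoint_host(endpoint)
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP address: compare as a case-insensitive DNS name.
        return host.lower() in {san.lower() for san in sans}
    for san in sans:
        try:
            if ipaddress.ip_address(san) == ip:
                return True
        except ValueError:
            continue  # This SAN is a DNS name, not an IP address.
    return False


# Hypothetical SAN list that only covers the old load balancer.
sans = ["lb.example.internal", "10.0.0.11"]
print(covered_by_sans("https://lb.example.internal:6443", sans))
print(covered_by_sans("https://10.0.0.100:6443", sans))
```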

@NikolaiBessonov
Author

@smira Sorry for the delayed response. We tested it and applied it to a production cluster. Changing the endpoint (on all three control planes) involves restarting all pods that are connected in any way to the Kubernetes API server, such as Cilium, cert-manager, the ingress controller, etc.
If you have a simple script that finds all serviceAccounts and their pods, it will be faster, but it still takes at least 1 minute of downtime.
Unfortunately, this is the fastest way I found.
I have no more questions. We can close the issue, but it would be better if you added the ability to do this without downtime and without restarting all pods.
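The script mentioned above is not included in the thread. A minimal sketch of what such a script might look like, assuming `kubectl` is on the PATH and configured for the cluster, is shown here. It lists pods that do not explicitly opt out of token mounting at the pod level; automount can also be disabled on the ServiceAccount object, which this sketch does not check. The function names are hypothetical.

```python
import json
import subprocess


def pods_using_service_accounts(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, serviceAccount) for pods likely holding a token."""
    result = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        # Skip pods that explicitly disable token mounting at the pod level.
        if spec.get("automountServiceAccountToken") is False:
            continue
        meta = pod["metadata"]
        account = spec.get("serviceAccountName", "default")
        result.append((meta["namespace"], meta["name"], account))
    return result


def fetch_pods() -> dict:
    """Fetch all pods in all namespaces as parsed JSON via kubectl."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `pods_using_service_accounts(fetch_pods())` yields the candidates for a restart. How to restart them (deleting pods one by one, or a rollout restart of the owning workload) depends on the cluster and is left out of the sketch.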

@smira
Member

smira commented Nov 8, 2024

I think the issue itself makes sense: what you would like to see is support for additional service-account-issuer values (that is, the previous controlplane endpoint), so that existing tokens are still considered valid.
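For context, upstream kube-apiserver allows `--service-account-issuer` to be specified more than once: the first value is used to issue new tokens, and all values are accepted when validating. The Python sketch below only builds such a flag list; how (or whether) Talos would expose this in the machine config is not stated in the thread, and the function name and URLs are hypothetical.

```python
def issuer_flags(current: str, previous: list[str]) -> list[str]:
    """Return repeated --service-account-issuer flags, current issuer first."""
    ordered = [current]
    for issuer in previous:
        # Keep each issuer once, preserving the order in which it was given.
        if issuer not in ordered:
            ordered.append(issuer)
    return [f"--service-account-issuer={issuer}" for issuer in ordered]


# Hypothetical endpoints: new VIP first, old load balancer kept for validation.
for flag in issuer_flags(
    "https://10.0.0.100:6443",
    ["https://lb.example.internal:6443"],
):
    print(flag)
```

Keeping the new endpoint first matters in this model: newly issued tokens would carry the new issuer, while tokens carrying the old one would remain acceptable until they expire or are rotated.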
