[EKS] [bug] Removed kube-apiservers return 401 Unauthorized instead of closing connection #1810
Around the time of the EKS 1.23 release, we started noticing that EKS scales its kube-apiservers out and in more aggressively; they are being replaced more frequently. We also noticed that whenever a kube-apiserver is removed (it no longer appears in `kubectl get endpoints kubernetes -n default`), it does not close its existing connections. Instead, whenever a client makes a request to the removed kube-apiserver over an existing connection, the kube-apiserver returns a 401 Unauthorized. This appears to happen every time a kube-apiserver is scaled down. A 401 Unauthorized does not necessarily prompt applications to re-establish their connection to the kube-apiserver; they may instead conclude that certain API resources are unavailable and act accordingly. This happens, for example, with the latest release of Cilium. I believe that whenever a kube-apiserver is removed as an endpoint, it should also immediately close all of its client connections, forcing the clients to establish new connections.

Related issue: cilium/cilium#20915
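Until the control plane closes connections on removal, one client-side mitigation is to drop pooled keep-alive connections when an unexpected 401 arrives, so the retry dials a fresh TCP connection that the load balancer can route to a live kube-apiserver. Below is a minimal sketch using only Go's standard net/http package; the endpoint URL, retry budget, and backoff are illustrative assumptions, not anything from this issue, and a real call against the kube-apiserver would also need TLS configuration and credentials.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// doWithReconnect performs an idempotent request (e.g. a bodyless GET) and,
// on a 401 from what may be a drained kube-apiserver, drops pooled
// keep-alive connections so the retry dials a fresh TCP connection, which
// the load balancer should route to a live instance. The retry count and
// backoff are illustrative, not tuned.
func doWithReconnect(client *http.Client, req *http.Request) (*http.Response, error) {
	const maxAttempts = 3
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusUnauthorized {
			return resp, nil
		}
		resp.Body.Close()
		// Force the next attempt onto a new connection instead of reusing
		// the one pinned to the removed kube-apiserver.
		client.CloseIdleConnections()
		time.Sleep(time.Duration(attempt) * 500 * time.Millisecond)
	}
	return nil, fmt.Errorf("still receiving 401 Unauthorized after %d attempts", maxAttempts)
}

func main() {
	// Placeholder endpoint; substitute the kube-apiserver URL and add
	// auth (bearer token or client certificate) for a real cluster.
	req, err := http.NewRequest(http.MethodGet, "https://example.com/healthz", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := doWithReconnect(http.DefaultClient, req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```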
Comments

I wanted to provide a small update. It seems that since about a week ago, kube-apiservers that are in the process of being removed still accept client requests even though their connection to etcd is already gone. I am also seeing in-flight requests not being handled gracefully, which leads to clients reporting errors. Clients can work around this by retrying failed requests, but it strikes me as "not nice" that the kube-apiserver keeps accepting requests it cannot possibly handle. As a result, we are now forced to add more and more retry logic to the e2e testing of our EKS clusters. Retries are good practice regardless, but I think EKS can improve here. This, too, only started happening around the time of the EKS 1.23 release.

It would be nice to get an explanation from the EKS team on whether this is something that must be expected when dealing with the EKS kube-apiserver. It really is a strange situation to have to retry 401s to prevent availability issues with a rotating/scaling API.

Summary: A 401 would typically not be a retryable error, but a user reported hitting it when they scaled up their cluster, and aws/containers-roadmap#1810 seems to suggest retrying as a workaround. The downside of retrying on a 401 also seems fairly low. Open to push-back on this, though. Test Plan: Existing BK test coverage of the k8s client.
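As a concrete illustration of a change along those lines, a retryable-status check extended to cover 401 might look like the following Go sketch. The isRetryableStatus helper and its status list are hypothetical, not the actual client code the summary above refers to.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus reports whether an HTTP status code is worth retrying.
// Hypothetical helper, not the project's real code: 401 is normally a
// terminal credentials error, but per aws/containers-roadmap#1810 a drained
// EKS kube-apiserver can return 401 on a stale connection, so it is treated
// as retryable here alongside the usual transient statuses. Callers should
// still bound the number of retries so a genuine auth failure surfaces.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, // 429: throttled
		http.StatusInternalServerError, // 500
		http.StatusBadGateway, // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout, // 504
		http.StatusUnauthorized: // 401: possibly a drained kube-apiserver
		return true
	}
	return false
}

func main() {
	for _, code := range []int{200, 401, 403, 503} {
		fmt.Printf("%d retryable: %v\n", code, isRetryableStatus(code))
	}
}
```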