Pod Termination handling kicks in before the ingress controller has had time to process #106476
/sig provider-aws
/sig cloud-provider
Blog that describes the problem: https://medium.com/flant-com/kubernetes-graceful-shutdown-nginx-php-fpm-d5ab266963c2. I especially like the chart; it illuminates the problem in K8s: there is no kube-proxy<->kubelet communication (and there shouldn't be!). They both react independently to kube-api updates.
Indeed. (The part in their article about SIGTERM is very misleading: it doesn't matter how well your app handles SIGTERM if new traffic still arrives after your app has completed its shutdown; that's the "Practice. Potential problems with graceful shutdown" section of their article.) I totally get that there is not supposed to be a link, don't get me wrong; most of these errors could - and probably should - be solved on the client side, but unfortunately it's very hard for users to understand the behavior and mechanics. They're just used to these kinds of dependencies.
I don't think this can be done "by the means of kube-api", so to speak. A pod can get closer by monitoring its own endpoints, but besides being quite complex, it still would not know whether all kube-proxies have updated the actual load-balancing. Introducing some status for (removed) endpoints that is updated once all kube-proxy instances have updated their load-balancing, I find horrible. Just imagine the cluster-wide sync and all the possible fault cases 😧 IMO this falls into the "service mesh" domain. They monitor actual connections I think (circuit breaking?), but I am not very familiar with service meshes, I must admit.
/cc @kishorj
/assign rikatz
/assign bowei
/triage-accepted
The EndpointSlice API now supports a terminating condition; I wonder if ingress controllers can be updated to leverage this for graceful termination of endpoints?
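Not from the issue, but a minimal sketch of what that could look like, using client-go and the discovery.k8s.io/v1 API: list the EndpointSlices for a Service and skip endpoints whose terminating condition is set. The namespace and service name are placeholders, and a real controller would use an informer rather than polling.

```go
package main

import (
	"context"
	"fmt"

	discoveryv1 "k8s.io/api/discovery/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readyBackends returns the addresses of endpoints that are ready and not
// marked terminating, for the EndpointSlices backing the given Service.
func readyBackends(ctx context.Context, cs kubernetes.Interface, namespace, service string) ([]string, error) {
	slices, err := cs.DiscoveryV1().EndpointSlices(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: discoveryv1.LabelServiceName + "=" + service,
	})
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, slice := range slices.Items {
		for _, ep := range slice.Endpoints {
			terminating := ep.Conditions.Terminating != nil && *ep.Conditions.Terminating
			ready := ep.Conditions.Ready == nil || *ep.Conditions.Ready
			if ready && !terminating {
				addrs = append(addrs, ep.Addresses...)
			}
		}
	}
	return addrs, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	// "default"/"my-service" are hypothetical names for illustration.
	addrs, err := readyBackends(context.TODO(), cs, "default", "my-service")
	if err != nil {
		panic(err)
	}
	fmt.Println("routable backends:", addrs)
}
```

Even if a controller excludes terminating endpoints, this only helps consumers that actually watch EndpointSlices; it does not change when the pod itself receives SIGTERM.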
As @uablrek described, syncing state from all kube-proxies on every change to every endpoint of every service isn't feasible. I'm not sure terminating endpoints helps much, if the core problem is that the Ingress controller or proxy is out-to-lunch or otherwise not getting updates.
I feel that the actual problem is that pods just get their termination signals way too early in this process, right? Not sure what ingress controllers or kube-proxy should do about that. If a container process is down, it doesn't matter whether kube-proxy or the controller stops sending traffic there eventually - all traffic in between these events will still end up hitting an already dead target.
The process is not parallel, it is sequential and async: kill pod -> pod goes not ready -> endpoint controller receives the event that the pod is not ready -> updates the endpoints -> ingress controller/kube-proxy receive an event that the endpoints have changed -> ...
True, but shouldn't the pod be able to stay alive until that event has been propagated and processed (just as a preStop hook kind of does)? This would ensure that the container doesn't even receive SIGTERM until kube-proxy/$controller has had time to remove the endpoint.
How does the pod know that? :)
I see your point, you want to make the process completely synchronous, but that is not how Kubernetes works: https://kubernetes.io/docs/concepts/architecture/controller/#controller-pattern . Oversimplifying, you have a bunch of controllers that sync the current state to the desired state, and they are eventually consistent 🦄
(fixing #106476 (comment)) /triage accepted
I've spent a lot of time debugging this issue in AWS. One particularly problematic use case is with network load balancers: they always take at least 2 minutes to deregister a pod in IP mode, measured from when they receive the request via AWS' API to remove the pod. During this time, the NLB will still send new TCP connections to the pod.

The AWS Load Balancer controller would be able to detect when a pod is actually completely deregistered, and so would other ingress controllers. So I would love it if we had a "termination gate" or similar, which could be added by a controller, similar to readiness gates. Actually, the AWS Load Balancer controller already makes use of readiness gates, because NLB/ALB registration is just as slow. It's weird that Kubernetes doesn't do anything to help with deregistration as well.

If we had a termination gate, this would fix the problem, because we would complete the loop: k8s would say it wants to terminate a pod, controllers would have time to update and stop sending traffic, and then the pod would be told "okay, it's time to clean up".
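For reference, a minimal sketch (not from the issue) of the readiness-gate mechanism the comment refers to, expressed with the k8s.io/api Go types: the pod only becomes Ready once an external controller sets the listed condition to True. The condition type and image below are made-up placeholders, not what the AWS Load Balancer controller actually injects; there is no analogous gate for termination today, which is what a "termination gate" would add.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	spec := corev1.PodSpec{
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:latest", // hypothetical image
		}},
		ReadinessGates: []corev1.PodReadinessGate{
			// The pod stays out of Service endpoints until a controller
			// reports this condition as True (e.g. "registered with the LB").
			{ConditionType: corev1.PodConditionType("example.com/lb-target-registered")},
		},
	}
	fmt.Printf("readiness gates: %+v\n", spec.ReadinessGates)
}
```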
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs, so this bot triages issues and PRs automatically. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
/remove-lifecycle stale
This issue has not been updated in over 1 year and should be re-triaged. For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
The issue is still relevant for many of us. Not being able to shut down correctly without the hack of guessing a sleep duration in preStop is still a problem. There are related issues in the AWS load balancer controller and ingress-nginx repos.
I agree with this.
No such KEP exists, to the best of my knowledge. It would require either a whole new lifecycle phase (deleted but waiting) or perhaps a new preStop mechanism like "wait for upstream LBs", which then raises the question: how do I know which LBs to wait for?
Pods in Kubernetes endpoints are expected to shut down 'gracefully' after receiving SIGTERM - we should keep accepting new connections for a while. This is because Kubernetes updates Service endpoints and sends SIGTERM to pods *in parallel*. See kubernetes/kubernetes#106476 for more detail.
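A minimal Go sketch of the pattern described above (not taken from the issue): on SIGTERM the server keeps accepting connections for a fixed delay so endpoint and load-balancer updates can propagate, then drains in-flight requests. The 15s/30s durations are assumptions - essentially the same guessing game discussed elsewhere in this thread.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Catch SIGTERM (what the kubelet sends on pod termination).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() {
		<-stop
		// Keep accepting new connections while endpoints/LB targets are
		// (hopefully) being removed elsewhere; this delay is a guess.
		log.Println("SIGTERM received, delaying shutdown so updates can propagate")
		time.Sleep(15 * time.Second)

		// Now stop accepting new connections and drain in-flight requests.
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		defer cancel()
		if err := srv.Shutdown(ctx); err != nil {
			log.Printf("shutdown: %v", err)
		}
	}()

	if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
		log.Fatal(err)
	}
}
```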
What happened?
When a pod enters its Terminating state, it receives a signal asking it to kindly finish up its work, after which Kubernetes proceeds with deleting the pod.
At the same time that the pod starts terminating, an ingress controller receives the updated endpoints object and starts removing the pod from the load balancer's list of targets that traffic can be sent to.
Both of these processes - the signal handling at the kubelet level and the removal of the pod's IP from the list of endpoints - are decoupled from one another, and the SIGTERM might be handled before, or at the same time as, the target in the target group is being removed.
As a result, the ingress controller might still send traffic to targets which are still in its endpoints but have already shut down properly. This can result in dropped connections, as the LB keeps trying to send requests to the already-terminated pod, and the LB will in turn reply with 5xx responses.
What did you expect to happen?
No traffic being dropped during shutdown.
The SIGTERM should only be delivered after the ingress controller/LB has removed the target from the target group. Readiness gates work pretty well for pod startup/rollout but lack support during pod deletion.
How can we reproduce it (as minimally and precisely as possible)?
This is a very theoretical problem, which is very hard to reproduce.
Anything else we need to know?
We've been relying on Pod-Graceful-Drain, which unfortunately intercepts and breaks k8s internals.
You can achieve a pretty good result as well by using a `sleep` as a `preStop` hook, but that's not reliable at all - it's just a guessing game whether your traffic will have been drained after X seconds - and it requires statically linked binaries to be mounted in each container, or the existence of `sleep` in the container's operating system.
I also opened up an issue on the Ingress Controllers repo.
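For illustration, a hedged sketch of that `sleep`-in-`preStop` workaround, expressed with the k8s.io/api Go types (recent versions, where the hook type is `LifecycleHandler`). The names, image, and the 30s/60s values are assumptions, and the hook still depends on a `sleep` binary being present in the container image.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Grace period must be longer than the preStop sleep, or the kubelet
	// will kill the container before the delay finishes.
	grace := int64(60)
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "example"},
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example/app:latest", // hypothetical image
				Lifecycle: &corev1.Lifecycle{
					PreStop: &corev1.LifecycleHandler{
						// Delays SIGTERM by a guessed amount, hoping the LB
						// deregisters the target in the meantime.
						Exec: &corev1.ExecAction{Command: []string{"sleep", "30"}},
					},
				},
			}},
		},
	}
	out, err := yaml.Marshal(pod)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```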
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)