Evicted pod does not release controller ConfigMap lock (Bug 1749620) #1874
Comments
Hi @ilanuriel, note that when a pod is flagged for eviction the API will try to remove it, and before removing the pod it will ensure that all pre-conditions for doing so are met. Your report shows that, because your code caused a memory leak, the pod was running out of resources. I'd like to suggest you check the following documentation.
Please let us know if the above information answers your question and whether we can close this ticket, or if there is anything else you would like done here.
Thanks Camila,
Hi @ilanuriel, the following k8s document gives a better idea of how the k8s API works in these scenarios and how you can set up your environment to deal better with this kind of disruption: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ Also, could you please let us know why you believe this is a bug in the operator-framework?
Hi Camila,
Evicted pods not getting deleted/garbage collected (from an unresponsive or partitioned node) is an edge case with the SDK's leader-for-life approach. See the original proposal for more context on this issue: I'm not sure if it's something that we can easily fix, given that the lock is not lease- or expiry-based. Worth discussing further though. At the very least we should have this case clearly documented in the leader pkg godocs.
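For context, here is a minimal sketch of how a leader-for-life lock works under the hood, written against a recent client-go as an assumption about the underlying calls (the function name, lock name, and error handling are illustrative, not the SDK's actual code):

```go
package leaderexample

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader attempts to create the lock ConfigMap with this pod as its
// sole owner. Because the only OwnerReference is the pod, Kubernetes garbage
// collection removes the ConfigMap when (and only when) the pod object is
// deleted -- there is no lease or expiry, which is why an evicted-but-not-deleted
// pod keeps the lock indefinitely.
func tryBecomeLeader(ctx context.Context, c kubernetes.Interface, ns, lockName string, pod *corev1.Pod) (bool, error) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      lockName,
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       pod.Name,
				UID:        pod.UID,
			}},
		},
	}

	_, err := c.CoreV1().ConfigMaps(ns).Create(ctx, cm, metav1.CreateOptions{})
	switch {
	case err == nil:
		return true, nil // we own the lock now
	case apierrors.IsAlreadyExists(err):
		return false, nil // someone else holds the lock; back off and retry
	default:
		return false, fmt.Errorf("creating lock configmap: %w", err)
	}
}
```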
Thank you Haseeb,
Since it blocks the CNV operators' development, I created a bug in BZ for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1749620
@ilanuriel Unfortunately that approach would cause the leader-for-life election guarantees to fail in the scenario where the leader pod is evicted because it is on an unresponsive or unreachable node. In that case, the pod may actually still be running, so it would be invalid to delete its lock. However, there are tradeoffs with either approach, so you should consider carefully which makes the most sense for your use case.
Thinking about this more, it seems there are two ways a pod can be evicted:
For 1) there are two scenarios. To cover 1a, we could have the leader node monitor itself, checking to see if it ever gets evicted. If so, it could stop all of its controllers, delete the lock, and exit. This would allow an operator replica on a non-partitioned node to gain the lock and become the leader. For 1b, if the leader is no longer able to communicate with the API server, it would no longer be able to monitor itself or delete the lock. In this case, it seems like there is still the possibility of a deadlock.

To cover 2, we could look into catching the signal from kubelet that terminates the container, which would allow us to stop all controllers, delete the lock, and exit cleanly.

Note that for 1a and 2, it is important that all of the controllers are stopped before the lock is deleted, to guarantee that another operator replica is not able to run the same controllers simultaneously.

@shawn-hurley @hasbro17 @estroz @camilamacedo86 Anything I'm getting wrong, or any different perspectives or ideas for other approaches? Any ideas for the 1b case?
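A rough sketch of what catching the kubelet's termination signal could look like; `releaseLock` and `runControllers` are hypothetical placeholders for whatever the SDK and the operator's manager would actually do, so this only illustrates the ordering (stop controllers, then delete the lock, then exit):

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// releaseLock is a hypothetical placeholder for deleting the lock ConfigMap.
func releaseLock() {}

// runControllers is a hypothetical placeholder for the controller manager;
// it blocks until the context is cancelled.
func runControllers(ctx context.Context) { <-ctx.Done() }

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// The kubelet sends SIGTERM before killing the container during an
	// eviction, which gives the operator a window to clean up.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM)

	go func() {
		<-sigCh
		log.Println("received SIGTERM: stopping controllers, then releasing the lock")
		cancel()      // signal controllers to stop (a real implementation would wait for them)
		releaseLock() // only then delete the lock, so two leaders never run at once
		os.Exit(0)
	}()

	runControllers(ctx)
}
```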
@joelanford won't the deadlock only occur in case 1b for
@estroz Not necessarily. If the node is unreachable, the apiserver can't tell the kubelet to delete the pods after https://kubernetes.io/docs/concepts/architecture/nodes/#condition
So deadlock forever is possible on partitioned/unreachable nodes. I don't think we should try to handle 1b. 1a is just a best-case scenario for 1, so I don't know if it's worth optimizing for. @joelanford I need to read more on the sequence of events when a pod is evicted, but from the termination lifecycle of a pod it doesn't seem that we need to handle the case of the pod being unresponsive to SIGTERM from the kubelet. https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods
In the case of no network partitions, once the grace period expires the pod will be killed via SIGKILL and removed from the API server. Going back to @ilanuriel's original issue:
Edit: The pods do get killed, it seems. Okay, so evicted pods will get killed (all containers stopped and resources reclaimed) but not deleted (i.e. the deletion timestamp is not set) from the API. This is true for pods managed by Deployments but not for DaemonSets or other controllers, and it seems intentional. See kubernetes/kubernetes#54525 (comment), kubernetes/kubernetes#54525 (comment), and kubernetes/kubernetes#54525 (comment)
@ilanuriel I don't know if you can still confirm this for your evicted pods, to see if they were marked with
I think it was marked with Failed.
I would like to emphasize that even if there are cases the framework won't fix, please be advised that the newly created pod could not become the leader, and after some time it stopped trying and remained in a Running state where it was essentially doing nothing. Only by deleting BOTH the evicted pod AND the new pod was I able to get things working again; since the new pod had stopped trying to become the leader, deleting ONLY the evicted pod was NOT enough. From a DevOps perspective, the new pod looked perfect: it was running, and nothing reported that this operator would be doing nothing.
This looks like a dup of an issue that was reported back in April. See: #1305
I agree that the simplest way to address this is for the non-leader to check the status of the leader pod and delete its lock if that pod's phase is
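A minimal sketch of that check, assuming a recent client-go and that the lock ConfigMap's owner reference identifies the current leader pod (the helper name is illustrative, not the SDK's actual code):

```go
package leaderexample

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteLeaderIfEvicted looks up the pod that owns the lock ConfigMap and
// deletes it if it was evicted (phase Failed with reason Evicted). Deleting
// the pod lets garbage collection remove the lock ConfigMap, so a healthy
// replica can then acquire the lock.
func deleteLeaderIfEvicted(ctx context.Context, c kubernetes.Interface, ns, lockName string) error {
	lock, err := c.CoreV1().ConfigMaps(ns).Get(ctx, lockName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	for _, ref := range lock.OwnerReferences {
		if ref.Kind != "Pod" {
			continue
		}
		leaderPod, err := c.CoreV1().Pods(ns).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if leaderPod.Status.Phase == corev1.PodFailed && leaderPod.Status.Reason == "Evicted" {
			// The leader was evicted but never deleted; removing it lets the
			// lock ConfigMap be garbage collected.
			return c.CoreV1().Pods(ns).Delete(ctx, leaderPod.Name, metav1.DeleteOptions{})
		}
	}
	return nil
}
```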
That sounds like a separate bug. Do you have any additional findings or info on why the new Pod stopped trying to become the leader?
When the leader pod is evicted but not deleted, the leader lock configmap is not garbage collected. With this patch, a pod attempting to become the leader is able to delete the evicted pod, triggering garbage collection and allowing it to create a new leader lock. Fixes operator-framework#1305. Fixes operator-framework#1874.
Before this patch, when the leader pod is evicted but not deleted, the leader lock configmap is not garbage collected and subsequent operators can never become leader. With this patch, an operator attempting to become the leader is able to delete the evicted operator pod, triggering garbage collection and allowing leader election to continue.

Sometimes, evicted operator pods will remain, even with this patch. This occurs when the leader operator pod is evicted and a new operator pod is created on the same node; in this case, the new pod will also be evicted. When an operator pod is created on a non-failing node, leader election will delete only the evicted leader pod, leaving any evicted operator pods that were not the leader.

To replicate the evicted state, I used a `kind` cluster with 2 worker nodes and an altered kubelet configuration:
- `memory.available`: set about 8Gi less than the host machine's `avail Mem` from `top`.
- `evictionPressureTransitionPeriod: 5s`: allows kubelet to evict earlier.

With these settings in place (and kubelet restarted), a memory-explosion function is triggered, which results in the eviction of all pods on that node.

Fixes operator-framework#1305. Fixes operator-framework#1874.
Has this issue been solved?
@avarf the patch isn't merged yet, but if you would like to help test it, the PR is #2210.
Demo of bugfix: https://www.youtube.com/watch?v=qTzdAYWPdCQ
The controller pod becomes a leader by acquiring a lock on a configmap resource.
Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, and thus the new pod could not become the leader.
The only way to fix that was to manually delete the evicted pod, which was holding the lock, and to delete the new pod, which could not acquire the lock and had given up after a few tries.
According to the Kubernetes documentation, evicted pods' locks are deleted when the pods are evicted.
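For reference, operators typically acquire this lock once at startup with the SDK's leader package; a minimal sketch of that call (the lock name here is an illustrative assumption):

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Become blocks (retrying) until this pod owns the lock ConfigMap.
	// The lock is "leader for life": it is released only when the owning
	// pod object is deleted and the ConfigMap is garbage collected.
	if err := leader.Become(context.TODO(), "my-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... set up and start the controller manager here ...
}
```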