Activator not updating healthy pods when probe fails #13531
Comments
I am also encountering the same problem. A new revision randomly starts working, but can suddenly drop out of activator routing. The next time the probe succeeds, the activator does not update routing either.
@Wouter0100 hi, could you paste logs for the failure after setting the log level to debug on the activator side? It might help to see which pods are considered ready and why they are not removed (in general, the status the activator sees).
/assign @andrew-su Taking a look.
Sorry @skonto, I missed your message. If anything is needed, I'm more than happy to help. Unfortunately, though, we no longer had this issue for longer periods of time in production after we improved our cluster's networking. We still think it occurs from time to time for a couple of seconds, but we are unable to diagnose it at those times.
Was your revision all healthy right before the network issues?
I was unable to reproduce this issue (tested on 1.11). To try to simulate the scenario described in the issue, I did the following: block traffic for pods that will be created (on my cluster the IPs are sequential).
With the fixes that Dave mentioned, this issue may be resolved. We should close this and, if it comes back, reopen it or create a new issue. /close
@andrew-su: Closing this issue.
Recently we had some issues with our Kubernetes clusters' networking being flaky (we are not sure yet, as we have not identified the root cause). During these networking issues, this problem with the Activator surfaced and we were able to diagnose it better. We had also seen it randomly in the past.
Randomly, a specific pod would fail its probe from the Activator due to a networking issue.
During this period, other pods that started terminating would not terminate properly. They would hang in the Terminating state until the graceful termination period was over, causing more traffic to be lost due to the sudden loss of a pod. The pods appeared to hang in the Terminating state because the Activator was still sending traffic to them. Our investigation showed that if we removed the pod with the failing probe, the other pods would terminate properly right after.
I would expect the Activator to keep updating healthy and unhealthy pods when probing some pods fails, but this does not seem to be the case.
What version of Knative?
1.8.0
Expected Behavior
The activator updates the healthy pods when a pod is not responding to the probe.
Actual Behavior
The activator does not update the healthy pods when a pod is not responding to the probe.
Steps to Reproduce the Problem
The terminating pods will still receive traffic, until they're forcefully removed by the termination grace period or by terminating the pod that has a failing probe.
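To make the expected behavior concrete, here is a minimal Python sketch of a per-pod health update in which one failing probe cannot block the update for the other pods. It is an illustration only and is not the Activator's actual code: the function name `update_healthy_pods`, the `probe` callable, and the example addresses are all hypothetical.

```python
from typing import Callable, Iterable, Set


def update_healthy_pods(
    candidates: Iterable[str],
    terminating: Set[str],
    probe: Callable[[str], bool],
) -> Set[str]:
    """Return the pod addresses that should keep receiving traffic.

    Hypothetical helper: each pod is probed independently, so a failed or
    raising probe excludes only that pod and never stalls the whole update.
    """
    healthy: Set[str] = set()
    for addr in candidates:
        if addr in terminating:
            # Terminating pods are dropped regardless of probe results.
            continue
        try:
            ok = probe(addr)
        except OSError:
            # A network error counts as a failed probe for this pod only.
            ok = False
        if ok:
            healthy.add(addr)
    return healthy


def fake_probe(addr: str) -> bool:
    # Simulate a flaky network path to a single pod (assumed address).
    if addr == "10.0.0.2":
        raise TimeoutError("probe timed out")
    return True


routable = update_healthy_pods(
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    terminating={"10.0.0.3"},
    probe=fake_probe,
)
print(sorted(routable))  # ['10.0.0.1']
```

In this sketch the pod with the failing probe (`10.0.0.2`) and the terminating pod (`10.0.0.3`) are both removed from routing in the same pass, which is the outcome the report expects from the Activator.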