Activator not updating healthy pods when probe fails #13531

Closed
Wouter0100 opened this issue Dec 6, 2022 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter

@Wouter0100

Wouter0100 commented Dec 6, 2022

Recently we had some issues with our Kubernetes clusters' networking being flaky (or something else; we have not yet identified the root cause). During these networking issues, this problem with the Activator surfaced and we were able to diagnose it better, as we had seen it randomly in the past as well.

Randomly, a specific pod would fail its probe from the Activator due to a networking issue:

{"severity":"WARNING","timestamp":"2022-12-04T23:28:31.80861786Z","logger":"activator","caller":"net/revision_backends.go:342","message":"Failed probing pods","commit":"9402a71-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7cccb78c69-pfrnk","knative.dev/key":"production/engine-nl-00165","curDests":{"ready":"100.64.196.67:8012,100.64.138.118:8012,100.64.138.8:8012,100.64.139.182:8012,100.64.189.10:8012,100.64.194.204:8012,100.64.194.234:8012,100.64.196.40:8012,100.64.111.196:8012,100.64.192.191:8012,100.64.194.187:8012,100.64.196.39:8012,100.64.193.141:8012,100.64.195.218:8012,100.64.111.24:8012,100.64.139.66:8012,100.64.139.95:8012,100.64.189.25:8012,100.64.193.242:8012,100.64.195.124:8012,100.64.196.76:8012","notReady":""},"error":"error roundtripping http://100.64.111.196:8012/healthz: context deadline exceeded"}

During this period, other pods that started terminating would not terminate properly. They would hang in the Terminating state until the graceful termination period was over, causing more traffic to be lost due to the sudden loss of a pod. The pods hanging in the Terminating state appeared to be the result of the Activator still sending traffic to them. Investigation showed that if we removed the pod with the failing probe, the other pods would terminate properly right after.

I would expect the Activator to keep updating the sets of healthy and unhealthy pods when probing some pods fails, but this seems not to be the case.
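
For illustration, here is a minimal Go sketch of the difference between the two behaviours. This is not the actual revision_backends.go code; every name in it is invented, and it only models the suspected control flow: one failing probe aborting the whole update pass versus the outcome of every probe being recorded per destination.

```go
package main

import "fmt"

type probeFn func(addr string) error

// updateAbortOnError mirrors the behaviour we appear to observe: one failing
// probe aborts the whole pass, so the ready set is never refreshed and
// terminating pods keep receiving traffic.
func updateAbortOnError(dests []string, probe probeFn) ([]string, error) {
	ready := make([]string, 0, len(dests))
	for _, d := range dests {
		if err := probe(d); err != nil {
			return nil, fmt.Errorf("failed probing pods: %w", err) // nothing gets updated
		}
		ready = append(ready, d)
	}
	return ready, nil
}

// updatePerDest is the behaviour we would expect: record the outcome per
// destination and keep refreshing the ready/notReady sets even while some
// probes keep failing.
func updatePerDest(dests []string, probe probeFn) (ready, notReady []string) {
	for _, d := range dests {
		if err := probe(d); err != nil {
			notReady = append(notReady, d)
			continue
		}
		ready = append(ready, d)
	}
	return ready, notReady
}

func main() {
	// Simulate one destination timing out, as in the log line above.
	probe := func(addr string) error {
		if addr == "100.64.111.196:8012" {
			return fmt.Errorf("context deadline exceeded")
		}
		return nil
	}
	dests := []string{"100.64.196.67:8012", "100.64.111.196:8012", "100.64.138.8:8012"}

	if _, err := updateAbortOnError(dests, probe); err != nil {
		fmt.Println("abort-on-error: no update,", err)
	}
	ready, notReady := updatePerDest(dests, probe)
	fmt.Println("per-destination: ready =", ready, "notReady =", notReady)
}
```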

What version of Knative?

1.8.0

Expected Behavior

The activator updates its set of healthy pods when a pod is not responding to the probe.

Actual Behavior

The activator does not update its set of healthy pods when a pod is not responding to the probe.

Steps to Reproduce the Problem

  • Start multiple pods.
  • Let a pod fail its probe.
  • Terminate other pods.

The terminating pods will still receive traffic until they are forcefully removed when the termination grace period expires, or until the pod with the failing probe is terminated.

Wouter0100 added the kind/bug label on Dec 6, 2022
@Kirkirillka

I also encounter the same problem.

Randomly a new revision starts working, but can suddenly drop out of activator routing. The next time the probe succeeds, the activator does not update the routing either.

ReToCode added the triage/accepted label on Mar 8, 2023
dprotaso added this to the v1.11.0 milestone on Mar 30, 2023
@skonto
Contributor

skonto commented May 4, 2023

@Wouter0100 hi, could you paste logs for the failure with the log level set to debug on the activator side? It might help to see which pods are considered ready and why they are not removed (in general, the state the activator sees).
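
For reference, a sketch of one way to capture those logs: raise the activator's log level through the config-logging ConfigMap in the knative-serving namespace. The per-component override key loglevel.activator is an assumption based on Knative's logging configuration, so verify it against your release; editing the ConfigMap with kubectl achieves the same thing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes the default ~/.kube/config location).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Merge-patch the per-component log level override for the activator.
	// "loglevel.activator" is assumed here; check your config-logging ConfigMap.
	patch := []byte(`{"data":{"loglevel.activator":"debug"}}`)
	cm, err := client.CoreV1().ConfigMaps("knative-serving").Patch(
		context.Background(), "config-logging", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("config-logging patched:", cm.Data["loglevel.activator"])
}
```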

ReToCode added the triage/needs-user-input label and removed the triage/accepted label on May 22, 2023
dprotaso modified the milestones: v1.11.0 → v1.12.0 on Aug 16, 2023
@andrew-su
Member

/assign @andrew-su

Taking a look.

@Wouter0100
Author

Sorry @skonto, I missed your message. If there's anything needed, I'm more than happy to help. Unfortunately though, we have not had this issue for longer periods of time in production anymore since we improved our cluster's networking.

We still think we have it from time to time for a couple of seconds, but are unable to diagnose that at those times.

@andrew-su
Member

Were all pods of your revision healthy right before the network issues?

@dprotaso
Member

This might be addressed by the following PRs:

#14347
#14303

@andrew-su
Member

I was unable to reproduce this issue (tested on 1.11).

I did the following to attempt to simulate the scenario described in the issue:

  • Block traffic for pods that will be created (on my cluster the IPs are sequential).
  • Send a burst of traffic to spin up new pods.
  • See the roundtripping error from probing.
  • Unblock traffic for the pods.
  • See all traffic get routed correctly.

@andrew-su
Member

With the fixes that Dave mentioned, this issue may be resolved. We should close this, and if it comes back, reopen it or create a new issue.

/close

knative-prow bot closed this as completed on Oct 12, 2023
@knative-prow

knative-prow bot commented Oct 12, 2023

@andrew-su: Closing this issue.

