Activator not updating healthy pods when probe fails #13531

Closed
Wouter0100 opened this issue Dec 6, 2022 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-user-input Issues which are waiting on a response from the reporter

@Wouter0100

Wouter0100 commented Dec 6, 2022

Recently we had some issues with our Kubernetes clusters' networking being flaky (or something else; we have not yet identified the root cause). During these networking issues, this problem with the Activator surfaced and we were able to diagnose it better, as we had seen it randomly in the past as well.

Randomly, a specific pod would fail its probe from the Activator due to a networking issue:

{"severity":"WARNING","timestamp":"2022-12-04T23:28:31.80861786Z","logger":"activator","caller":"net/revision_backends.go:342","message":"Failed probing pods","commit":"9402a71-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-7cccb78c69-pfrnk","knative.dev/key":"production/engine-nl-00165","curDests":{"ready":"100.64.196.67:8012,100.64.138.118:8012,100.64.138.8:8012,100.64.139.182:8012,100.64.189.10:8012,100.64.194.204:8012,100.64.194.234:8012,100.64.196.40:8012,100.64.111.196:8012,100.64.192.191:8012,100.64.194.187:8012,100.64.196.39:8012,100.64.193.141:8012,100.64.195.218:8012,100.64.111.24:8012,100.64.139.66:8012,100.64.139.95:8012,100.64.189.25:8012,100.64.193.242:8012,100.64.195.124:8012,100.64.196.76:8012","notReady":""},"error":"error roundtripping http://100.64.111.196:8012/healthz: context deadline exceeded"}

During this period, other pods that started terminating would not terminate properly. They would hang in the Terminating state until the graceful termination period was over, causing more traffic to be lost due to the sudden loss of a pod. The pods hanging in the Terminating state appeared to be the result of the Activator still sending traffic to them. Investigation showed that if we removed the pod with the failing probe, the other pods would terminate properly right after.

I would expect the Activator to keep updating the sets of healthy and unhealthy pods when probing some pods fails, but this seems not to be the case.
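
For illustration, here is a minimal Go sketch of the difference between the two behaviours. This is not the actual revision_backends.go code; every name in it is invented, and it only models the suspected control flow: one failing probe aborting the whole update pass versus the outcome of every probe being recorded per destination.

```go
package main

import "fmt"

type probeFn func(addr string) error

// updateAbortOnError mirrors the behaviour we appear to observe: one failing
// probe aborts the whole pass, so the ready set is never refreshed and
// terminating pods keep receiving traffic.
func updateAbortOnError(dests []string, probe probeFn) ([]string, error) {
	ready := make([]string, 0, len(dests))
	for _, d := range dests {
		if err := probe(d); err != nil {
			return nil, fmt.Errorf("failed probing pods: %w", err) // nothing gets updated
		}
		ready = append(ready, d)
	}
	return ready, nil
}

// updatePerDest is the behaviour we would expect: record the outcome per
// destination and keep refreshing the ready/notReady sets even while some
// probes keep failing.
func updatePerDest(dests []string, probe probeFn) (ready, notReady []string) {
	for _, d := range dests {
		if err := probe(d); err != nil {
			notReady = append(notReady, d)
			continue
		}
		ready = append(ready, d)
	}
	return ready, notReady
}

func main() {
	// Simulate one destination timing out, as in the log line above.
	probe := func(addr string) error {
		if addr == "100.64.111.196:8012" {
			return fmt.Errorf("context deadline exceeded")
		}
		return nil
	}
	dests := []string{"100.64.196.67:8012", "100.64.111.196:8012", "100.64.138.8:8012"}

	if _, err := updateAbortOnError(dests, probe); err != nil {
		fmt.Println("abort-on-error: no update,", err)
	}
	ready, notReady := updatePerDest(dests, probe)
	fmt.Println("per-destination: ready =", ready, "notReady =", notReady)
}
```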

What version of Knative?

1.8.0

Expected Behavior

The activator updates its set of healthy pods when a pod is not responding to the probe.

Actual Behavior

The activator does not update its set of healthy pods when a pod is not responding to the probe.

Steps to Reproduce the Problem

  • Start multiple pods.
  • Let a pod fail its probe.
  • Terminate other pods.

The terminating pods will still receive traffic until they are forcefully removed when the termination grace period expires, or until the pod with the failing probe is terminated.

Wouter0100 added the kind/bug label on Dec 6, 2022
@Kirkirillka

I also encounter the same problem.

Randomly a new revision starts working, but can suddenly drop out of activator routing. The next time the probe succeeds, the activator does not update the routing either.

ReToCode added the triage/accepted label on Mar 8, 2023
dprotaso added this to the v1.11.0 milestone on Mar 30, 2023
@skonto
Contributor

skonto commented May 4, 2023

@Wouter0100 hi, could you paste logs for the failure with the log level set to debug on the activator side? It might help to see which pods are considered ready and why they are not removed (in general, the state the activator sees).
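
For reference, a sketch of one way to capture those logs: raise the activator's log level through the config-logging ConfigMap in the knative-serving namespace. The per-component override key loglevel.activator is an assumption based on Knative's logging configuration, so verify it against your release; editing the ConfigMap with kubectl achieves the same thing.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (assumes the default ~/.kube/config location).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Merge-patch the per-component log level override for the activator.
	// "loglevel.activator" is assumed here; check your config-logging ConfigMap.
	patch := []byte(`{"data":{"loglevel.activator":"debug"}}`)
	cm, err := client.CoreV1().ConfigMaps("knative-serving").Patch(
		context.Background(), "config-logging", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("config-logging patched:", cm.Data["loglevel.activator"])
}
```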

ReToCode added the triage/needs-user-input label and removed the triage/accepted label on May 22, 2023
dprotaso modified the milestones: v1.11.0 → v1.12.0 on Aug 16, 2023
@andrew-su
Member

/assign @andrew-su

Taking a look.

@Wouter0100
Author

Sorry @skonto, I missed your message. If there's anything needed, I'm more than happy to help. Unfortunately though, we have not had this issue for longer periods of time in production anymore since we improved our cluster's networking.

We still think we have it from time to time for a couple of seconds, but are unable to diagnose that at those times.

@andrew-su
Member

Were all pods of your revision healthy right before the network issues?

@dprotaso
Member

This might be addressed by the following PRs:

#14347
#14303

@andrew-su
Member

I was unable to reproduce this issue (tested on 1.11).

I did the following to attempt to simulate the scenario described in the issue:

  • Block traffic for pods that will be created (on my cluster the IPs are sequential).
  • Send a burst of traffic to spin up new pods.
  • See the roundtripping error from probing.
  • Unblock traffic for the pods.
  • See all traffic get routed correctly.

@andrew-su
Member

With the fixes that Dave mentioned, this issue may be resolved. We should close this, and if it comes back, reopen it or create a new issue.

/close

knative-prow bot closed this as completed on Oct 12, 2023
@knative-prow

knative-prow bot commented Oct 12, 2023

@andrew-su: Closing this issue.

