
Unable to start up / Liveness probe failed #4898

Closed
Dyllaann opened this issue Jan 8, 2020 · 7 comments

Comments

@Dyllaann

Dyllaann commented Jan 8, 2020

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
liveness, readiness, store, event, ingress, startup

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

NGINX Ingress controller version:
0.26.2

Kubernetes version (use kubectl version):
v1.14.8

Environment:

  • Cloud provider or hardware configuration: AKS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
  • Kernel (e.g. uname -a):
  • Install tools: Helm v2.13.1
  • Others:

What happened:
On random occasions, pods are unable to start up the nginx controller.
After 30 seconds (the initialDelaySeconds configured on the liveness/readiness probes) the status changes to
Readiness probe failed: Get http://10.244.1.29:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[Screenshot of the failing pod's events; domain name blurred out.]
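For reference, the probes on the controller container look roughly like this (a sketch; the exact values in our deployment may differ slightly, but the /healthz endpoint on port 10254 and the 30 second initial delay are as described above):

livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3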

What you expected to happen:
The pod should start up successfully, as some pods do, and the liveness/readiness state should be healthy after the configured interval.
[Screenshot of a pod that started up successfully; domain name blurred out.]

How to reproduce it (as minimally and precisely as possible):
I'm unsure how to reproduce this issue reliably.
One way is to kill the pod, wait for a replacement to come up, and see whether that one fails.
It might be worth noting that even for pods that do start up successfully, the event.go:255] Event(v1.ObjectReference{Kind:"ConfigMap", log line takes over 30 seconds to appear, after which the controller starts up fine.
The current setup does not inspire confidence in the stability of the controllers, as a new pod might not start up successfully.

Anything else we need to know:
The log lines about Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused can be ignored, as they are the result of the Prometheus ServiceMonitor scraping the pod before it has started properly.

@devops-corgi

devops-corgi commented Jan 23, 2020

Can confirm, seeing this exact issue. Nginx restarting like crazy. Example:

NAME                                                     READY   STATUS    RESTARTS   AGE
ingress-nginx-ingress-controller-5qkd4                   1/1     Running   10         13h
ingress-nginx-ingress-controller-jbzkw                   1/1     Running   21         13h
ingress-nginx-ingress-controller-qqrrh                   1/1     Running   104        13h
ingress-nginx-ingress-controller-r96vj                   1/1     Running   6          13h
ingress-nginx-ingress-controller-wn877                   1/1     Running   7          13h

Running in EKS with Kubernetes version 1.14.8. Nginx version 0.27.1 (latest stable at the moment of writing this).

Edit: I've been able to narrow it down: the restarts correlate almost 100% with a node spiking to 100% CPU.

Restarts for an nginx pod running on a spiking node:
[screenshot]

CPU utilization:
[screenshot]

According to #4505 this should have been fixed by #4487 (I confirmed that #4487 has been included since 0.26.something), but I'm still seeing the exact same thing.

@Stono
Contributor

Stono commented Jan 24, 2020

We are seeing exactly this on 0.27.1. We're going to roll back to 0.27.0, as I don't remember seeing it on that version.

@aledbf
Member

aledbf commented Jan 24, 2020

According to #4505 this should have been fixed in #4487

Is the node where the ingress controller pod is running using 100% of the CPU?
Have you configured the limit range? (https://github.com/kubernetes/ingress-nginx/blob/master/deploy/static/mandatory.yaml#L280-L294)

The issue is that when CPU utilization on the node reaches 100%, the ingress controller starts failing the probes because of the lack of CPU time assigned to the pod.
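That section of mandatory.yaml adds a LimitRange to the ingress-nginx namespace, roughly along these lines (quoted from memory, so check the manifest itself for the exact values):

apiVersion: v1
kind: LimitRange
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  limits:
    - type: Container
      min:
        cpu: 100m
        memory: 90Mi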

@Stono
Contributor

Stono commented Jan 24, 2020

Hey,
So we removed the pod limits (and don't have a limit range), which would correlate with this problem (our logic is we never want nginx to be throttled). Adding the pod limits back has totally resolved the issue.

However, our nodes were nowhere near 100% CPU usage!
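The limits we added back are set via the chart's controller.resources value, roughly like this (the numbers are illustrative of our setup, not a recommendation):

controller:
  resources:
    requests:
      cpu: 100m
      memory: 90Mi
    limits:
      cpu: "1"
      memory: 512Mi

Setting a CPU limit puts the container under a CFS quota instead of letting it contend freely for the whole node, which is presumably why the probe failures stopped for us.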

@aledbf
Member

aledbf commented Jan 24, 2020

So we removed the pod limits (and don't have a limit range), which would correlate with this problem (our logic is we never want nginx to be throttled).

I share the idea "our logic is we never want nginx to be throttled", but the issue here is that any spike in CPU can lead to probe failures.

However our nodes were nowhere near 100%... cpu usage!

Interesting.

@aledbf
Member

aledbf commented Jan 26, 2020

Closing. Fixed in #4959.
Test image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller-amd64:goroutines

Please reopen if you can reproduce the issue with this new image.

@aledbf aledbf closed this as completed Jan 26, 2020
@Stono
Contributor

Stono commented Jan 26, 2020

I can confirm I have tested with this image and no longer get CPU spikes when shutting down.
