
Unable to start up / Liveness probe failed #4898

Closed
Dyllaann opened this issue Jan 8, 2020 · 7 comments

Comments

@Dyllaann

Dyllaann commented Jan 8, 2020

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
liveness, readiness, store, event, ingress, startup

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

NGINX Ingress controller version:
0.26.2

Kubernetes version (use kubectl version):
v1.14.8

Environment:

  • Cloud provider or hardware configuration: AKS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.6 LTS
  • Kernel (e.g. uname -a):
  • Install tools: Helm v2.13.1
  • Others:

What happened:
On random occasions, pods are unable to start up the nginx controller.
After 30 seconds (the initialDelaySeconds configured on the liveness/readiness probes) the status changes to
Readiness probe failed: Get http://10.244.1.29:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
[Screenshot of the failing pod's events; domain name blurred out.]
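For reference, the probes on the controller container look roughly like this (a sketch; the exact values in our deployment may differ slightly, but the /healthz endpoint on port 10254 and the 30 second initial delay are as described above):

livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 1
  failureThreshold: 3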

What you expected to happen:
The pod should start up successfully, as some pods do, and the liveness/readiness state should be healthy after the configured interval.
[Screenshot of a pod that started up successfully; domain name blurred out.]

How to reproduce it (as minimally and precisely as possible):
I'm unsure how to reproduce this issue reliably.
One way is to kill the pod, wait for a replacement to come up, and see whether that one fails.
It might be worth noting that even for pods that do start up successfully, the event.go:255] Event(v1.ObjectReference{Kind:"ConfigMap", log line takes over 30 seconds to appear, after which the controller starts up fine.
The current setup does not inspire confidence in the stability of the controllers, as a new pod might not start up successfully.

Anything else we need to know:
The log lines about Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused can be ignored, as they are the result of the Prometheus ServiceMonitor scraping the pod before it has started properly.

@devops-corgi

devops-corgi commented Jan 23, 2020

Can confirm, seeing this exact issue. Nginx restarting like crazy. Example:

NAME                                                     READY   STATUS    RESTARTS   AGE
ingress-nginx-ingress-controller-5qkd4                   1/1     Running   10         13h
ingress-nginx-ingress-controller-jbzkw                   1/1     Running   21         13h
ingress-nginx-ingress-controller-qqrrh                   1/1     Running   104        13h
ingress-nginx-ingress-controller-r96vj                   1/1     Running   6          13h
ingress-nginx-ingress-controller-wn877                   1/1     Running   7          13h

Running in EKS with Kubernetes version 1.14.8. Nginx version 0.27.1 (latest stable at the moment of writing this).

Edit: I've been able to narrow it down: the restarts correlate almost 100% with a node spiking to 100% CPU.

Restarts for an nginx pod running on a spiking node:
[screenshot]

CPU utilization:
[screenshot]

According to #4505 this should have been fixed by #4487 (I confirmed that #4487 has been included since 0.26.something), but I'm still seeing the exact same thing.

@Stono
Contributor

Stono commented Jan 24, 2020

We are seeing exactly this on 0.27.1. We're going to roll back to 0.27.0, as I don't remember seeing it on that version.

@aledbf
Member

aledbf commented Jan 24, 2020

According to #4505 this should have been fixed in #4487

Is the node where the ingress controller pod is running using 100% of the CPU?
Have you configured the limit range? (https://github.com/kubernetes/ingress-nginx/blob/master/deploy/static/mandatory.yaml#L280-L294)

The issue is that when CPU utilization on the node reaches 100%, the ingress controller starts failing the probes because of the lack of CPU time assigned to the pod.
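That section of mandatory.yaml adds a LimitRange to the ingress-nginx namespace, roughly along these lines (quoted from memory, so check the manifest itself for the exact values):

apiVersion: v1
kind: LimitRange
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  limits:
    - type: Container
      min:
        cpu: 100m
        memory: 90Mi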

@Stono
Contributor

Stono commented Jan 24, 2020

Hey,
So we removed the pod limits (and don't have a limit range), which would correlate with this problem (our logic is we never want nginx to be throttled). Adding the pod limits back has totally resolved the issue.

However, our nodes were nowhere near 100% CPU usage!
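The limits we added back are set via the chart's controller.resources value, roughly like this (the numbers are illustrative of our setup, not a recommendation):

controller:
  resources:
    requests:
      cpu: 100m
      memory: 90Mi
    limits:
      cpu: "1"
      memory: 512Mi

Setting a CPU limit puts the container under a CFS quota instead of letting it contend freely for the whole node, which is presumably why the probe failures stopped for us.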

@aledbf
Member

aledbf commented Jan 24, 2020

So we removed the pod limits (and don't have a limit range), which would correlate with this problem (our logic is we never want nginx to be throttled).

I share the idea "our logic is we never want nginx to be throttled", but the issue here is that any spike in CPU can lead to probe failures.

However our nodes were nowhere near 100%... cpu usage!

Interesting.

@aledbf
Member

aledbf commented Jan 26, 2020

Closing. Fixed in #4959.
Test image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller-amd64:goroutines

Please reopen if you can reproduce the issue with this new image.

@aledbf aledbf closed this as completed Jan 26, 2020
@Stono
Contributor

Stono commented Jan 26, 2020

I can confirm I have tested with this image and no longer get CPU spikes when shutting down.
