Regular crash of ngnix ingress controller pods after upgrade to 0.20.0 image #3457
Comments
@EcaterinaGr please post the start of the log. Maybe you are being affected by Azure/AKS#435, which you should follow. |
@aledbf, are you asking for the start of the event logs or of the pod logs? Grabbed more logs from the docker container itself before it was killed: |
Hi,
|
Related to this: #2833 |
Hello, any updates regarding the issue? |
As soon as we add certificates (same creation as on our dev clusters) nginx starts to crash. |
Is there a solution for this bug? Thanks. |
We have been experiencing similar issues where we see:
much more frequently for some of our nginx pods than we did in the past, and it is leading to public-facing errors. Sometimes it seems to manifest as cert errors (we do TLS termination with nginx). I don't know exactly when it started, but we started noticing it when we were on 0.20.0; we are now on 0.22.0 and still seeing these issues. Our probe timeout is set to 1s, and we do not see any errors in the nginx controller itself. As of now we have no idea why this is happening. We are running on AWS with a kops cluster on k8s 1.11.6. |
@Globegitter what do memory and CPU utilization look like during that time? It could be network related too. If it is feasible, you can also try downgrading to the earlier ingress version that worked well for you, to confirm that the problem is caused by ingress-nginx. |
@ElvinEfendi Do you mean specifically of these nginx pods, or node/cluster wide? We still have to investigate further, and we have been serving quite a bit more traffic for over a month now. I just found that this issue seemed to fit interestingly well, but it could very much be a coincidence. Unfortunately we are using a 0.20.0-specific feature, so downgrading is going to be a bit more difficult, but depending on how this continues we'll definitely be doing some investigation on that next week. |
I guess both. I'm suggesting this to make sure the timeout is not caused by the pod being too busy doing other things and therefore unable to respond to health checks in time. |
@ElvinEfendi yeah, that is a good pointer: for the last 2 hours or so only one pod has been crashing, a fair bit, and it is that pod that has much higher CPU and memory usage than the other pods. So the interesting question is why that one pod would have a much higher usage than our other 3 pods. Further, we have cluster-wide traffic of <2000 req/s, so even if all of that were hitting one nginx I would still expect it to be able to respond to health probes. |
@Globegitter maybe you can help us test this by running some scripts :)
This will print something like
This can help us test what @ElvinEfendi said in the previous comment. Edit: maybe I should put this in a k8s Job to help debug this issue. |
@aledbf Ah yeah, that is a good idea for debugging. We just raised the memory limit and the pods are re-rolling, but if the timeout issue persists or pops up again I will make sure to use the debug script and post the results here. Edit: And a Job for that would really be useful. |
@aledbf it is still happening, here is some output:
Some of the 500s mixed in are from when the pod was taken out of service from the k8s probes, but interesting to see that health checks sometimes take up to 3.5s. |
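The script and its output are not reproduced in this thread. As a rough illustration of the kind of check being discussed, here is a minimal Python sketch that polls a health endpoint and records status and response time. It is not the script posted above; the function names, the one-second "slow" threshold, and the defaults are assumptions made for this example. Only the `/healthz` path and port 10254 come from the event log in the issue description.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe_once(url, timeout):
    """Return (status, seconds); status is None if no HTTP response arrived."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except (OSError, http.client.HTTPException):
        status = None  # timeout, connection refused, reset, ...
    return status, time.monotonic() - start


def summarize(samples, threshold):
    """Count failures and slow responses in a list of (status, seconds)."""
    failed = sum(1 for status, _ in samples if status != 200)
    slow = sum(1 for _, secs in samples if secs > threshold)
    worst = max((secs for _, secs in samples), default=0.0)
    return {"total": len(samples), "failed": failed, "slow": slow, "worst": worst}


def run(url, count=60, interval=1.0, timeout=5.0, threshold=1.0):
    """Probe `url` `count` times, sleeping `interval` seconds between probes."""
    samples = []
    for _ in range(count):
        samples.append(probe_once(url, timeout))
        time.sleep(interval)
    return summarize(samples, threshold)
```

Something like `run("http://<pod-ip>:10254/healthz")`, executed from a node or a pod that can reach the controller, would give a count of probes slower than the threshold, which can then be compared with the probe's `timeoutSeconds`.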
@Globegitter please check the log of the pod with IP |
Just checked, we do not have a logging collector active yet for the nginx logs so only see part of the logs of the previous pod that errored. I do see a few and at the end:
Otherwise it is just 900+ lines of normal json log of requests being made. I can take a look next week at exporting the logs, or tailing them at the same time. Would you expect to see something in the error log? Or just the normal response logging? |
This issue is hard to reproduce; for this reason, I created #3684 to see if we can narrow the scope of the issue. Basically, I added additional logs to the health check inside the ingress controller to detect where we receive an error. Please help us by using the image Note: for security reasons, the probes do not show the exact cause of the failure unless you increase the log level of kubelet to a value higher than five (this is the reason why I added the additional logs).
|
@aledbf Thanks, I will start investigating now. On Friday we were able to resolve the issue by increasing the number of replicas, but it has started happening again. I will test this image. |
So, we have some output:
for another pod:
But again that is just from the logs that I could tail, but at least it is giving us something more. Also I am currently tailing all running pods with a grep for |
This is "normal" because the health check starts before nginx. You should see this error only once in the log.
This is not ok. This output means the ingress controller startup (nginx binary) takes more than ten seconds for the initial sync, which is not normal. |
yep all logs the same:
Here again, I just looked over the logs of all the running pods and just grepped for healthcheck. Edit: Ah, interesting, I will try to keep an eye out specifically on startup. Some of these definitely happen to instances that have been running well for a while, but we have not looked at startup "issues" yet. |
@Globegitter please check and post the generated nginx.conf
Is IPv6 enabled? (Or is it IPv6 only?) |
To those affected by this issue: Please help us to test a fix for this with #3684 using the image The mentioned PR contains a refactoring of the nginx server used for health-check and Lua configuration replacing the TCP port with a unix socket. |
Hi, we were affected by this issue. We saw 150 connections per second would cause our ingress controller to restart. When we checked the resource it was failing health checks just as described here in this issue. We would sometimes see timeouts as high as 10.x seconds. We tested again with: This dev image fully resolved our issue, we are no longer seeing restarts at high connection rates. |
@sslavic yes |
#5832 could improve the situation once released. |
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
NGINX Ingress controller version:
0.20.0
Kubernetes version (use kubectl version): 1.9.6
Environment:
Production
Azure
Ubuntu 16.04
(uname -a): Linux k8s-master-81594228-1 4.13.0-1012-azure #15-Ubuntu SMP Thu Mar 8 10:47:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
acs engine, ansible, terraform
What happened:
After upgrading the nginx ingress controller from 0.15.0 to 0.20.0, the controller pods regularly crash after several timeout messages on the liveness probe. The nginx ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx config map to avoid a security vulnerability (a user forging the headers to bypass the nginx whitelist).
What you expected to happen:
Stable behavior of nginx ingress controller pods as in version 0.15.0.
How to reproduce it (as minimally and precisely as possible):
Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment.
Anything else we need to know:
Logs from the events:
2018-11-21 17:24:25 +0100 CET 2018-11-21 17:23:05 +0100 CET 6 nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba Pod spec.containers{nginx-ingress-controller} Warning Unhealthy kubelet, k8s-dmz-81594228-0 Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET 2018-11-21 17:24:26 +0100 CET 1 nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c Pod spec.containers{nginx-ingress-controller} Normal Killing kubelet, k8s-dmz-81594228-0 Killing container with id docker://nginx-ingress-controller:Container failed liveness probe.. Container will be killed and recreated.
We have tried increasing timeoutSeconds on the liveness probe to 4s and also adding the argument --enable-dynamic-configuration=false to the nginx deployment.
With this configuration the number of timeouts decreased, but once the load from the apps on the platform reaches a certain level, the timeouts become more frequent.
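To make the effect of the probe settings concrete, here is a small Python sketch of how a liveness probe decides to restart a container: a probe counts as failed when it takes longer than `timeoutSeconds`, and the container is killed once `failureThreshold` consecutive probes have failed. This is a simplified model, not kubelet code. The function name and the sample latencies are made up for illustration, and the defaults shown (1 second, 3 failures) are the Kubernetes defaults, which may differ from the values in a given deployment manifest.

```python
def first_restart(latencies, timeout_seconds=1.0, failure_threshold=3):
    """Return the index of the probe that would trigger a restart, or None.

    latencies: response time in seconds of each consecutive probe
    (use float("inf") for a probe that never got an answer).
    """
    consecutive = 0
    for i, secs in enumerate(latencies):
        if secs > timeout_seconds:
            consecutive += 1
            if consecutive >= failure_threshold:
                return i
        else:
            consecutive = 0  # a single success resets the failure counter
    return None


# Hypothetical latencies, not measurements from this cluster.
observed = [0.2, 3.5, 2.0, 1.5, 0.3]
print(first_restart(observed, timeout_seconds=1.0))  # restart at the fourth probe
print(first_restart(observed, timeout_seconds=4.0))  # no restart
```

Under this model a longer timeout only helps while health-check latency stays below it, which would be consistent with the timeouts returning once the load grows.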
Logs from nginx pods in debug mode and timeout 3sec:
{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap "kube-system/udp-services": no object matching key "kube-system/udp-services" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}