
handle api server blips #3043

Closed
slaupster opened this issue Sep 5, 2018 · 6 comments

Comments

@slaupster

NGINX Ingress controller version: 0.17.1

Kubernetes version (use kubectl version): 1.10.7

Environment: Any

  • Cloud provider or hardware configuration: N/A
  • OS (e.g. from /etc/os-release): N/A
  • Kernel (e.g. uname -a): N/A
  • Install tools: N/A
  • Others: N/A

What happened:
When the API server is unavailable, NGINX Ingress stops working, believing no services are available. This causes a total ingress outage when a config update happens even though only the API server is unavailable.
Everything in the data plane is fine.

What you expected to happen:
NGINX Ingress should detect that the underlying Kubernetes client cannot connect to the API server and temporarily disable config updates. This is not ideal, and there is nothing to do other than get the API server back, but it's better than a total outage. Once the connection to the API server is restored and a successful informer cycle has run, config updates should be re-enabled.
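
For illustration only, here is a minimal sketch of the behaviour being asked for; this is not the actual ingress-nginx code, the helper names (apiServerReachable, syncLoop, applyConfig) and the 30-second interval are made up:

    // Sketch only: probe the apiserver before regenerating the NGINX config and
    // keep the last known-good configuration while the apiserver is unreachable.
    package main

    import (
        "context"
        "log"
        "time"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // apiServerReachable is a hypothetical helper; asking for the server version
    // fails quickly when the connection to the apiserver is down.
    func apiServerReachable(client kubernetes.Interface) bool {
        _, err := client.Discovery().ServerVersion()
        return err == nil
    }

    // syncLoop periodically applies configuration, but skips the update while the
    // apiserver cannot be reached so the data plane keeps the previous config.
    func syncLoop(ctx context.Context, client kubernetes.Interface, applyConfig func() error) {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                if !apiServerReachable(client) {
                    log.Println("apiserver unreachable; keeping last known-good config")
                    continue
                }
                if err := applyConfig(); err != nil {
                    log.Printf("config update failed: %v", err)
                }
            }
        }
    }

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        syncLoop(context.Background(), client, func() error {
            // Placeholder for regenerating and reloading nginx.conf.
            return nil
        })
    }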

How to reproduce it (as minimally and precisely as possible):
Set up an ingress, stop/inhibit the API server, wait for the informers to update (with nothing), and observe that the ingress no longer works even though the actual pods and services are still active and viable.

Anything else we need to know: no

@aledbf
Member

aledbf commented Sep 5, 2018

@slaupster we disable the resync of the informers (#2634).
If the controller is already running, this should not be an issue. That said, the controller will not start if we cannot reach the apiserver (there is nothing we can do about that).
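
For context, a minimal client-go sketch of what disabling the informer resync means (this is not the controller's actual code): a resync period of 0 stops the periodic re-delivery of cached objects, so a running controller keeps its local cache across apiserver blips.

    // Sketch: create informers with resync disabled (period 0), as referenced above.
    package main

    import (
        "log"

        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/cache"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // A resync period of 0 disables periodic resync; updates still arrive
        // through the watch connection while it is healthy.
        factory := informers.NewSharedInformerFactory(client, 0)
        svcInformer := factory.Core().V1().Services().Informer()

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)

        // Block until the local cache has been populated once.
        if !cache.WaitForCacheSync(stop, svcInformer.HasSynced) {
            log.Fatal("timed out waiting for caches to sync")
        }
        log.Println("service cache synced; periodic resync disabled")
    }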

@slaupster
Author

slaupster commented Sep 5, 2018

Thanks for the reply @aledbf

#2634 made it into 0.16.0 - I've hit this issue more than once with 0.17.1.

Logs look like
ingress.log

Nginx Ingress Pods were running happily for days before and days since, so it recovers fine.

           - /nginx-ingress-controller
           - --default-backend-service={{ .Values.namespace }}/default-http-backend
           - --tcp-services-configmap={{ .Values.namespace }}/tcp-configmap
           - --configmap={{ .Values.namespace }}/nginx-configuration
           - --enable-dynamic-configuration=true
           - --watch-namespace={{ .Values.namespace }}
           - --update-status=false

@aledbf
Member

aledbf commented Sep 5, 2018

E0827 21:13:39.275400 8 reflector.go:205] k8s.io/ingress-nginx/internal/ingress/controller/store/store.go:172: Failed to list *v1beta1.Ingress: Get https://172.21.0.1:443/apis/extensions/v1beta1/namespaces//ingresses?limit=500&resourceVersion=0: dial tcp 172.21.0.1:443: connect: connection timed out

This is expected. The informers (the sync mechanism from client-go) detect connection issues with the apiserver. The content of the informers (services, configmaps, endpoints, secrets) should still be there.
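
To make that concrete, here is a small runnable sketch (using a fake clientset so it runs anywhere; the object names are illustrative, and this is not the controller's code) showing that once an informer has synced, lookups go to its local store, which keeps serving the last known objects even if the connection to the apiserver breaks:

    // Sketch: reads go to the informer's local cache, not to the apiserver.
    package main

    import (
        "log"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes/fake"
        listerscorev1 "k8s.io/client-go/listers/core/v1"
        "k8s.io/client-go/tools/cache"
    )

    // cachedEndpointAddresses reads Endpoints from the informer's local store only;
    // no request to the apiserver is made at lookup time.
    func cachedEndpointAddresses(lister listerscorev1.EndpointsLister, namespace, name string) int {
        ep, err := lister.Endpoints(namespace).Get(name)
        if err != nil {
            log.Printf("not in local cache: %v", err)
            return 0
        }
        total := 0
        for _, subset := range ep.Subsets {
            total += len(subset.Addresses)
        }
        return total
    }

    func main() {
        // A fake clientset stands in for the apiserver so the example is self-contained.
        client := fake.NewSimpleClientset(&corev1.Endpoints{
            ObjectMeta: metav1.ObjectMeta{Name: "my-service", Namespace: "default"},
            Subsets: []corev1.EndpointSubset{
                {Addresses: []corev1.EndpointAddress{{IP: "10.0.0.1"}, {IP: "10.0.0.2"}}},
            },
        })

        factory := informers.NewSharedInformerFactory(client, 0)
        lister := factory.Core().V1().Endpoints().Lister()
        informer := factory.Core().V1().Endpoints().Informer()

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        if !cache.WaitForCacheSync(stop, informer.HasSynced) {
            log.Fatal("timed out waiting for caches to sync")
        }

        log.Printf("addresses in local cache: %d", cachedEndpointAddresses(lister, "default", "my-service"))
    }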

W0827 21:25:41.767916 8 controller.go:359] Service does not have any active Endpoint

This should not happen. Let me see if I can reproduce it locally.

@aledbf
Member

aledbf commented Sep 8, 2018

@slaupster I cannot reproduce the issue you are describing. Please check the gist https://gist.github.com/aledbf/5a24605f2083558b2d3be2b014c43c44

Scenarios:

  1. single ingress, short unavailability of apiserver
  2. 500 ingresses, short unavailability of apiserver
  3. multiple unavailabilities (minutes to more than an hour) of apiserver

@aledbf
Member

aledbf commented Sep 8, 2018

@slaupster also, when the apiserver returned there wasn't a single reload. From your logs it seems that you have connectivity issues with the master and that some ingress/service changed?

@aledbf
Member

aledbf commented Sep 8, 2018

Closing. Please reopen if you can provide a reproducible scenario of the issue you described.

aledbf closed this as completed on Sep 8, 2018