Skip to content
This repository has been archived by the owner on Apr 17, 2019. It is now read-only.

[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

Closed
fcvarela opened this issue Jul 5, 2016 · 5 comments
Closed

Comments

@fcvarela
Copy link
Contributor

fcvarela commented Jul 5, 2016

The ingress controller writes certificates to /etc/nginx-ssl on a goroutine and reload nginx on another. On a busy nginx controller, I've observed that nginx throws errors like:

2016/07/05 11:32:56 [emerg] 13#13: PEM_read_bio_X509("/etc/nginx-ssl/xxx-yyy") failed (SSL: error:0906D066:PEM routines:PEM_read_bio:bad end line)

After this happens any rc/pod/deployment restart that results in pods getting new IPs will make the entire upstream unavailable (503 on nginx) as the aborted reload renders it unaware of those changes.

Manual inspection of nginx.conf shows the upstream does have the new ips, that the certificates and keys are correct.

A second manual reload solves the issue.

This is caused by writing certificates/keys using ioutil (async) while nginx is trying to read them. I have a working and verified patch that changes it to tempfiles and atomic renames.

@aledbf
Copy link
Contributor

aledbf commented Jul 5, 2016

@fcvarela do you have some step to replicate this error?
(I'm trying to get this error in order to test your PR)

@fcvarela
Copy link
Contributor Author

fcvarela commented Jul 5, 2016

@aledbf Unfortunately no, it requires setting up a few hundred ingress rules spanning a few dozen hosts all using tls secrets. I set logging level to --v=5 and noticed the nginx error interleaved with logging from ssl.go updating pem files from secrets, indicating the nginx reload signal was triggered while files were being written in a non atomic way, hence the rename from tmp instead of asynchronous write in place.

@aledbf
Copy link
Contributor

aledbf commented Jul 6, 2016

@fcvarela how many certificates are you trying to use?
I've created this script that creates 1000 certificates and ingress rules and I cannot reproduce this issue

@fcvarela
Copy link
Contributor Author

fcvarela commented Jul 6, 2016

Not that many, we are using 12 certificates covering a ~200 hosts (one of the certs is a wildcard) and ~500 ingress rules. My impression is that once the pem files are generated all will work fine. But if you have continuous activity generating ingress rules and adding hosts or containers crash-looping (and therefore registering/unregistering endpoints as upstream servers), then the pem generation will run continuously.

This problem happens when nginx is reloading from a previous iteration and new pem files are being written over existing ones (same content). You should see the problem occurring if you keep the ingress generation loop going indefinitely or if you crashloop the destination backend.

In any case, there is no locking in place to prevent nginx from reloading while pem files are being written, so they should be written atomically. That guarantees nginx will always "see" a complete pem file.

@fcvarela
Copy link
Contributor Author

fcvarela commented Jul 6, 2016

Here's the error output (--v=5)
I0706 12:07:18.444747 1 controller.go:1036] endpoints found: [{10.244.2.8 8080 0 0}]
I0706 12:07:18.451327 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
I0706 12:07:18.453132 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
2016/07/06 12:07:18 [emerg] 13#13: PEM_read_bio_X509_AUX("/etc/nginx-ssl/NAMESPACE-SECRETNAME.pem") failed (SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line:Expecting: TRUSTED CERTIFICATE)
I0706 12:07:18.453897 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
I0706 12:07:18.454973 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2

A few seconds later, a manual ngins -s reload works fine. That certificate was being written by ssl.go during the previous nginx -s reload.

The patched version is running on our cluster since I submitted the PR and we haven't seen that happen again.

bprashanth added a commit that referenced this issue Jul 6, 2016
Addresses #1314 [nginx-ingress-controller ssl nginx reload abort]
aledbf pushed a commit to aledbf/contrib that referenced this issue Nov 10, 2016
Removed comment to be consistent w/ rest of code

Fixes typo and string concat
aledbf pushed a commit to aledbf/contrib that referenced this issue Nov 10, 2016
Addresses kubernetes-retired#1314 [nginx-ingress-controller ssl nginx reload abort]
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants