[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

fcvarela · 2016-07-05T11:45:36Z

The ingress controller writes certificates to /etc/nginx-ssl on a goroutine and reload nginx on another. On a busy nginx controller, I've observed that nginx throws errors like:

2016/07/05 11:32:56 [emerg] 13#13: PEM_read_bio_X509("/etc/nginx-ssl/xxx-yyy") failed (SSL: error:0906D066:PEM routines:PEM_read_bio:bad end line)

After this happens any rc/pod/deployment restart that results in pods getting new IPs will make the entire upstream unavailable (503 on nginx) as the aborted reload renders it unaware of those changes.

Manual inspection of nginx.conf shows the upstream does have the new ips, that the certificates and keys are correct.

A second manual reload solves the issue.

This is caused by writing certificates/keys using ioutil (async) while nginx is trying to read them. I have a working and verified patch that changes it to tempfiles and atomic renames.

The text was updated successfully, but these errors were encountered:

aledbf · 2016-07-05T20:05:19Z

@fcvarela do you have some step to replicate this error?
(I'm trying to get this error in order to test your PR)

fcvarela · 2016-07-05T20:55:18Z

@aledbf Unfortunately no, it requires setting up a few hundred ingress rules spanning a few dozen hosts all using tls secrets. I set logging level to --v=5 and noticed the nginx error interleaved with logging from ssl.go updating pem files from secrets, indicating the nginx reload signal was triggered while files were being written in a non atomic way, hence the rename from tmp instead of asynchronous write in place.

aledbf · 2016-07-06T11:45:55Z

@fcvarela how many certificates are you trying to use?
I've created this script that creates 1000 certificates and ingress rules and I cannot reproduce this issue

fcvarela · 2016-07-06T11:51:27Z

Not that many, we are using 12 certificates covering a ~200 hosts (one of the certs is a wildcard) and ~500 ingress rules. My impression is that once the pem files are generated all will work fine. But if you have continuous activity generating ingress rules and adding hosts or containers crash-looping (and therefore registering/unregistering endpoints as upstream servers), then the pem generation will run continuously.

This problem happens when nginx is reloading from a previous iteration and new pem files are being written over existing ones (same content). You should see the problem occurring if you keep the ingress generation loop going indefinitely or if you crashloop the destination backend.

In any case, there is no locking in place to prevent nginx from reloading while pem files are being written, so they should be written atomically. That guarantees nginx will always "see" a complete pem file.

fcvarela · 2016-07-06T12:11:08Z

Here's the error output (--v=5)
I0706 12:07:18.444747 1 controller.go:1036] endpoints found: [{10.244.2.8 8080 0 0}]
I0706 12:07:18.451327 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
I0706 12:07:18.453132 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
2016/07/06 12:07:18 [emerg] 13#13: PEM_read_bio_X509_AUX("/etc/nginx-ssl/NAMESPACE-SECRETNAME.pem") failed (SSL: error:0906D06C:PEM routines:PEM_read_bio:no start line:Expecting: TRUSTED CERTIFICATE)
I0706 12:07:18.453897 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2
I0706 12:07:18.454973 1 ssl.go:99] DNS [*.XXX.net *.XXX.net] 2

A few seconds later, a manual ngins -s reload works fine. That certificate was being written by ssl.go during the previous nginx -s reload.

The patched version is running on our cluster since I submitted the PR and we haven't seen that happen again.

Addresses #1314 [nginx-ingress-controller ssl nginx reload abort]

Removed comment to be consistent w/ rest of code Fixes typo and string concat

Addresses kubernetes-retired#1314 [nginx-ingress-controller ssl nginx reload abort]

bprashanth closed this as completed in c83c272 Jul 6, 2016

bprashanth added a commit that referenced this issue Jul 6, 2016

Merge pull request #1315 from fcvarela/master

a954aad

Addresses #1314 [nginx-ingress-controller ssl nginx reload abort]

aledbf pushed a commit to aledbf/contrib that referenced this issue Nov 10, 2016

Fixes kubernetes-retired#1314

9285335

Removed comment to be consistent w/ rest of code Fixes typo and string concat

aledbf pushed a commit to aledbf/contrib that referenced this issue Nov 10, 2016

Merge pull request kubernetes-retired#1315 from fcvarela/master

a7570a8

Addresses kubernetes-retired#1314 [nginx-ingress-controller ssl nginx reload abort]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

fcvarela commented Jul 5, 2016 •

edited

Loading

aledbf commented Jul 5, 2016

fcvarela commented Jul 5, 2016 •

edited

Loading

aledbf commented Jul 6, 2016 •

edited

Loading

fcvarela commented Jul 6, 2016 •

edited

Loading

fcvarela commented Jul 6, 2016 •

edited

Loading

[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

[nginx-ingress-controller] Nginx aborts reloads w/ certificate errors #1314

Comments

fcvarela commented Jul 5, 2016 • edited Loading

aledbf commented Jul 5, 2016

fcvarela commented Jul 5, 2016 • edited Loading

aledbf commented Jul 6, 2016 • edited Loading

fcvarela commented Jul 6, 2016 • edited Loading

fcvarela commented Jul 6, 2016 • edited Loading

fcvarela commented Jul 5, 2016 •

edited

Loading

fcvarela commented Jul 5, 2016 •

edited

Loading

aledbf commented Jul 6, 2016 •

edited

Loading

fcvarela commented Jul 6, 2016 •

edited

Loading

fcvarela commented Jul 6, 2016 •

edited

Loading