Lua land OCSP stapling improvements #5397

Closed · 7 tasks
ElvinEfendi opened this issue Apr 18, 2020 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ElvinEfendi
Member

ElvinEfendi commented Apr 18, 2020

Since #5133 and #5387, ingress-nginx supports OCSP stapling in Lua land (as a replacement for vanilla Nginx's SSL stapling).

In this issue I'd like us to collect user feedback and track a list of things to do to improve the OCSP support further:

  • Eagerly fetch OCSP responses for certificates to make sure the first request can be stapled
  • Improve documentation
  • Cache better using nextUpdate attribute. This is blocked on OCSP nextupdate openresty/lua-resty-core#296
  • Use a lock in fetch_and_cache_ocsp_response to avoid a scenario where multiple Nginx workers send requests to the OCSP responder for the same certificate (see the sketch after this list)
  • Do negative caching in fetch_and_cache_ocsp_response. We might want to differentiate negative caching between validation failures and HTTP request failures - I think each should be handled differently. HTTP request failures are likely to be intermittent and should be cached short term, while validation failures are long term and should be negatively cached for longer. We could go even further and differentiate between validation failure errors - for example, validation fails when the status is "revoked"; in those cases we should cache forever, right? In case the status is "unauthorized" though, I'm not sure - maybe we can cache forever again, because it means the certificate is borked and has an incorrect OCSP responder URL?
  • The implementation, contrary to vanilla Nginx, does not provide any alternative to ssl_stapling_verify and always passes the OCSP_NOVERIFY flag when calling OCSP_basic_verify. There's also a test case for that: https://github.com/openresty/lua-resty-core/blob/1bd74f8f502a9dc0a7099110d0ee184187698d01/t/ocsp.t#L1120. https://www.openssl.org/docs/man1.1.1/man3/OCSP_basic_verify.html explains this further. My understanding is that by passing OCSP_NOVERIFY we effectively skip verification of the certificate that was used to sign the OCSP response; in other words, the OCSP response could have been served by an attacker and we would not know. But is this really an issue for OCSP stapling? The browser should do full verification upon receiving the response, no?
  • tbd
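
A minimal sketch of the locking item above, assuming lua-resty-lock is available and an extra lua_shared_dict ocsp_locks is configured; fetch_ocsp_from_responder and expiry_for are hypothetical placeholders for the actual fetch and expiry logic, not existing ingress-nginx functions:

local resty_lock = require("resty.lock")

local function fetch_and_cache_ocsp_response_locked(uid, der_cert)
  local cache = ngx.shared.ocsp_response_cache

  -- fast path: another worker may already have cached the response
  local cached = cache:get(uid)
  if cached then
    return cached
  end

  local lock, err = resty_lock:new("ocsp_locks", { timeout = 5 })
  if not lock then
    return nil, "failed to create lock: " .. tostring(err)
  end

  local elapsed, lock_err = lock:lock("ocsp:" .. uid)
  if not elapsed then
    return nil, "failed to acquire lock: " .. tostring(lock_err)
  end

  -- re-check after acquiring the lock: the worker that held it before
  -- us may have populated the cache in the meantime
  cached = cache:get(uid)
  if cached then
    lock:unlock()
    return cached
  end

  -- hypothetical helpers standing in for the real fetch/expiry logic
  local ocsp_response, fetch_err = fetch_ocsp_from_responder(der_cert)
  if ocsp_response then
    cache:set(uid, ocsp_response, expiry_for(ocsp_response))
  end

  lock:unlock()
  return ocsp_response, fetch_err
end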
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 18, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 17, 2020
@wenzong
Contributor

wenzong commented Sep 16, 2020

Hi @ElvinEfendi

I am using quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.33.0 and have enabled OCSP stapling with enable-ocsp: "true" in cm/nginx-configuration.

I am also using cert-manager with a letsencrypt issuer, and I put the cert-manager.io/cluster-issuer: "letsencrypt" annotation on the Ingress.

Everything worked as expected at the beginning: cert-manager generated a certificate for my domain, and the ingress controller fetched an OCSP response and did the OCSP stapling for my domain.


The problem is that when cert-manager renews the certificate, the ingress controller keeps using the old OCSP response.

Let's say the certificate's validity is Not Before: Sep 16 00:00:00 2020 GMT, but the OCSP response is:

    Produced At: Sep 13 01:00:00 2020 GMT
    Cert Status: good
    This Update: Sep 13 01:00:00 2020 GMT
    Next Update: Sep 20 01:00:00 2020 GMT

I guess that would cause an inconsistency between the certificate and the OCSP response?

  • (correct me if I am wrong)
  • (I lost the environment and can't reproduce it on short notice, so I can't compare the certificate serial number between the cert and the OCSP response)

From what I have on hand, browsers and iOS native apps ignore that inconsistency, but Android complains with Caused by: java.security.cert.CertPathValidatorException: OCSP response does not include a response for a certificate supplied in the OCSP request. (For background, I am using gRPC on Android, which uses Android Java underneath.)


Maybe the ingress controller should expire the old OCSP response when a new certificate is generated? Or is it cert-manager's duty to notify about certificate renewals?

I will try to understand rootfs/etc/nginx/lua/certificate.lua's behavior, but I figured it's better to put the information here first and let people discuss it.

@ElvinEfendi
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 16, 2020
@ElvinEfendi
Member Author

ElvinEfendi commented Sep 16, 2020

@wenzong thanks for the report. I think that's because we use the k8s secret's UID as a cache key, and when cert-manager updates the certificate, the corresponding secret's cache key remains unchanged.

If you trace back uid from

success, err, forcible = ocsp_response_cache:set(uid, ocsp_response, expiry)
you will see what I'm talking about.

I think the fix would be to use the certificate's serial number as the cache key. Or, if Lua does not expose any API to get the serial number, we can simply md5 hash the certificate payload itself and use that as the cache key.
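
A rough illustration of the md5 idea, not the actual ingress-nginx code (ngx.md5 is available in OpenResty; pem_cert_chain stands for the PEM payload already stored in ngx.shared.certificate_data):

-- derive the OCSP cache key from the certificate payload instead of the
-- secret UID, so a renewed certificate automatically gets a new key
local function ocsp_cache_key(pem_cert_chain)
  return ngx.md5(pem_cert_chain)
end

-- used in place of the secret UID:
-- local key = ocsp_cache_key(pem_cert_chain)
-- success, err, forcible = ocsp_response_cache:set(key, ocsp_response, expiry)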

I'd be happy to guide you if you want to take a stab at this.

@wenzong
Contributor

wenzong commented Sep 17, 2020

@wenzong thanks for the report. I think that's because we use the k8s secret's UID as a cache key, and when cert-manager updates the certificate, the corresponding secret's cache key remains unchanged.

If you trace back uid from

success, err, forcible = ocsp_response_cache:set(uid, ocsp_response, expiry)

you will see what I'm talking about.
I think the fix would be to use the certificate's serial number as the cache key. Or, if Lua does not expose any API to get the serial number, we can simply md5 hash the certificate payload itself and use that as the cache key.

I'd be happy to guide you if you want to take a stab at this.

Thanks for your reply.

I see, so the ingress controller watches secret changes and posts certificate information via /configuration/servers:

statusCode, _, err := nginx.NewPostStatusRequest("/configuration/servers", "application/json", configuration)

Lua receives it in

local success, set_err, forcible = certificate_servers:set(server, uid)
and saves it into an ngx.shared.DICT.

There are three relationships:

  • POST by ingress-controller
    • server name -> UID: ngx.shared.certificate_servers
    • UID -> certificate: ngx.shared.certificate_data
  • Fetch And Cache
    • UID -> ocsp response: ngx.shared.ocsp_response_cache

An incoming request uses the hostname (server name) to look up the certificate to serve, and fetch_and_cache_ocsp_response is called to staple the OCSP response.
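
A simplified sketch of that lookup chain using the shared dicts named above (the real logic lives in rootfs/etc/nginx/lua/certificate.lua):

local function certificate_for_request(host)
  -- server name -> UID
  local uid = ngx.shared.certificate_servers:get(host)
  if not uid then
    return nil
  end

  -- UID -> PEM certificate payload
  local pem_cert = ngx.shared.certificate_data:get(uid)

  -- UID -> cached OCSP response (may be stale after a renewal,
  -- which is the problem discussed in this issue)
  local ocsp_response = ngx.shared.ocsp_response_cache:get(uid)

  return pem_cert, ocsp_response
end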


I guess it's better to change the UID to something globally unique per certificate? A serial number is only unique within one specific CA. The poster suggests using either issuer name + serial number or the certificate's thumbprint; the latter is actually the SHA-1 hash of the DER encoding of the certificate.

I noticed that there is a PemSHA

PemSHA string `json:"pemSha"`
in the SSLCert struct. Maybe we can use that instead of the UID; then the relationships would look like:

  • server name -> PemSHA: ngx.shared.certificate_servers
  • PemSHA -> certificate: ngx.shared.certificate_data
  • PemSHA -> ocsp response: ngx.shared.ocsp_response_cache

Any thoughts?

@ElvinEfendi
Member Author

in the SSLCert struct. Maybe we can use that instead of the UID; then the relationships would look like:

The problem with this is that we would pollute ngx.shared.certificate_data: every certificate renewal would create a new entry in the dictionary without deleting the previous one. The benefit of using the UID is that it remains unchanged, so we end up doing an update instead of an insert/add when a certificate is renewed.

--

Maybe in

local success, set_err, forcible = certificate_data:set(uid, cert)
you can check whether the certificate has changed, and if so bust the OCSP cache entry for the uid under consideration?
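
Something along these lines, as a sketch of the idea rather than the final implementation:

-- when storing a certificate, compare it with the previously stored
-- payload and drop the stale OCSP response if it changed
local function set_certificate(uid, cert)
  local certificate_data = ngx.shared.certificate_data
  local ocsp_response_cache = ngx.shared.ocsp_response_cache

  local old_cert = certificate_data:get(uid)
  if old_cert and old_cert ~= cert then
    -- certificate was renewed: invalidate the cached OCSP response
    ocsp_response_cache:delete(uid)
  end

  return certificate_data:set(uid, cert)
end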

@wenzong
Contributor

wenzong commented Sep 18, 2020

in the SSLCert struct. Maybe we can use that instead of the UID; then the relationships would look like:

The problem with this is that we would pollute ngx.shared.certificate_data: every certificate renewal would create a new entry in the dictionary without deleting the previous one. The benefit of using the UID is that it remains unchanged, so we end up doing an update instead of an insert/add when a certificate is renewed.

--

Maybe in

local success, set_err, forcible = certificate_data:set(uid, cert)

you can check whether the certificate has changed, and if so bust the OCSP cache entry for the uid under consideration?

You are right, expiring the cache is more reasonable.

@wenzong
Contributor

wenzong commented Sep 18, 2020

@ElvinEfendi I just rethought the implementation and realized that there is a potential concurrency problem.

So in /configuration/servers:

  • after ocsp_response_cache:delete(uid)
  • before certificate_data:set(uid)

Since the two steps are not an atomic operation, there is a small chance that some Nginx worker misses the cache, reads the old certificate, and then fetches and caches the old ocsp_response again.

I will try to solve it tomorrow.


The PR is #6200
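
For context, one way to shrink that window is to store the new certificate before invalidating the OCSP cache, so a worker that misses the OCSP cache can only re-fetch for the new certificate. This is only a sketch of the idea; the actual change is whatever landed in the PR above:

local function update_certificate(uid, new_cert)
  -- set the new certificate first...
  local success, set_err = ngx.shared.certificate_data:set(uid, new_cert)
  if not success then
    return nil, set_err
  end

  -- ...then drop the stale OCSP response; a worker that misses the OCSP
  -- cache after this point reads the new certificate and fetches a
  -- matching response instead of re-caching one for the old certificate
  ngx.shared.ocsp_response_cache:delete(uid)
  return true
end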

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 18, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 17, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
