Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVC for Traefik Ingress (prevent LetsEncrypt throttling) #2352

Merged
merged 3 commits into from
Mar 25, 2024

Conversation

kenafoster
Copy link
Contributor

Reference Issues or PRs

What does this implement/fix?

This is an interim fix for #2174 . It sets up a PVC for the traefik ingress pod so that the cert is on persistent storage, meaning that as new pods get recreated, they will not request new certs (which leads to the throttling problem mentioned in the issue).

The "best fix" is cert-manager or another service will handle all these requests. That's being worked on - ref #2336

In AWS, the storage class it uses is EBS-backed and therefore specific to an AZ, which means that this is not truly HA / fault tolerant (the pod will always schedule and launch on a node in the same AZ). However, there are many resources in the kubernetes_services stage that follow this same pattern so this is not introducing a new limitation, and is only intended as an interim fix anyway.

Note I have ONLY tested this in AWS. Need assistance running in other clouds. Specifically, I am not sure that the access_mode will be universal because it is relying on Kubernetes to determine the default storage class and create the volume. This seems to work OK for helm charts that do this within the kubernetes_services stage but this resource gets created in a prior stage, so there could be some dependency/configuration that I am missing.

Put a x in the boxes that apply

  • [ X ] Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • [X ] Did you test the pull request locally?
    Local and AWS cloud
  • Did you add new tests?

Any other comments?

@dcmcand
Copy link
Contributor

dcmcand commented Mar 20, 2024

@marcelovilla @viniciusdc @Adam-D-Lewis whoever has time, can we test this on azure and gcp for @tylergraff ?

@viniciusdc
Copy link
Contributor

Thanks, @kenafoster, for the PR; this looks great. Based on what I see from the local tests and the structure of the PVC, I don't see a problem occurring from the other cloud providers. But I will test that out

@viniciusdc viniciusdc added this to the 2024.3.3 milestone Mar 23, 2024
@viniciusdc
Copy link
Contributor

viniciusdc commented Mar 25, 2024

I just tested this on GCP, and it works as expected; the access level is being handled correctly; my only issue is that if the user needs to update its domain, then we need to manually delete the certs.json file from the PVC, as Traefikdoes not seem to be doing it right now. (before this problem was mitigated with the absence of the file during pod init)

Though this could be easily documented, and a kubectl command to shorten this process can be included as well

@viniciusdc
Copy link
Contributor

I've created an issue in our docs repo to track the details outlined above, I will merge the PR as it is working and address the original issue.

Copy link
Contributor

@viniciusdc viniciusdc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viniciusdc viniciusdc merged commit ac9ecbc into nebari-dev:develop Mar 25, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

3 participants