Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let lifetime watcher retry until end of lease #11008

Conversation

andrejvanderzee
Copy link
Contributor

@andrejvanderzee andrejvanderzee commented Feb 25, 2021

This pull request retries renewals on error until the end of the lease (minus grace), instead of abruptly restarting the lifetime watcher (and revoking tokens/secrets as a result).

The problem is described in detail here: #10918.

@ncabatoff

@andrejvanderzee andrejvanderzee force-pushed the retry_until_end_of_lease branch from 4d529ea to 70e9d61 Compare February 26, 2021 14:30
@vercel vercel bot temporarily deployed to Preview – vault-storybook February 26, 2021 14:30 Inactive
@vercel vercel bot temporarily deployed to Preview – vault February 26, 2021 14:30 Inactive
@andrejvanderzee
Copy link
Contributor Author

@ncabatoff Any change that someone can look at this? We do have a production problem and our current workaround is a rolling restart of all our k8s production deployments every week, just to make sure we do not run into this vault-agent sidecar edge-case. This MR makes vault agents more resilient against temporary problems not being able to renew leases of tokens and secrets.

@andrejvanderzee
Copy link
Contributor Author

@ncabatoff @jefferai

Can someone please take a look at this Vault Agent resilience issue? Our workaround now is doing rolling upgrades of all our Kubernetes deployments when it nears the TTL of the vault token. Insane.

The current implementation of func (r *LifetimeWatcher) doRenew() error is just not OK, I hope you can acknowledge that?

I think the whole doRenew() requires a proper refactoring, but this PR at least makes Vault Agent more resilient with a patch on the error/problem-path only, which causes vault token and secrets to be unnecessarily renewed to soon!

@andrejvanderzee
Copy link
Contributor Author

@ncabatoff @jefferai

Ping...

@andrejvanderzee
Copy link
Contributor Author

@ncabatoff @jefferai

Any chance that this will be picked up?

@ncabatoff
Copy link
Collaborator

Hey @andrejvanderzee,

I'm sorry to have left you hanging for so long. I've added a story to the current sprint to tackle this issue. I think this PR is on the right track, though I would like to see some tests. Maybe factor out the first bit of doRenew so that we can provide our own artificial renewFunc when testing. Would you like to work on those, or shall I?

@ncabatoff ncabatoff added this to the 1.7.2 milestone Apr 21, 2021
@andrejvanderzee
Copy link
Contributor Author

andrejvanderzee commented Apr 22, 2021

Hi @ncabatoff, no problem at all, we are just very happy that it is being picked up now, as it occasionally is causing problems in our prd environments. I won't have much time, so if you could add the tests that would be wonderful :-)

As a side note: To mitigate the problem, we put VAULT_MAX_TRIES on 12. We have noticed that every common error will end up in a "context deadline exceeded", because retrying is handled inside the Vault client and after 12 retries you exceed VAULT_CLIENT_TIMEOUT. It would be nice if somehow we could see the real error.

Thanks for putting this in your sprint :-)

@ncabatoff
Copy link
Collaborator

Closing this in favour of #11445.

@ncabatoff ncabatoff closed this Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants