This repository was archived by the owner on Mar 13, 2022. It is now read-only.

refresh GCP tokens if <55 mins of life left #72

Closed
wants to merge 1 commit

Conversation

dekkagaijin

Partially mitigates #59

Signed-off-by: Jake Sanders <jsand@google.com>

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 3, 2018
@dekkagaijin
Author

PTAL @mikedanese

@codecov-io

codecov-io commented Jul 3, 2018

Codecov Report

Merging #72 into master will increase coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   93.51%   93.55%   +0.03%     
==========================================
  Files          11       11              
  Lines         972      977       +5     
==========================================
+ Hits          909      914       +5     
  Misses         63       63
Impacted Files | Coverage Δ
config/kube_config.py | 89.96% <100%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78472de...7978ef8. Read the comment docs.

@@ -32,7 +32,7 @@
 from .config_exception import ConfigException
 from .dateutil import UTC, format_rfc3339, parse_rfc3339

-EXPIRY_SKEW_PREVENTION_DELAY = datetime.timedelta(minutes=5)
+MINIMUM_TOKEN_TIME_REMAINING = datetime.timedelta(minutes=55)
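A minimal sketch of the check this constant implies, assuming a hypothetical _token_needs_refresh() helper (the surrounding kube_config logic is not shown here):

import datetime

MINIMUM_TOKEN_TIME_REMAINING = datetime.timedelta(minutes=55)

def _token_needs_refresh(expiry, now=None):
    # Refresh eagerly once less than 55 minutes of life remain, rather than
    # only once the token crosses a short pre-expiry skew window.
    now = now or datetime.datetime.utcnow()
    return expiry - now < MINIMUM_TOKEN_TIME_REMAINING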

@mikedanese mikedanese Jul 3, 2018


What does the metadata server do? If the metadata server also caches tokens, but refreshes them 10 seconds before expiry, then will this call the metadata server on every request for about 55 minutes?

Also can you explain the bug since I'm a python noob?


accounts.google.com provisions tokens with a max validity duration of 1 hour. I think the GCE metadata server provisions tokens valid for 30 minutes.

Contributor

@yliaog yliaog Jul 3, 2018


I think this would cause constant token refreshing if the token's validity lifetime is <55 minutes. Also, this Python client is for generic use, not limited to GCP.

Author


What does the metadata server do? If the metadata server also caches tokens, but refreshes them 10 seconds before expiry, then will this call the metadata server on every request for about 55 minutes?

@mikedanese IIRC the metadata server lazily mints new tokens when they've expired, but I'm definitely not an expert

Also can you explain the bug since I'm a python noob?

Not really a 'bug' per se; it's just that long-running operations can very easily result in 401s/403s due to expired tokens. For example, by default, dockerd only pulls 3 image layers in parallel, meaning that especially large images being downloaded over real-world networks can fail halfway through.

I think this would cause constant token refreshing if the token's validity lifetime is <55 minutes. Also, this Python client is for generic use, not limited to GCP.

@yliaog So only apply this minimum freshness to the GCP tokens? Works for me.

Contributor


I guess you can make it a parameter instead of a const, so you can set it to the most appropriate value for your use case (for GCP?).
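A rough sketch of that parameter idea, with hypothetical names (the real loader's signature may differ):

import datetime

class TokenRefresher(object):
    def __init__(self, refresh_fn,
                 minimum_time_remaining=datetime.timedelta(minutes=5)):
        # Generic callers keep the default; GCP callers could pass
        # datetime.timedelta(minutes=55) to match the 1h token lifespan.
        self._refresh_fn = refresh_fn
        self._minimum_time_remaining = minimum_time_remaining
        self._token = None
        self._expiry = datetime.datetime.min

    def get_token(self):
        remaining = self._expiry - datetime.datetime.utcnow()
        if self._token is None or remaining < self._minimum_time_remaining:
            self._token, self._expiry = self._refresh_fn()
        return self._token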

Author


@yliaog I don't think it needs to be shared between GCP and other credentials, since the 1h lifespan on GCP tokens is unlikely to change. If the existing logic suffices for other credentials, I can limit this change to GCP credentials.

Author


Also, it seems like clients of this lib won't necessarily know a priori which token source will be used: https://github.com/kubernetes-client/python-base/blob/master/config/kube_config.py#L179

Seems like this issue might be sufficiently addressed by limiting the change to the GCP token logic. There might be a separate task of factoring out the Authenticator logic into a plugin interface.


This method won't work well for the GCE compute metadata source where there is an extra layer of caching. This will result in a call to the metadata server per request.

Author


This method won't work well for the GCE compute metadata source where there is an extra layer of caching. This will result in a call to the metadata server per request.

Doesn't seem like it's possible to fully mitigate the issue, then, since AFAIK we can't coerce a refresh. We'd just have to hope that one retry is sufficient to pick up a token with enough lifespan left to complete a request.
From the docs at https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#applications:

The metadata server caches access tokens until they have 60 seconds of remaining time before they expire.

Looks like the only 'good' mitigation would be to reload the kube-config and retry requests on 401/403?
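Roughly what that retry would look like, assuming the caller goes through the generated kubernetes.client API and has some reload_credentials() hook (hypothetical) that re-loads the kube-config:

from kubernetes.client.rest import ApiException

def call_with_reauth(api_call, reload_credentials, *args, **kwargs):
    # Run the call once; on a 401/403, reload credentials and retry once.
    try:
        return api_call(*args, **kwargs)
    except ApiException as e:
        if e.status in (401, 403):
            reload_credentials()
            return api_call(*args, **kwargs)
        raise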

@mikedanese

cc @mbohlool @yliaog

@tomplus
Member

tomplus commented Jul 3, 2018

This doesn't solve the problem for long-lived applications. I suggest using a thread to refresh the token periodically in the background.
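For illustration, a minimal sketch of that background-thread approach; refresh_token() and the 15-minute interval are placeholders, not something this PR implements:

import threading

def start_token_refresher(refresh_token, interval_seconds=15 * 60):
    # Call refresh_token() periodically on a daemon thread until stopped.
    stop = threading.Event()

    def _loop():
        while not stop.wait(interval_seconds):
            refresh_token()

    thread = threading.Thread(target=_loop)
    thread.daemon = True
    thread.start()
    return stop  # call stop.set() to shut the refresher down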

@dekkagaijin
Author

@tomplus this isn't intended to completely solve the problem, just reduce the likelihood that it will be encountered.

I suggest using a thread to refresh the token periodically in the background.

What would be the periodicity?

Would automatically refreshing & retrying on 401/403 be insufficient?

Signed-off-by: Jake Sanders <jsand@google.com>
Member

@roycaihw roycaihw left a comment


Agree with @yliaog. I think this mitigation is currently specific to GCP, and it could be an optional parameter set to the most appropriate value.

@dekkagaijin Refreshing on API invocation errors (401/403) would be better, but it requires wrapping the generated API calls. Ref a similar problem in OIDC token refresh: kubernetes-client/python#492

@@ -32,7 +32,8 @@
 from .config_exception import ConfigException
 from .dateutil import UTC, format_rfc3339, parse_rfc3339

-EXPIRY_SKEW_PREVENTION_DELAY = datetime.timedelta(minutes=5)
+EXPIRY_TIME_SKEW = datetime.timedelta(minutes=5)
Member


nit: EXPIRY_SKEW_PREVENTION_DELAY sounds clearer to me in explaining the purpose of the timedelta. Maybe document these constants, and update the existing comment "# should be less than kube_config.EXPIRY_SKEW_PREVENTION_DELAY" if we want to change the name.

@mikedanese

Can we change this so that we don't have a threshold and just refresh when the token is expired (or make the skew something like 5 seconds)? That seems better than what we do now, and it doesn't require tracking the metadata server's behavior.
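That simpler policy might look roughly like this (sketch only; refresh() is a hypothetical helper returning a fresh token and its expiry):

import datetime

EXPIRY_SKEW = datetime.timedelta(seconds=5)

def get_token(cached_token, expiry, refresh):
    # Only refresh once the token is within a few seconds of expiring,
    # instead of second-guessing how long the token source caches it.
    if cached_token is None or expiry - datetime.datetime.utcnow() <= EXPIRY_SKEW:
        cached_token, expiry = refresh()
    return cached_token, expiry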

@ffledgling

Bump, got bit by this today. Any idea what needs to happen to move the needle on this?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
