refresh GCP tokens if <55 mins of life left #72
Conversation
PTAL @mikedanese
Codecov Report

@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
+ Coverage   93.51%   93.55%   +0.03%
==========================================
  Files          11       11
  Lines         972      977       +5
==========================================
+ Hits          909      914       +5
  Misses         63       63

Continue to review full report at Codecov.
config/kube_config.py (outdated)

@@ -32,7 +32,7 @@
from .config_exception import ConfigException
from .dateutil import UTC, format_rfc3339, parse_rfc3339

EXPIRY_SKEW_PREVENTION_DELAY = datetime.timedelta(minutes=5)
MINIMUM_TOKEN_TIME_REMAINING = datetime.timedelta(minutes=55)
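For context, the intent behind the new constant is roughly the check sketched below. This is a simplified illustration, not the actual kube_config logic; the helper name and the naive UTC handling are assumptions.

```python
# Sketch: decide whether a cached GCP token should be refreshed because it has
# less than ~55 minutes of validity left. Helper name is illustrative only.
import datetime

MINIMUM_TOKEN_TIME_REMAINING = datetime.timedelta(minutes=55)

def _gcp_token_needs_refresh(token_expiry_utc):
    """Return True if the token expires within MINIMUM_TOKEN_TIME_REMAINING."""
    remaining = token_expiry_utc - datetime.datetime.utcnow()
    return remaining < MINIMUM_TOKEN_TIME_REMAINING
```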
What does the metadata server do? If the metadata server also caches tokens, but refreshes them 10 seconds before expiry, then will this call the metadata server on every request for about 55 minutes?
Also can you explain the bug since I'm a python noob?
accounts.google.com provisions tokens with a max validity duration of 1 hour. I think the GCE metadata server provisions tokens valid for 30 minutes.
I think this would cause constant token refreshing if the token validity life is <55 minutes. Also, this python client is for generic use, not limited to GCP.
> What does the metadata server do? If the metadata server also caches tokens, but refreshes them 10 seconds before expiry, then will this call the metadata server on every request for about 55 minutes?
@mikedanese IIRC the metadata server lazily mints new tokens when they've expired, but I'm definitely not an expert.
> Also can you explain the bug since I'm a python noob?
Not really a 'bug' per se; it's just that long-running operations can very easily result in 401s/403s due to expired tokens. For example, by default, dockerd only pulls 3 image layers in parallel, meaning that especially large images being downloaded over real-world networks can fail halfway through.
> I think this would cause constant token refreshing if the token validity life is <55 minutes. Also, this python client is for generic use, not limited to GCP.
@yliaog So only apply this minimum freshness to the GCP tokens? Works for me.
I guess you can make it a parameter instead of a const, so you can set it to the most appropriate value for your use case (for GCP?).
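A parameterized version might look like the sketch below. The class and argument names here are hypothetical, not part of the existing loader API; only the default value comes from this PR.

```python
# Sketch of the "make the threshold a parameter" idea; hypothetical class.
import datetime

DEFAULT_MIN_TOKEN_TIME_REMAINING = datetime.timedelta(minutes=55)

class TokenFreshnessPolicy:
    def __init__(self, min_time_remaining=DEFAULT_MIN_TOKEN_TIME_REMAINING):
        # Callers whose token source issues short-lived tokens can pass a
        # smaller value to avoid refreshing on every request.
        self.min_time_remaining = min_time_remaining

    def needs_refresh(self, expiry_utc):
        return expiry_utc - datetime.datetime.utcnow() < self.min_time_remaining
```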
@yliaog It doesn't need to be shared between GCP and other credentials, I don't think, since the 1h lifespan on GCP tokens is unlikely to change. If the existing logic suffices for other credentials, I can limit this change to GCP credentials.
Also, it seems like clients of this lib won't necessarily know a priori which token source will be used: https://github.com/kubernetes-client/python-base/blob/master/config/kube_config.py#L179
Seems like this issue might be sufficiently addressed by limiting the change to the GCP token logic. There might be a separate task of factoring out the Authenticator logic into a plugin interface.
This method won't work well for the GCE compute metadata source where there is an extra layer of caching. This will result in a call to the metadata server per request.
> This method won't work well for the GCE compute metadata source where there is an extra layer of caching. This will result in a call to the metadata server per request.
Doesn't seem like it's possible to fully mitigate the issue, then, since AFAIK we can't coerce a refresh. We'd just have to hope that one retry is sufficient to pick up a token with enough lifespan left to complete a request.
From the docs at https://cloud.google.com/compute/docs/access/create-enable-service-accounts-for-instances#applications:
> The metadata server caches access tokens until they have 60 seconds of remaining time before they expire.
Looks like the only 'good' mitigation would be to reload the kube-config and retry requests on 401/403?
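The retry-on-401/403 idea could be approximated with a wrapper like the one below. This is a sketch only: reload_config is a stand-in for whatever re-reads the kubeconfig and refreshes credentials (not an existing helper), and it assumes the generated client's ApiException, which carries an HTTP status code.

```python
# Sketch: retry an API call once after reloading credentials on auth errors.
from kubernetes.client.rest import ApiException

def call_with_auth_retry(api_call, reload_config, retries=1):
    for attempt in range(retries + 1):
        try:
            return api_call()
        except ApiException as e:
            if e.status in (401, 403) and attempt < retries:
                reload_config()  # refresh the token, then retry once
                continue
            raise
```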
This doesn't solve the problem for long-running applications. I suggest using a thread to refresh the token periodically in the background.
@tomplus this isn't intended to completely solve the problem, just reduce the likelihood that it will be encountered.
What would be the periodicity? Would automatically refreshing & retrying on 401/403 be insufficient?
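For concreteness, @tomplus's background-refresh suggestion might look roughly like this; refresh_token and the 5-minute interval are placeholders, not part of this library.

```python
# Sketch: daemon thread that periodically invokes a token-refresh callable.
import threading

def start_token_refresher(refresh_token, interval_seconds=300):
    stop = threading.Event()

    def _loop():
        # Event.wait returns False on timeout, True once stop.set() is called.
        while not stop.wait(interval_seconds):
            try:
                refresh_token()
            except Exception:
                pass  # swallow and retry on the next tick; real code should log

    threading.Thread(target=_loop, daemon=True).start()
    return stop  # call stop.set() to shut the refresher down
```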
Signed-off-by: Jake Sanders <jsand@google.com>
Agree with @yliaog. I think this mitigation is specific to GCP currently, and it could be an optional parameter set to the most appropriate value.
@dekkagaijin Refreshing on API invocation error (401/403) would be better, but it requires wrapping the generated API calls. Ref similar problem in OIDC token refresh: kubernetes-client/python#492
@@ -32,7 +32,8 @@
from .config_exception import ConfigException
from .dateutil import UTC, format_rfc3339, parse_rfc3339

EXPIRY_SKEW_PREVENTION_DELAY = datetime.timedelta(minutes=5)
EXPIRY_TIME_SKEW = datetime.timedelta(minutes=5)
nit: EXPIRY_SKEW_PREVENTION_DELAY sounds clearer to me in explaining the purpose of the timedelta. Maybe document these constants and update the comment at python-base/config/kube_config_test.py line 36 (78472de):
# should be less than kube_config.EXPIRY_SKEW_PREVENTION_DELAY
Can we change this so that we don't have a threshold, and just refresh when the token is expired (or make the skew something like 5 seconds)? That seems better than what we do now and doesn't need to track the metadata server behavior.
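The alternative proposed here (no large freshness window, refresh only near actual expiry) would reduce to something like this sketch, using a few seconds of skew; the function name and skew value are illustrative.

```python
# Sketch: treat the token as expired once it is within a few seconds of expiry,
# so a refresh only happens right at the end of its lifetime.
import datetime

EXPIRY_SKEW = datetime.timedelta(seconds=5)

def is_expired(expiry_utc):
    return expiry_utc <= datetime.datetime.utcnow() + EXPIRY_SKEW
```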
Bump, got bit by this today. Any idea what needs to happen to move the needle on this?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Partially mitigates #59
Signed-off-by: Jake Sanders jsand@google.com