
[Object_store] min_ttl is too high for GKE tokens #6625

Closed
mwylde opened this issue Oct 24, 2024 · 5 comments · Fixed by #6638

@mwylde
Contributor

mwylde commented Oct 24, 2024

Describe the bug

When using object_store on a GKE pod with workload credentials, we see a huge volume of requests to the metadata endpoint to refresh the token (this shows up in the logs as a stream of "fetching token from metadata server" lines, 1-2 ms apart). This can overload the metadata service and prevent the service from doing any further work.

This is caused by the implementation of the TokenCache:

if let Some(cached) = locked.as_ref() {
    match cached.expiry {
        // Reuse the cached token only while it is valid for longer than min_ttl
        Some(ttl) if ttl.checked_duration_since(now).unwrap_or_default() > self.min_ttl => {
            return Ok(cached.token.clone());
        }
        None => return Ok(cached.token.clone()),
        // Near expiry: fall through and refetch
        _ => (),
    }
}
let cached = f().await?;
let token = cached.token.clone();
*locked = Some(cached);

The token cache is supposed to prevent multiple requests to fetch the token by reusing a cached token. However, if the cached token is close to expiry (within min_ttl), it attempts to refresh it and then stores the new token in the cache.

In this case, what's happening is that the GKE metadata service returns the same token for every call up until ~5 minutes before expiry, at which point it generates a new token with an expiry of 1 hour. But min_ttl is hard-coded to 5 minutes (300 seconds).

This creates the potential for a race condition, where if a high volume of calls come into the object_store at ~5 minutes until expiry, they may each:

  1. Lock the mutex
  2. Observe the cached token is near expiry
  3. Get a new token (which is the same as the old token, with the same <5min expiry time)
  4. Save that in the cache
  5. Release the mutex lock

which is what we observe in our logs. If enough requests come in, one of them will overload the metadata service, leaving the mutex locked and preventing any further use of the object_store. For reasons I don't quite understand, the requests never seem to time out, so the store stays stuck until we restart the service.
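For illustration, here is a minimal, self-contained sketch of that stampede (not the object_store code itself, which is async and mutex-guarded): when the fetcher only ever hands back a token with less than min_ttl remaining, every call through the cache falls through to the fetcher.

use std::time::{Duration, Instant};

struct CachedToken {
    token: String,
    expiry: Option<Instant>,
}

struct MiniTokenCache {
    cached: Option<CachedToken>,
    min_ttl: Duration,
}

impl MiniTokenCache {
    fn get_or_fetch(&mut self, mut fetch: impl FnMut() -> CachedToken) -> String {
        let now = Instant::now();
        if let Some(cached) = self.cached.as_ref() {
            // Same reuse condition as the real cache: only if more than min_ttl remains.
            match cached.expiry {
                Some(expiry)
                    if expiry.checked_duration_since(now).unwrap_or_default() > self.min_ttl =>
                {
                    return cached.token.clone()
                }
                None => return cached.token.clone(),
                _ => (), // near expiry: fall through and fetch again
            }
        }
        let fresh = fetch();
        let token = fresh.token.clone();
        self.cached = Some(fresh);
        token
    }
}

fn main() {
    let mut cache = MiniTokenCache {
        cached: None,
        min_ttl: Duration::from_secs(300), // the hard-coded 5 minutes
    };
    let mut fetches = 0;
    for _ in 0..10 {
        // GKE-like behaviour near expiry: the metadata server keeps returning the
        // same token, always with less than 5 minutes of life left.
        cache.get_or_fetch(|| {
            fetches += 1;
            CachedToken {
                token: "same-token".to_string(),
                expiry: Some(Instant::now() + Duration::from_secs(240)),
            }
        });
    }
    println!("10 calls -> {fetches} fetches"); // prints 10: every single call refetched
}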

To Reproduce

Run a service (for example https://github.com/ArroyoSystems/arroyo) on GKE with a workload identity writing to GCS. Make a high volume of parallel requests to the object store, wait an hour, and observe that many requests are made to the metadata service.

Expected behavior

Only one request should be made to the metadata service.

Proposed solutions

A simple fix is to just reduce the min_ttl for GCS to <= 4 minutes. However, I think it's dangerous for a generic subsystem like the token cache to rely on the exact behavior of the token generation. A better solution might be an asynchronous refresh process that is kicked off when min_ttl is hit and runs (with appropriate backoff) until it successfully gets a token with expiry > min_ttl. This would also avoid the latency impact of fetching tokens inside the request path itself.
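A rough sketch of what that background refresh could look like, assuming a tokio runtime; the names here (Token, fetch_token, refresh_loop, MIN_TTL) are placeholders for illustration, not object_store APIs.

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

#[derive(Clone)]
struct Token {
    value: String,
    expiry: Instant,
}

type SharedToken = Arc<RwLock<Option<Token>>>;

const MIN_TTL: Duration = Duration::from_secs(300);

// Hypothetical stand-in for the call to the metadata server.
async fn fetch_token() -> Result<Token, Box<dyn std::error::Error + Send + Sync>> {
    Ok(Token {
        value: "token".into(),
        expiry: Instant::now() + Duration::from_secs(3600),
    })
}

// Runs in the background; request paths only ever read the shared slot.
async fn refresh_loop(shared: SharedToken) {
    let mut backoff = Duration::from_millis(100);
    loop {
        match fetch_token().await {
            // Only accept a token that still has more than MIN_TTL of life left;
            // otherwise retry with backoff instead of hammering the endpoint.
            Ok(token) if token.expiry > Instant::now() + MIN_TTL => {
                let sleep_for = token
                    .expiry
                    .duration_since(Instant::now())
                    .saturating_sub(MIN_TTL);
                *shared.write().await = Some(token);
                backoff = Duration::from_millis(100);
                tokio::time::sleep(sleep_for).await;
            }
            _ => {
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(30));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let shared: SharedToken = Arc::new(RwLock::new(None));
    tokio::spawn(refresh_loop(shared.clone()));

    // A request path would just read the cached value (None until the first refresh lands).
    let _token = shared.read().await.clone();
}

With this shape, a slow or flapping metadata server only delays the refresher task rather than stalling every request behind the cache's mutex.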

@mwylde mwylde added the bug label Oct 24, 2024
@mwylde mwylde changed the title min_ttl is too short for GKE tokens [Object_store] min_ttl is too short for GKE tokens Oct 24, 2024
@mwylde mwylde changed the title [Object_store] min_ttl is too short for GKE tokens [Object_store] min_ttl is too high for GKE tokens Oct 24, 2024
@tustvold
Contributor

tustvold commented Oct 24, 2024

Hmm... this is kind of unfortunate. We could make the min_ttl configurable, but as you say that is not an ideal solution. Another option might be to throttle concurrent metadata requests, but again that just moves the problem.

Can this behaviour of workload identity perhaps be configured? It seems pretty bizarre to me: it effectively means you can't reliably get a fresh credential, nor one that is valid for more than 5 minutes, which seems fairly limiting. Even if we did the fetching as an asynchronous job, we'd run into similar issues.

Perhaps there is some query parameter we could add to force it to generate a new token?

I'm going to change this to an enhancement, as it isn't really a bug but an enhancement to work around a limitation of some other component.

@tustvold tustvold added the enhancement (Any new improvement worthy of an entry in the changelog) and help wanted labels and removed the bug label Oct 24, 2024
@mwylde
Contributor Author

mwylde commented Oct 24, 2024

I might not have been clear. When you first request the token, it has a TTL of 1 hour. However, the token is cached locally by the metadata service until it has 5 minutes of time left, at which point it will generate a new one:

Access tokens expire after a short period of time. The metadata server caches access tokens until they have 5 minutes of remaining time before they expire. If tokens are unable to be cached, requests that exceed 50 queries per second might be rate limited. Your applications must have a valid access token for their API calls to succeed.

So if you just reduce the min_ttl to under five minutes, it will probably be ok. There doesn't seem to be a way to force the metadata service to give you a new token (although I'm far from a GCP expert so there might be something I've missed in the docs).
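To make that concrete, here is a small standalone check of the cache's reuse condition (remaining > min_ttl) for the worst case the metadata server can produce, a token with just under 5 minutes left; the 240-second value is only an example floor, not anything the crate currently exposes.

use std::time::Duration;

fn main() {
    // Worst case per the GCP docs quoted above: the metadata server serves its cached
    // token until 5 minutes remain, so a caller can see just under 300 s of life.
    let worst_case_remaining = Duration::from_secs(299);

    for min_ttl in [Duration::from_secs(300), Duration::from_secs(240)] {
        // Same comparison the cache performs before reusing a token.
        let reused = worst_case_remaining > min_ttl;
        println!(
            "min_ttl = {}s -> cached token reused: {}, refetch loop possible: {}",
            min_ttl.as_secs(),
            reused,
            !reused
        );
    }
}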

I do think this is a bug, because currently on GCP you can end up in a situation where object_store is stuck on a lock and unable to make progress, apparently indefinitely. And because the min_ttl isn't configurable from outside the library, there is no workaround except modifying the code.

I think the key thing that needs to be fixed is that the cache will happily overwhelm the metadata service with requests when the token being returned expires in <min_ttl.

@tustvold
Contributor

tustvold commented Oct 25, 2024

object_store is stuck on a lock and unable to make progress

Is this the case, or is it stuck trying to get a token from a crashed metadata server?

Either way, I'd be happy to review a PR to make the min-ttl configurable, and perhaps lower the default to 4 minutes for GCP.
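Purely as a hypothetical sketch of what that could look like (the TokenCache is an internal type, and the actual change in #6638 may be shaped differently):

use std::time::Duration;

// Simplified stand-in for the internal token cache; only the min_ttl plumbing is shown.
struct TokenCache {
    min_ttl: Duration,
    // ... cached token state elided ...
}

impl Default for TokenCache {
    fn default() -> Self {
        Self {
            min_ttl: Duration::from_secs(300), // the current hard-coded default
        }
    }
}

impl TokenCache {
    // Hypothetical builder-style override for stores with unusual token behaviour.
    fn with_min_ttl(mut self, min_ttl: Duration) -> Self {
        self.min_ttl = min_ttl;
        self
    }
}

fn main() {
    // The GCP credential provider could then default to a 4-minute floor, staying below
    // the metadata server's 5-minute refresh window, while other stores keep 300 s.
    let _gcp_cache = TokenCache::default().with_min_ttl(Duration::from_secs(240));
}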

@tustvold
Contributor

tustvold commented Oct 25, 2024

I've filed #6627, which is related to this and is ultimately why the crate is trying so hard to get a fresh token.

I've also changed this back to a bug, as the distinction isn't really important.

@tustvold tustvold added the bug label and removed the enhancement label Oct 25, 2024
@alamb alamb added the object-store (Object Store Interface) label Nov 16, 2024
@alamb
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'object-store'} from #6638
