
[Object_store] min_ttl is too high for GKE tokens #6625

Closed
mwylde opened this issue Oct 24, 2024 · 5 comments · Fixed by #6638

@mwylde
Contributor

mwylde commented Oct 24, 2024

Describe the bug

When using object_store on a GKE pod with workload credentials, we see a huge volume of requests to the metadata endpoint to refresh the token (this shows up in the logs as a stream of "fetching token from metadata server" lines, 1-2 ms apart). This can overload the metadata service and prevent the service from doing any further work.

This is caused by the implementation of the TokenCache:

if let Some(cached) = locked.as_ref() {
    match cached.expiry {
        // Reuse the cached token only while it is valid for longer than min_ttl
        Some(ttl) if ttl.checked_duration_since(now).unwrap_or_default() > self.min_ttl => {
            return Ok(cached.token.clone());
        }
        None => return Ok(cached.token.clone()),
        // Near expiry: fall through and refetch
        _ => (),
    }
}
let cached = f().await?;
let token = cached.token.clone();
*locked = Some(cached);

The token cache is supposed to prevent multiple requests to fetch the token by reusing a cached token. However, if the cached token is close to expiry (within min_ttl), it attempts to refresh it and then stores the new token in the cache.

In this case, what's happening is that the GKE metadata service returns the same token for every call up until ~5 minutes before expiry, at which point it generates a new token with an expiry of 1 hour. But min_ttl is hard-coded to 5 minutes (300 seconds).

This creates the potential for a race condition, where if a high volume of calls come into the object_store at ~5 minutes until expiry, they may each:

  1. Lock the mutex
  2. Observe the cached token is near expiry
  3. Get a new token (which is the same as the old token, with the same <5min expiry time)
  4. Save that in the cache
  5. Release the mutex lock

which is what we observe in our logs. If enough requests come in, one of them will overload the metadata service, leaving the mutex locked and preventing any further use of the object_store. For reasons I don't quite understand, the requests never seem to time out, so the store stays stuck until we restart the service.
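For illustration, here is a minimal, self-contained sketch of that stampede (not the object_store code itself, which is async and mutex-guarded): when the fetcher only ever hands back a token with less than min_ttl remaining, every call through the cache falls through to the fetcher.

use std::time::{Duration, Instant};

struct CachedToken {
    token: String,
    expiry: Option<Instant>,
}

struct MiniTokenCache {
    cached: Option<CachedToken>,
    min_ttl: Duration,
}

impl MiniTokenCache {
    fn get_or_fetch(&mut self, mut fetch: impl FnMut() -> CachedToken) -> String {
        let now = Instant::now();
        if let Some(cached) = self.cached.as_ref() {
            // Same reuse condition as the real cache: only if more than min_ttl remains.
            match cached.expiry {
                Some(expiry)
                    if expiry.checked_duration_since(now).unwrap_or_default() > self.min_ttl =>
                {
                    return cached.token.clone()
                }
                None => return cached.token.clone(),
                _ => (), // near expiry: fall through and fetch again
            }
        }
        let fresh = fetch();
        let token = fresh.token.clone();
        self.cached = Some(fresh);
        token
    }
}

fn main() {
    let mut cache = MiniTokenCache {
        cached: None,
        min_ttl: Duration::from_secs(300), // the hard-coded 5 minutes
    };
    let mut fetches = 0;
    for _ in 0..10 {
        // GKE-like behaviour near expiry: the metadata server keeps returning the
        // same token, always with less than 5 minutes of life left.
        cache.get_or_fetch(|| {
            fetches += 1;
            CachedToken {
                token: "same-token".to_string(),
                expiry: Some(Instant::now() + Duration::from_secs(240)),
            }
        });
    }
    println!("10 calls -> {fetches} fetches"); // prints 10: every single call refetched
}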

To Reproduce

Run a service (for example https://github.com/ArroyoSystems/arroyo) on GKE with a workload identity writing to GCS. Make a high volume of parallel requests to the object store, wait an hour, and observe that many requests are made to the metadata service.

Expected behavior

Only one request should be made to the metadata service.

Proposed solutions

A simple fix is to just reduce the min_ttl for GCS to <= 4 minutes. However, I think it's dangerous for a generic subsystem like the token cache to rely on the exact behavior of the token generation. A better solution might be an asynchronous refresh process that is kicked off when min_ttl is hit and runs (with appropriate backoff) until it successfully gets a token with expiry > min_ttl. This would also avoid the latency impact of fetching tokens inside the request path itself.
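A rough sketch of what that background refresh could look like, assuming a tokio runtime; the names here (Token, fetch_token, refresh_loop, MIN_TTL) are placeholders for illustration, not object_store APIs.

use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::RwLock;

#[derive(Clone)]
struct Token {
    value: String,
    expiry: Instant,
}

type SharedToken = Arc<RwLock<Option<Token>>>;

const MIN_TTL: Duration = Duration::from_secs(300);

// Hypothetical stand-in for the call to the metadata server.
async fn fetch_token() -> Result<Token, Box<dyn std::error::Error + Send + Sync>> {
    Ok(Token {
        value: "token".into(),
        expiry: Instant::now() + Duration::from_secs(3600),
    })
}

// Runs in the background; request paths only ever read the shared slot.
async fn refresh_loop(shared: SharedToken) {
    let mut backoff = Duration::from_millis(100);
    loop {
        match fetch_token().await {
            // Only accept a token that still has more than MIN_TTL of life left;
            // otherwise retry with backoff instead of hammering the endpoint.
            Ok(token) if token.expiry > Instant::now() + MIN_TTL => {
                let sleep_for = token
                    .expiry
                    .duration_since(Instant::now())
                    .saturating_sub(MIN_TTL);
                *shared.write().await = Some(token);
                backoff = Duration::from_millis(100);
                tokio::time::sleep(sleep_for).await;
            }
            _ => {
                tokio::time::sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(30));
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let shared: SharedToken = Arc::new(RwLock::new(None));
    tokio::spawn(refresh_loop(shared.clone()));

    // A request path would just read the cached value (None until the first refresh lands).
    let _token = shared.read().await.clone();
}

With this shape, a slow or flapping metadata server only delays the refresher task rather than stalling every request behind the cache's mutex.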

@mwylde mwylde added the bug label Oct 24, 2024
@mwylde mwylde changed the title min_ttl is too short for GKE tokens [Object_store] min_ttl is too short for GKE tokens Oct 24, 2024
@mwylde mwylde changed the title [Object_store] min_ttl is too short for GKE tokens [Object_store] min_ttl is too high for GKE tokens Oct 24, 2024
@tustvold
Contributor

tustvold commented Oct 24, 2024

Hmm... this is kind of unfortunate. We could make the min_ttl configurable, but as you say that is not an ideal solution. Another option might be to throttle concurrent metadata requests, but again that just moves the problem.

Can this behaviour of workload identity perhaps be configured? It seems pretty bizarre to me: it effectively means you can't reliably get a fresh credential, nor one that is valid for more than 5 minutes, which seems fairly limiting. Even if we did the fetching as an asynchronous job, we'd run into similar issues.

Perhaps there is some query parameter we could add to force it to generate a new token?

I'm going to change this to an enhancement, as it isn't really a bug but an enhancement to work around a limitation of some other component.

@tustvold tustvold added the enhancement (Any new improvement worthy of an entry in the changelog) and help wanted labels and removed the bug label Oct 24, 2024
@mwylde
Contributor Author

mwylde commented Oct 24, 2024

I might not have been clear. When you first request the token, it has a TTL of 1 hour. However, the token is cached locally by the metadata service until it has 5 minutes of time left, at which point it will generate a new one:

Access tokens expire after a short period of time. The metadata server caches access tokens until they have 5 minutes of remaining time before they expire. If tokens are unable to be cached, requests that exceed 50 queries per second might be rate limited. Your applications must have a valid access token for their API calls to succeed.

So if you just reduce the min_ttl to under five minutes, it will probably be ok. There doesn't seem to be a way to force the metadata service to give you a new token (although I'm far from a GCP expert so there might be something I've missed in the docs).
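To make that concrete, here is a small standalone check of the cache's reuse condition (remaining > min_ttl) for the worst case the metadata server can produce, a token with just under 5 minutes left; the 240-second value is only an example floor, not anything the crate currently exposes.

use std::time::Duration;

fn main() {
    // Worst case per the GCP docs quoted above: the metadata server serves its cached
    // token until 5 minutes remain, so a caller can see just under 300 s of life.
    let worst_case_remaining = Duration::from_secs(299);

    for min_ttl in [Duration::from_secs(300), Duration::from_secs(240)] {
        // Same comparison the cache performs before reusing a token.
        let reused = worst_case_remaining > min_ttl;
        println!(
            "min_ttl = {}s -> cached token reused: {}, refetch loop possible: {}",
            min_ttl.as_secs(),
            reused,
            !reused
        );
    }
}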

I do think this is a bug, because currently on GCP you can end up in a situation where object_store is stuck on a lock and unable to make progress, apparently indefinitely. And because the min_ttl isn't configurable from outside the library, there is no workaround except modifying the code.

I think the key thing that needs to be fixed is that the cache will happily overwhelm the metadata service with requests when the token being returned expires in <min_ttl.

@tustvold
Contributor

tustvold commented Oct 25, 2024

object_store is stuck on a lock and unable to make progress

Is this the case, or is it stuck trying to get a token from a crashed metadata server?

Either way, I'd be happy to review a PR to make the min-ttl configurable, and perhaps lower the default to 4 minutes for GCP.
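Purely as a hypothetical sketch of what that could look like (the TokenCache is an internal type, and the actual change in #6638 may be shaped differently):

use std::time::Duration;

// Simplified stand-in for the internal token cache; only the min_ttl plumbing is shown.
struct TokenCache {
    min_ttl: Duration,
    // ... cached token state elided ...
}

impl Default for TokenCache {
    fn default() -> Self {
        Self {
            min_ttl: Duration::from_secs(300), // the current hard-coded default
        }
    }
}

impl TokenCache {
    // Hypothetical builder-style override for stores with unusual token behaviour.
    fn with_min_ttl(mut self, min_ttl: Duration) -> Self {
        self.min_ttl = min_ttl;
        self
    }
}

fn main() {
    // The GCP credential provider could then default to a 4-minute floor, staying below
    // the metadata server's 5-minute refresh window, while other stores keep 300 s.
    let _gcp_cache = TokenCache::default().with_min_ttl(Duration::from_secs(240));
}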

@tustvold
Contributor

tustvold commented Oct 25, 2024

I've filed #6627, which is related to this and is ultimately why the crate is trying so hard to get a fresh token.

I've also changed this back to a bug, as the distinction isn't really important.

@tustvold tustvold added the bug label and removed the enhancement label Oct 25, 2024
@alamb alamb added the object-store (Object Store Interface) label Nov 16, 2024
@alamb
Contributor

alamb commented Nov 16, 2024

label_issue.py automatically added labels {'object-store'} from #6638
