
GCS: MosaicML-streaming overloads the GCP metadata service when too many processes are used. #728

Closed
smspillaz opened this issue Jul 18, 2024 · 2 comments · Fixed by #817
Labels
bug Something isn't working

Comments

@smspillaz

smspillaz commented Jul 18, 2024

Environment

  • OS: Debian 12 on GCE
  • Hardware (GPU, or instance type): N4

To reproduce

Steps to reproduce the behavior:

  1. Run inside of a GCE machine with a service account (eg, using service account authentication with credentials coming from the metadata service)
  2. Run inside of a docker container
  3. Use torchrun to launch multiple processes (eg, 4)
  4. Use StreamingDataset in a DataLoader with many worker processes (eg, 4)
  5. Dataset source is from GCS

Because https://github.com/mosaicml/streaming/blob/main/streaming/base/storage/download.py#L235 queries the metadata service every time it is invoked in order to get credentials, doing this from multiple sub-processes at the same time can overload the service and exhaust the available connections, resulting in this warning:

WARNING Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: [Errno 99] Cannot assign requested address

The backoff inside google-auth doesn't appear to add any jitter (it's purely exponential), so if the worker subprocesses are running roughly in sync, their retries collide and the request eventually fails, even if we increase the timeout as specified here.
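For context, a minimal sketch of what "backoff with jitter" means, which is what google-auth's retry loop appears to lack. The function name and parameters here are illustrative, not part of google-auth:

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Yield sleep durations using the "full jitter" strategy: each delay is
    drawn uniformly from [0, min(cap, base * 2**attempt)], so synchronized
    workers spread their retries out instead of retrying in lockstep."""
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With plain exponential backoff, every worker that failed at the same moment retries at the same moment again; jitter decorrelates them.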

In principle, we should not have to query the metadata service every time we need credentials. The credentials are short-lived, but the google.auth.compute_engine.GCECredentials object provides an expired property (https://google-auth.readthedocs.io/en/master/reference/google.auth.compute_engine.html#module-google.auth.compute_engine). So it should be possible to cache the retrieved credentials for a given project ID and refresh them only when needed. In our case, we monkey-patch the function to do exactly that:

import datetime

import google.auth._default
import google.auth.transport._http_client
from google.auth._default import _get_gce_credentials


def _patched_gce_credentials_wrapper():
    # Cache of (credentials, project_id) keyed by quota_project_id, shared
    # across all calls within this process.
    cached_credentials = {}

    def _patched_get_gce_credentials(request=None, quota_project_id=None):
        credentials, project_id = cached_credentials.get(quota_project_id, (None, None))
        # google-auth stores expiry as a naive UTC datetime, so compare
        # against a naive UTC "now".
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)

        if not credentials:
            # First call in this process: hit the metadata service once.
            credentials, project_id = _get_gce_credentials(
                request=request, quota_project_id=quota_project_id)
            cached_credentials[quota_project_id] = (credentials, project_id)

        # Refresh only when the token has no expiry yet or is within 100
        # seconds of expiring, instead of re-fetching on every call.
        if credentials.expiry is None or (credentials.expiry - datetime.timedelta(seconds=100)) < now:
            request = google.auth.transport._http_client.Request()
            credentials.refresh(request)

        return credentials, project_id

    return _patched_get_gce_credentials


google.auth._default._get_gce_credentials = _patched_gce_credentials_wrapper()

Expected behavior

Shards can be fetched from GCS without making too many concurrent queries to the metadata service. The fix here is probably to cache and refresh the credentials in a similar way, though it's unclear to me where in the library the caching should happen.
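One possible shape for such caching, sketched with only the standard library: a process-wide holder that re-fetches only near expiry. The `fetch` callable and the 100-second slack are illustrative assumptions, not part of streaming's or google-auth's API:

```python
import datetime
import threading


class CachedCredentials:
    """Cache a (value, expiry) pair and re-fetch it only near expiry."""

    def __init__(self, fetch, slack=datetime.timedelta(seconds=100)):
        self._fetch = fetch    # callable returning (credentials, expiry)
        self._slack = slack    # refresh this long before actual expiry
        self._lock = threading.Lock()
        self._value = None
        self._expiry = None

    def get(self):
        now = datetime.datetime.now(datetime.timezone.utc)
        with self._lock:
            # Only one caller per process hits the metadata service; the
            # rest reuse the cached token until it is close to expiring.
            if self._value is None or self._expiry - self._slack < now:
                self._value, self._expiry = self._fetch()
            return self._value
```

Note this only deduplicates requests within a single process; each torchrun process and DataLoader worker would still fetch once, which is already a large reduction over once per shard.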

@smspillaz smspillaz added the bug Something isn't working label Jul 18, 2024
@smspillaz smspillaz changed the title GCS: Mosiacml-streaming overloads the GCP metadata service when too many threads are used. GCS: Mosiacml-streaming overloads the GCP metadata service when too many processes are used. Jul 18, 2024
@snarayan21
Collaborator

I see, thanks for flagging. Yes, this is a known issue: we fetch credentials and metadata on each shard access. We have our eye on this and may pursue a permanent fix. Your current workaround looks like a viable route too.

@karan6181
Collaborator

@snarayan21, do you have a temporary solution in mind? And what permanent solution are you considering? Is it wrapping all the download functions per cloud provider in a class?
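A minimal sketch of what that per-provider class could look like, assuming each provider lazily creates one client and reuses it across shard downloads. All names here are hypothetical, not streaming's actual API:

```python
from abc import ABC, abstractmethod


class CloudDownloader(ABC):
    """One instance per provider; the client is created once per process
    (one credentials fetch) and reused for every subsequent shard."""

    def __init__(self):
        self._client = None

    @abstractmethod
    def _create_client(self):
        """Build the provider client (e.g. authenticate against GCS)."""

    @abstractmethod
    def _do_download(self, client, remote, local):
        """Download one shard using the cached client."""

    def download(self, remote, local):
        if self._client is None:
            self._client = self._create_client()
        self._do_download(self._client, remote, local)
```

This would move credential handling out of the free download functions and into per-provider state, so caching happens naturally at the object level.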

@rishabhm12 @smspillaz Wondering if either of you is interested in making a contribution?
