
GCS: MosaicML-streaming overloads the GCP metadata service when too many processes are used. #728

Closed
smspillaz opened this issue Jul 18, 2024 · 2 comments · Fixed by #817
Labels
bug Something isn't working

Comments

@smspillaz

smspillaz commented Jul 18, 2024

Environment

  • OS: Debian 12 on GCE
  • Hardware (GPU, or instance type): N4

To reproduce

Steps to reproduce the behavior:

  1. Run inside of a GCE machine with a service account (eg, using service account authentication with credentials coming from the metadata service)
  2. Run inside of a docker container
  3. Use torchrun to launch multiple processes (eg, 4)
  4. Use StreamingDataset in a DataLoader with many worker processes (eg, 4)
  5. Dataset source is from GCS

Because https://github.com/mosaicml/streaming/blob/main/streaming/base/storage/download.py#L235 queries the metadata service every time it is invoked in order to get credentials, doing this from multiple sub-processes at the same time can overload the service and exhaust the available connections, resulting in this warning:

WARNING Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: [Errno 99] Cannot assign requested address

The backoff inside google-auth doesn't appear to add any jitter (it's purely exponential), so if the worker subprocesses are running roughly in sync, their retries collide and the request eventually fails, even if we increase the timeout as specified here.
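For context, a minimal sketch of what "backoff with jitter" means, which is what google-auth's retry loop appears to lack. The function name and parameters here are illustrative, not part of google-auth:

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Yield sleep durations using the "full jitter" strategy: each delay is
    drawn uniformly from [0, min(cap, base * 2**attempt)], so synchronized
    workers spread their retries out instead of retrying in lockstep."""
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

With plain exponential backoff, every worker that failed at the same moment retries at the same moment again; jitter decorrelates them.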

In principle, we should not have to query the metadata service every time we need credentials. The credentials are short-lived, but the google.auth.compute_engine.GCECredentials object provides an expired property (https://google-auth.readthedocs.io/en/master/reference/google.auth.compute_engine.html#module-google.auth.compute_engine). So it should be possible to cache the retrieved credentials for a given project ID and refresh them only when needed. In our case, we monkey-patch the function to do exactly that:

import datetime

import google.auth._default
import google.auth.transport._http_client
from google.auth._default import _get_gce_credentials


def _patched_gce_credentials_wrapper():
    # Cache of (credentials, project_id) keyed by quota_project_id, shared
    # across all calls within this process.
    cached_credentials = {}

    def _patched_get_gce_credentials(request=None, quota_project_id=None):
        credentials, project_id = cached_credentials.get(quota_project_id, (None, None))
        # google-auth stores expiry as a naive UTC datetime, so compare
        # against a naive UTC "now".
        now = datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)

        if not credentials:
            # First call in this process: hit the metadata service once.
            credentials, project_id = _get_gce_credentials(
                request=request, quota_project_id=quota_project_id)
            cached_credentials[quota_project_id] = (credentials, project_id)

        # Refresh only when the token has no expiry yet or is within 100
        # seconds of expiring, instead of re-fetching on every call.
        if credentials.expiry is None or (credentials.expiry - datetime.timedelta(seconds=100)) < now:
            request = google.auth.transport._http_client.Request()
            credentials.refresh(request)

        return credentials, project_id

    return _patched_get_gce_credentials


google.auth._default._get_gce_credentials = _patched_gce_credentials_wrapper()

Expected behavior

Shards can be fetched from GCS without making too many concurrent queries to the metadata service. The fix here is probably to cache and refresh the credentials in a similar way, though it's unclear to me where in the library the caching should happen.
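One possible shape for such caching, sketched with only the standard library: a process-wide holder that re-fetches only near expiry. The `fetch` callable and the 100-second slack are illustrative assumptions, not part of streaming's or google-auth's API:

```python
import datetime
import threading


class CachedCredentials:
    """Cache a (value, expiry) pair and re-fetch it only near expiry."""

    def __init__(self, fetch, slack=datetime.timedelta(seconds=100)):
        self._fetch = fetch    # callable returning (credentials, expiry)
        self._slack = slack    # refresh this long before actual expiry
        self._lock = threading.Lock()
        self._value = None
        self._expiry = None

    def get(self):
        now = datetime.datetime.now(datetime.timezone.utc)
        with self._lock:
            # Only one caller per process hits the metadata service; the
            # rest reuse the cached token until it is close to expiring.
            if self._value is None or self._expiry - self._slack < now:
                self._value, self._expiry = self._fetch()
            return self._value
```

Note this only deduplicates requests within a single process; each torchrun process and DataLoader worker would still fetch once, which is already a large reduction over once per shard.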

@smspillaz smspillaz added the bug Something isn't working label Jul 18, 2024
@smspillaz smspillaz changed the title GCS: Mosiacml-streaming overloads the GCP metadata service when too many threads are used. GCS: Mosiacml-streaming overloads the GCP metadata service when too many processes are used. Jul 18, 2024
@snarayan21
Collaborator

I see, thanks for flagging. Yes, this is a known issue: we fetch credentials and metadata on each shard access. We have our eye on this and may pursue a permanent fix. Your current workaround looks like a viable route too.

@karan6181
Collaborator

@snarayan21, do you have a temporary solution in mind? And what permanent solution are you considering? Is it wrapping all the download functions per cloud provider in a class?
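A minimal sketch of what that per-provider class could look like, assuming each provider lazily creates one client and reuses it across shard downloads. All names here are hypothetical, not streaming's actual API:

```python
from abc import ABC, abstractmethod


class CloudDownloader(ABC):
    """One instance per provider; the client is created once per process
    (one credentials fetch) and reused for every subsequent shard."""

    def __init__(self):
        self._client = None

    @abstractmethod
    def _create_client(self):
        """Build the provider client (e.g. authenticate against GCS)."""

    @abstractmethod
    def _do_download(self, client, remote, local):
        """Download one shard using the cached client."""

    def download(self, remote, local):
        if self._client is None:
            self._client = self._create_client()
        self._do_download(self._client, remote, local)
```

This would move credential handling out of the free download functions and into per-provider state, so caching happens naturally at the object level.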

@rishabhm12 @smspillaz Wondering if either of you is interested in making a contribution?
