
GCS Auth ERROR/Download timeout #774

Closed
rishabhm12 opened this issue Sep 5, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@rishabhm12

rishabhm12 commented Sep 5, 2024

ENV:

  • Ubuntu 22.04.4 LTS
  • a2-ultragpu-8g (8x A100 GPUs)
  • torch==1.13.1
  • DDP
  • Data resides in GCS and we are using a service account

I am trying to stream data, and I have three data_loaders that are used for alternating training of the model. One of the datasets is an <image, text> dataset with 200M+ <image, text> pairs. I run into the error below whenever num_workers per data_loader >= 4.
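For context, a minimal sketch of how one of these loaders is wired up (the bucket path, local cache directory, and batch size below are placeholders, not my actual values):

from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Hypothetical GCS remote and local cache directory for the <image, text> shards.
dataset = StreamingDataset(
    remote='gs://my-bucket/image-text-mds',
    local='/tmp/mds-cache/image-text',
    shuffle=True,
    batch_size=256,
)

# The error appears once num_workers reaches 4 or more per DataLoader.
loader = DataLoader(dataset, batch_size=256, num_workers=4)

The traceback: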

File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1311, in on_exception raise exception File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run result = self.fn(*self.args, **self.kwargs) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1361, in _prepare_thread self.prepare_shard(shard_id, False) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1192, in prepare_shard delta = stream.prepare_shard(shard) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 422, in prepare_shard delta += self._prepare_shard_part(raw_info, zip_info, shard.compression) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 396, in _prepare_shard_part self._download_file(raw_info.basename) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 311, in _download_file retry(num_attempts=self.download_retry)( File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 525, in new_func raise e File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 521, in new_func return func(*args, **kwargs) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 312, in <lambda> lambda: download_file(remote, local, self.download_timeout))() File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 542, in download_file download_from_gcs(remote, local) File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 197, in download_from_gcs raise ValueError(GCS_ERROR_NO_AUTHENTICATION) ValueError: Either set the environment variablesGCS_KEYandGCS_SECRET or use any of the methods in https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default Credentials. See also https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html

The env has all the necessary permissions to access GCS. The error goes away when num_workers per data_loader is reduced.
This issue seems very closely related to #728.

  • Will scalability to more GPU processes become a problem?
rishabhm12 added the bug label Sep 5, 2024
@karan6181
Collaborator

Closing this as a duplicate of #728
