Data resides in GCS, and we authenticate with a service account.
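Since a service account is in use, one common way to wire up Application Default Credentials is to export `GOOGLE_APPLICATION_CREDENTIALS` before the DataLoader workers are spawned, so every worker process inherits it. This is a sketch, not from the report; the key path is a placeholder:

```python
import os

# Placeholder path to the service-account key file; adjust for your cluster.
# Setting this BEFORE DataLoader workers are spawned matters, because each
# worker process builds its own GCS client and reads this variable at
# client-construction time.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/dbfs/secrets/gcs-service-account.json"
```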
I am trying to stream data using three data loaders for alternating training of the model. One of the datasets is an <image, text> dataset with 200M+ <image, text> pairs. I run into the following error whenever `num_workers` per data loader is >= 4:
```
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1311, in on_exception
    raise exception
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1361, in _prepare_thread
    self.prepare_shard(shard_id, False)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/dataset.py", line 1192, in prepare_shard
    delta = stream.prepare_shard(shard)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 422, in prepare_shard
    delta += self._prepare_shard_part(raw_info, zip_info, shard.compression)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 396, in _prepare_shard_part
    self._download_file(raw_info.basename)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 311, in _download_file
    retry(num_attempts=self.download_retry)(
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 525, in new_func
    raise e
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/util.py", line 521, in new_func
    return func(*args, **kwargs)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/stream.py", line 312, in <lambda>
    lambda: download_file(remote, local, self.download_timeout))()
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 542, in download_file
    download_from_gcs(remote, local)
File "/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/streaming/base/storage/download.py", line 197, in download_from_gcs
    raise ValueError(GCS_ERROR_NO_AUTHENTICATION)
ValueError: Either set the environment variables GCS_KEY and GCS_SECRET or use any of the methods in https://cloud.google.com/docs/authentication/external/set-up-adc to set up Application Default Credentials. See also https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/gcp.html
```
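The error message itself names an alternative to ADC: the HMAC-style `GCS_KEY` / `GCS_SECRET` environment variables. A minimal sketch of that path (the values shown are placeholders, not real credentials; real HMAC keys are generated under Cloud Storage interoperability settings in the GCP console):

```python
import os

# Placeholder HMAC credentials -- substitute keys generated for your
# service account. These must be set in every process that downloads
# shards, i.e. before DataLoader workers are spawned.
os.environ["GCS_KEY"] = "GOOG1EXAMPLEACCESSKEY"
os.environ["GCS_SECRET"] = "exampleSecretValue0000000000000000000000"
```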
The environment has all the necessary permissions to access GCS. The error goes away when `num_workers` per data loader is reduced.
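Since reducing `num_workers` works around the problem, one interim option is to cap the per-loader worker count by splitting a total worker budget across the three loaders. A stdlib-only sketch (the budget of 9 is an assumed example, not a value from the report):

```python
def split_worker_budget(total_workers: int, num_loaders: int) -> list[int]:
    """Divide a total worker budget as evenly as possible across loaders,
    so each loader's num_workers stays below the threshold (here, >= 4)
    that triggers the GCS authentication error."""
    base, rem = divmod(total_workers, num_loaders)
    # The first `rem` loaders absorb the remainder, one extra worker each.
    return [base + (1 if i < rem else 0) for i in range(num_loaders)]

# With 3 data loaders and a budget of 9 workers, each loader gets 3,
# staying below the problematic num_workers >= 4.
print(split_worker_budget(9, 3))  # -> [3, 3, 3]
```

Each value would then be passed as `num_workers` to the corresponding DataLoader.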
This issue seems very closely related to issue #728.
Will scaling to more GPU processes make this worse?