@ygong1 tagging you because you seem to have done all the CUDA container work. I'm running into issues starting Databricks clusters based on these images because they are so large: the clusters seem to hit timeout errors while downloading the images.
The base cuda-pytorch image appears to be 7 GB on its own. Adding even a few things on top pushes it to about 7.8 GB. At that size, the same image on ECR sometimes starts a cluster successfully and sometimes fails with an error like:
Internal error message: Failed to launch spark container on instance i-xxxx. Exception: Container setup has timed out
@ygong1 are you all seeing anything like this internally? The timeout for the container download doesn't appear to be adjustable by the user.