
Cluster timeouts for large images #142

Open

mdagost opened this issue Oct 31, 2023 · 0 comments
mdagost commented Oct 31, 2023

@ygong1 tagging you because you seem to have done all the CUDA container work. I'm running into problems starting Databricks clusters from these images because they are so big: the clusters appear to hit timeout errors while downloading the images.

The base cuda-pytorch image is about 7 GB on its own, and adding even a few packages on top pushes it to roughly 7.8 GB. At that size, the same image on ECR sometimes starts a cluster successfully and sometimes fails with an error like:

Internal error message: Failed to launch spark container on instance i-xxxx. Exception: Container setup has timed out

@ygong1 are you all seeing anything like this internally? The timeout for the container download doesn't appear to be adjustable by the user.
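As a workaround sketch (not something confirmed in this thread): since the download timeout can't be raised, shrinking the image is the only lever on the user side. A minimal Dockerfile sketch of common size reductions; the base tag and package names here are illustrative placeholders, not the actual image:

```dockerfile
# Hypothetical base tag; substitute the real cuda-pytorch image referenced above.
FROM databricksruntime/gpu-pytorch:latest

# Install pip packages without keeping the download cache in the layer.
# (Package names below are examples only.)
RUN pip install --no-cache-dir pandas scikit-learn \
    && rm -rf /root/.cache

# For apt installs, skip recommended packages and clear the package
# lists in the same RUN so the cleanup actually shrinks the layer.
RUN apt-get update \
    && apt-get install -y --no-install-recommends jq \
    && rm -rf /var/lib/apt/lists/*
```

The key point is that each `RUN` must clean up its own temporary files: deleting files in a *later* layer does not reduce the size of earlier layers, so caches removed in a separate step still ship in the image.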
