Controlling disk size for spot controller #2007

Closed
xelibrion opened this issue Jun 1, 2023 · 2 comments

@xelibrion

xelibrion commented Jun 1, 2023

I'm trying to kick off model training on Azure and am running into an issue where SkyPilot is unable to launch a controller node.
I've added a disk_size parameter, but it does not seem to help; it looks like it only controls the disk size of the worker node.
For the spot controller, SkyPilot has chosen the Standard_D8s_v5 instance type, and it seems the default 50 GB disk does not satisfy an Azure constraint: the VM image requires at least 150 GB.

config:

resources:
  cloud: azure
  accelerators: A100:1  # 1x NVIDIA A100 GPU
  disk_size: 150

Exception Details:	(OperationNotAllowed) The specified disk size 50 GB is smaller than the size of the corresponding disk in the VM image: 150 GB. This is not allowed. Please choose equal or greater size or do not specify an explicit size.
	Code: OperationNotAllowed
	Message: The specified disk size 50 GB is smaller than the size of the corresponding disk in the VM image: 150 GB. This is not allowed. Please choose equal or greater size or do not specify an explicit size.
	Target: osDisk.diskSizeGB

Could anyone suggest a workaround here?

@concretevitamin
Member

Hey @xelibrion - you could modify it in this controller template: https://github.com/skypilot-org/skypilot/blob/master/sky/templates/spot-controller.yaml.j2#L5

For it to take effect, please install from source.

TODO for us: add this to docs & make it easier.

@concretevitamin
Member

This is now documented in https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html#customizing-spot-controller-resources. Please feel free to reopen this if that doc page isn't clear for this use case.
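
For readers landing here, a minimal sketch of the kind of override that doc page describes, assuming the user-level config file at ~/.sky/config.yaml and the spot.controller.resources keys (newer SkyPilot releases may name the top-level key differently, so check the linked page for your version). The 150 value is taken from the error message above, not from the docs:

```yaml
# ~/.sky/config.yaml (assumed location of the user-level SkyPilot config)
spot:
  controller:
    resources:
      # Assumption: must be at least the size of the OS disk in the Azure
      # VM image, which the error above reports as 150 GB.
      disk_size: 150
```

The override probably only applies to a newly launched controller, so an existing one may need to be torn down first.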
