[Lambda] Missing private_ip breaks setup on A6000 VM #4634
Comments
Hmm, this might be an issue with the Lambda Cloud API failing to return the private IP for instances. After this happens, if you do
It seems that the cloud API still doesn't provide a private IP after a while. I refreshed the status twice at 3-minute intervals and tried another launch command on the same cluster, with the same result:
% sky launch -c private-ip-bug --cloud lambda --gpus A6000 --region us-south-1
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 CLOUD    INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
--------------------------------------------------------------------------------------------
 Lambda   gpu_1x_a6000   14      100       A6000:1        us-south-1    0.80       ✔
--------------------------------------------------------------------------------------------
Launching a new cluster 'private-ip-bug'. Proceed? [Y/n]: Y
⚙︎ Launching on Lambda us-south-1.
⨯ Failed to set up SkyPilot runtime on cluster. View logs at: ~/sky_logs/sky-2025-02-03-08-37-02-816035/provision.log
KeyError: 'private_ip'
% sky status -r private-ip-bug
Clusters
NAME            LAUNCHED    RESOURCES                              STATUS  AUTOSTOP  COMMAND
private-ip-bug  3 mins ago  1x Lambda(gpu_1x_a6000, {'A6000': 1})  INIT    -         sky launch -c private-ip-...
% sky status -r private-ip-bug
Clusters
NAME            LAUNCHED    RESOURCES                              STATUS  AUTOSTOP  COMMAND
private-ip-bug  6 mins ago  1x Lambda(gpu_1x_a6000, {'A6000': 1})  INIT    -         sky launch -c private-ip-...
% sky launch -c private-ip-bug --cloud lambda --gpus A6000 --region us-south-1
Running task on cluster private-ip-bug...
Cluster 'private-ip-bug' (status: INIT) was previously in Lambda (us-south-1). Restarting.
⚙︎ Launching on Lambda us-south-1.
⨯ Failed to set up SkyPilot runtime on cluster. View logs at: ~/sky_logs/sky-2025-02-03-08-43-19-017421/provision.log
KeyError: 'private_ip'
Can you confirm that the private IP is only used when there is more than one node in a SkyPilot cluster?
Hmm, this is weird. We should raise an issue with Lambda about their cloud API. Regarding the private IP, I believe we don't actually need it for the single-node case, i.e., setting it to
Thank you for the pointer. In the meantime, can we make the Lambda integration more robust with something along the lines of the PR above? Newer single-node GPUs don't have that issue.
Bug
VMs with A6000 GPUs fail during setup on Lambda Cloud. You can reproduce with the command below; the region was added only for availability, and the number of GPUs doesn't seem to matter. The issue was reproduced on 1x and 2x A6000 with SkyPilot version 1.0.0.dev20250130.
Logs
`instance_info` is missing `private_ip`, and there is no substitute key in the object. The immediate workaround, `instance_info.get('private_ip', 'NA')`, fixes simple launches but would hide the problem when private IPs are actually needed. How would you approach a proper solution?
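
One possible direction, sketched below under assumptions: tolerate a missing `private_ip` only when the cluster has a single node, and fail loudly otherwise. This relies on the belief voiced in the thread that the private IP is not needed in the single-node case, which is not confirmed here. The names `resolve_private_ip`, `MissingPrivateIPError`, and the `num_nodes` parameter are hypothetical and are not part of SkyPilot or the Lambda Cloud API; only the `private_ip` key and the `'NA'` placeholder come from the issue.

```python
from typing import Any, Dict

# Placeholder value taken from the workaround quoted in the issue.
PRIVATE_IP_PLACEHOLDER = 'NA'


class MissingPrivateIPError(RuntimeError):
    """Hypothetical error: a multi-node cluster has an instance without a private IP."""


def resolve_private_ip(instance_info: Dict[str, Any], num_nodes: int) -> str:
    """Return the instance's private IP, or a placeholder for single-node clusters.

    `instance_info` stands in for the dict returned by the cloud API for one
    instance; `num_nodes` is the size of the cluster being provisioned.
    Both parameter names are assumptions made for this sketch.
    """
    private_ip = instance_info.get('private_ip')
    if private_ip:
        return private_ip
    if num_nodes <= 1:
        # Assumption from the thread: a single node never uses its private IP.
        return PRIVATE_IP_PLACEHOLDER
    # Multi-node clusters need working private IPs, so surface the problem
    # instead of hiding it behind the placeholder.
    instance_id = instance_info.get('id', '<unknown>')
    raise MissingPrivateIPError(
        f'Instance {instance_id} has no private_ip, but the cluster has '
        f'{num_nodes} nodes.')
```

Compared with a bare `.get('private_ip', 'NA')`, this keeps simple launches working while still reporting the missing field whenever more than one node is requested.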