[Lambda] Missing private_ip breaks setup on A6000 VM #4634

Closed
bend-works opened this issue Feb 2, 2025 · 4 comments · Fixed by #4635

@bend-works
Contributor

Bug

VMs with A6000 GPUs fail during setup on Lambda Cloud. The command below reproduces it. The region was specified only for availability, and the number of GPUs doesn't seem to matter: the issue reproduces on both 1x and 2x A6000 with SkyPilot version 1.0.0.dev20250130.

% sky launch -c private-ip-bug --cloud lambda --gpus A6000 --region us-south-1
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 CLOUD    INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
--------------------------------------------------------------------------------------------
 Lambda   gpu_1x_a6000   14      100       A6000:1        us-south-1    0.80          ✔
--------------------------------------------------------------------------------------------
Launching a new cluster 'private-ip-bug'. Proceed? [Y/n]: Y
⚙︎ Launching on Lambda us-south-1.
⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2025-02-02-16-56-09-259102/provision.log

KeyError: 'private_ip'

Logs

instance_info is missing private_ip, and there is no substitute key in the object. The immediate workaround, instance_info.get('private_ip', 'NA'), fixes simple launches but would hide the problem when private IPs are actually needed. How would you approach a proper solution?

D 02-02 16:58:30 provisioner.py:650] ==================== System Setup After Provision ====================
D 02-02 16:58:30 provisioner.py:650]
E 02-02 16:58:31 provisioner.py:657] ⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2025-02-02-16-56-09-259102/provision.log
D 02-02 16:58:31 provisioner.py:661] Stacktrace:
D 02-02 16:58:31 provisioner.py:661] Traceback (most recent call last):
D 02-02 16:58:31 provisioner.py:661]   File ".venv/lib/python3.11/site-packages/sky/provision/provisioner.py", line 651, in post_provision_runtime_setup
D 02-02 16:58:31 provisioner.py:661]     return _post_provision_setup(cloud_name,
D 02-02 16:58:31 provisioner.py:661]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 02-02 16:58:31 provisioner.py:661]   File ".venv/lib/python3.11/site-packages/sky/provision/provisioner.py", line 400, in _post_provision_setup
D 02-02 16:58:31 provisioner.py:661]     cluster_info = provision.get_cluster_info(cloud_name,
D 02-02 16:58:31 provisioner.py:661]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 02-02 16:58:31 provisioner.py:661]   File ".venv/lib/python3.11/site-packages/sky/provision/__init__.py", line 52, in _wrapper
D 02-02 16:58:31 provisioner.py:661]     return impl(*args, **kwargs)
D 02-02 16:58:31 provisioner.py:661]            ^^^^^^^^^^^^^^^^^^^^^
D 02-02 16:58:31 provisioner.py:661]   File ".venv/lib/python3.11/site-packages/sky/provision/lambda_cloud/instance.py", line 206, in get_cluster_info
D 02-02 16:58:31 provisioner.py:661]     internal_ip=instance_info['private_ip'],
D 02-02 16:58:31 provisioner.py:661]                 ~~~~~~~~~~~~~^^^^^^^^^^^^^^
D 02-02 16:58:31 provisioner.py:661] KeyError: 'private_ip'

@Michaelvll
Collaborator

Hmm, this might be an issue with the Lambda Cloud API failing to return the private IP for instances. After this happens, does running sky status -r cluster-name fix the issue? I'm wondering whether the cloud API just isn't ready to return the private IP right after the instance is launched.
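
One way to test the "API not ready yet" hypothesis is to poll the instance record for a short while before giving up. The sketch below is illustrative only: wait_for_private_ip and the fetch_instance_info callable are hypothetical names, not part of SkyPilot or the Lambda Cloud API, and the attempt count and delay are arbitrary.

```python
import time
from typing import Any, Callable, Dict, Optional


def wait_for_private_ip(
    fetch_instance_info: Callable[[], Dict[str, Any]],
    attempts: int = 5,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until 'private_ip' shows up; return None if it never does.

    fetch_instance_info is a hypothetical callable that returns the latest
    instance record as a dict (for example, one entry of a list-instances
    response).
    """
    for attempt in range(attempts):
        info = fetch_instance_info()
        private_ip = info.get('private_ip')
        if private_ip:
            return private_ip
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

If this returns None even after a generous wait, the field is probably absent from the API response rather than just delayed.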

@bend-works
Contributor Author

bend-works commented Feb 3, 2025

It seems that the cloud API still doesn't provide a private IP even after waiting. I refreshed twice at 3-minute intervals and then tried another launch command on the same cluster: same issue.

% sky launch -c private-ip-bug --cloud lambda --gpus A6000 --region us-south-1
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 CLOUD    INSTANCE       vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
--------------------------------------------------------------------------------------------
 Lambda   gpu_1x_a6000   14      100       A6000:1        us-south-1    0.80          ✔
--------------------------------------------------------------------------------------------
Launching a new cluster 'private-ip-bug'. Proceed? [Y/n]: Y
⚙︎ Launching on Lambda us-south-1.
⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2025-02-03-08-37-02-816035/provision.log

KeyError: 'private_ip'
% sky status -r private-ip-bug
Clusters
NAME            LAUNCHED    RESOURCES                              STATUS  AUTOSTOP  COMMAND
private-ip-bug  3 mins ago  1x Lambda(gpu_1x_a6000, {'A6000': 1})  INIT    -         sky launch -c private-ip-...
% sky status -r private-ip-bug
Clusters
NAME            LAUNCHED    RESOURCES                              STATUS  AUTOSTOP  COMMAND
private-ip-bug  6 mins ago  1x Lambda(gpu_1x_a6000, {'A6000': 1})  INIT    -         sky launch -c private-ip-...
% sky launch -c private-ip-bug --cloud lambda --gpus A6000 --region us-south-1
Running task on cluster private-ip-bug...
Cluster 'private-ip-bug' (status: INIT) was previously in Lambda (us-south-1). Restarting.
⚙︎ Launching on Lambda us-south-1.
⨯ Failed to set up SkyPilot runtime on cluster.  View logs at: ~/sky_logs/sky-2025-02-03-08-43-19-017421/provision.log

KeyError: 'private_ip'

Can you confirm that the private IP is only used when there is more than one node in a SkyPilot cluster?

@Michaelvll
Collaborator

Hmm, this is weird. We should raise an issue to Lambda's cloud API for this.

Regarding the private IP, I believe we don't actually need it for the single-node case, i.e., setting it to 127.0.0.1 would work : )
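
A minimal sketch of that idea, assuming the fallback is applied only when the cluster has a single node. resolve_internal_ip and its num_nodes parameter are hypothetical names for illustration and are not taken from SkyPilot or the linked PR; only the instance_info['private_ip'] lookup comes from the traceback above.

```python
from typing import Any, Dict

LOOPBACK_IP = '127.0.0.1'


def resolve_internal_ip(instance_info: Dict[str, Any], num_nodes: int) -> str:
    """Return the internal IP to record for an instance.

    Uses 'private_ip' when the API provides it. If it is missing, falls back
    to loopback for a single-node cluster and fails loudly otherwise, so a
    missing private IP is never silently hidden on multi-node clusters.
    """
    private_ip = instance_info.get('private_ip')
    if private_ip:
        return private_ip
    if num_nodes == 1:
        # A single node never has to reach a peer over the private network.
        return LOOPBACK_IP
    raise RuntimeError(
        "Instance has no 'private_ip' but the cluster has "
        f'{num_nodes} nodes; a private IP is needed for inter-node traffic.')
```

Unlike instance_info.get('private_ip', 'NA'), this keeps the error visible in the case where the private IP actually matters.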

@bend-works
Contributor Author

Thank you for the pointer. In the meantime, can we make the Lambda integration more robust with something along the lines of the PR above? Newer single-node GPUs don't have that issue.

Michaelvll linked a pull request on Feb 4, 2025 that will close this issue