Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lambda] Fix missing private ip #4635

Merged
merged 3 commits into from
Feb 4, 2025

Conversation

bend-works
Copy link
Contributor

Handle the setup failure when single-node A6000 VMs on Lambda Cloud do not return a private IP: in this case, cluster info now returns 127.0.0.1.

Tested: sky launch -c private-ip-bug --cloud lambda --gpus A6000 is successful.

@bend-works
Copy link
Contributor Author

fixes #4634

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bend-works for submitting the fix! Please find a quick comment below. : )

Comment on lines 67 to 71
def _get_private_ip(cluster_name_on_cloud: str,
instance_info: Dict[str, Any]) -> str:
if len(_filter_instances(cluster_name_on_cloud, None)) == 1:
return instance_info.get('private_ip', '127.0.0.1')
return instance_info['private_ip']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we just use a argument called single_node for the function, so we don't have to call the cloud API multiple times for initializing the cluster_info for better performance? We can set single_node=(len(running_instances) > 1)

Suggested change
def _get_private_ip(cluster_name_on_cloud: str,
instance_info: Dict[str, Any]) -> str:
if len(_filter_instances(cluster_name_on_cloud, None)) == 1:
return instance_info.get('private_ip', '127.0.0.1')
return instance_info['private_ip']
def _get_private_ip(instance_info: Dict[str, Any], single_node: bool) -> str:
private_ip = instance_info.get('private_ip')
if private_ip is None:
if single_node:
# The Lambda cloud API may return an instance info without
# private IP. It does not align with their docs, but we still
# allow single-node cluster to proceed with the provisioning, by using
# 127.0.0.1, as private IP is not critical for single-node case.
return '127.0.0.1'
msg = f'Failed to retrieve private IP for instance {instance_info}.'
logger.error(msg)
raise RuntimeError(msg)
return private_ip

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it and it’s easier to unit test! I introduced the extra API call because I wasn’t sure about the assumption that all nodes of a cluster would be running at the same time but you know better.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bend-works! This is great! Merging now.

@Michaelvll Michaelvll linked an issue Feb 4, 2025 that may be closed by this pull request
@Michaelvll Michaelvll merged commit e4ad98c into skypilot-org:master Feb 4, 2025
18 checks passed
@bend-works bend-works deleted the fix-lambda-private-ip branch February 4, 2025 08:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Lambda] Missing private_ip breaks setup on A6000 VM
2 participants