[FluidStack] Add FluidStack Integration (Provisioner Interface) #3086
Conversation
* Added FluidStack using the Provisioner interface
* Added _get_timeout in tests/test_smoke.py for custom cloud timeouts
* Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config
Thanks for moving the fluidstack implementation to the new provisioner @mjibril! The implementation looks mostly good to me. Left several comments. Will try it out later.
sky/backends/cloud_vm_ray_backend.py
Outdated
@staticmethod
def _fluidstack_handler(blocked_resources: Set['resources_lib.Resources'],
                        launchable_resources: 'resources_lib.Resources',
                        region: 'clouds.Region',
                        zones: Optional[List['clouds.Zone']], stdout: str,
                        stderr: str) -> None:
    del zones  # Unused.
    style = colorama.Style
    stdout_splits = stdout.split('\n')
    stderr_splits = stderr.split('\n')
    errors = [
        s.strip()
        for s in stdout_splits + stderr_splits
        if 'FluidstackAPIError:' in s.strip()
    ]
    if not errors:
        logger.info('====== stdout ======')
        for s in stdout_splits:
            print(s)
        logger.info('====== stderr ======')
        for s in stderr_splits:
            print(s)
        with ux_utils.print_exception_no_traceback():
            raise RuntimeError('Errors occurred during provision; '
                               'check logs above.')
    logger.warning(f'Got error(s) in {region.name}:')
    messages = '\n\t'.join(errors)
    logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
    _add_to_blocked_resources(blocked_resources,
                              launchable_resources.copy(zone=None))
This is not needed now, as we will use the FailoverCloudErrorHandlerV2
(Suggested change: remove the entire _fluidstack_handler shown above.)
sky/clouds/__init__.py
Outdated
'IBM', 'AWS', 'Azure', 'Cloud', 'GCP', 'Lambda', 'Local', 'SCP', 'RunPod',
'OCI', 'Vsphere', 'Kubernetes', 'CloudImplementationFeatures', 'Region',
'Zone', 'CLOUD_REGISTRY', 'ProvisionerVersion', 'StatusVersion',
'Fluidstack'
nit: just for readability
Suggested change:
-    'Fluidstack'
+    'Fluidstack',
sky/backends/backend_utils.py
Outdated
fluidstack_username = 'ubuntu'
if isinstance(cloud, clouds.Fluidstack):
    fluidstack_username = cloud.default_username(to_provision.region)
Let's move this into cloud.make_deploy_variables, called in L781, instead of having it in this function.
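For illustration, a rough sketch of moving the username lookup into the cloud class (the method name follows the reviewer's pointer; the exact SkyPilot signature and variable names are assumptions):

# Hypothetical sketch inside sky/clouds/fluidstack.py; the real
# make_deploy_variables / make_deploy_resources_variables signature
# in SkyPilot may differ.
def make_deploy_resources_variables(self, resources, cluster_name_on_cloud,
                                    region, zones):
    del zones  # FluidStack has no zones.
    return {
        'instance_type': resources.instance_type,
        # Resolve the SSH username per region here instead of in
        # backend_utils._write_cluster_config.
        'fluidstack_username': self.default_username(region.name),
    }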
sky/provision/fluidstack/instance.py
Outdated
instances = _filter_instances(cluster_name_on_cloud, None)
non_running_states = [
    'create', 'requesting', 'provisioning', 'customizing', 'start',
    'starting', 'rebooting', 'stopping', 'stop', 'stopped', 'reboot',
    'terminating'
]
status_map = {}
for state in non_running_states:
    status_map[state] = status_lib.ClusterStatus.INIT
status_map['running'] = status_lib.ClusterStatus.UP
statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
for inst_id, inst in instances.items():
    status = status_map[inst['status']]
    if non_terminated_only and status is None:
        continue
    statuses[inst_id] = status
return statuses
It seems we have a more detailed status mapping in Fluidstack.query_status. We should move that here.
Suggested change (replace the non_running_states logic with the detailed mapping):

instances = _filter_instances(cluster_name_on_cloud, None)
status_map = {
    'provisioning': status_lib.ClusterStatus.INIT,
    'requesting': status_lib.ClusterStatus.INIT,
    'create': status_lib.ClusterStatus.INIT,
    'customizing': status_lib.ClusterStatus.INIT,
    'stopping': status_lib.ClusterStatus.STOPPED,
    'stop': status_lib.ClusterStatus.STOPPED,
    'start': status_lib.ClusterStatus.INIT,
    'reboot': status_lib.ClusterStatus.STOPPED,
    'rebooting': status_lib.ClusterStatus.STOPPED,
    'stopped': status_lib.ClusterStatus.STOPPED,
    'starting': status_lib.ClusterStatus.INIT,
    'running': status_lib.ClusterStatus.UP,
    'failed to create': status_lib.ClusterStatus.INIT,
    'timeout error': status_lib.ClusterStatus.INIT,
    'out of stock': status_lib.ClusterStatus.INIT,
}
statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
for inst_id, inst in instances.items():
    status = status_map.get(inst['status'], None)
    if non_terminated_only and status is None:
        continue
    statuses[inst_id] = status
return statuses
sky/clouds/fluidstack.py
Outdated
status_map = {
    'provisioning': status_lib.ClusterStatus.INIT,
    'requesting': status_lib.ClusterStatus.INIT,
    'create': status_lib.ClusterStatus.INIT,
    'customizing': status_lib.ClusterStatus.INIT,
    'stopping': status_lib.ClusterStatus.STOPPED,
    'stop': status_lib.ClusterStatus.STOPPED,
    'start': status_lib.ClusterStatus.INIT,
    'reboot': status_lib.ClusterStatus.STOPPED,
    'rebooting': status_lib.ClusterStatus.STOPPED,
    'stopped': status_lib.ClusterStatus.STOPPED,
    'starting': status_lib.ClusterStatus.INIT,
    'running': status_lib.ClusterStatus.UP,
    'failed to create': status_lib.ClusterStatus.INIT,
    'timeout error': status_lib.ClusterStatus.INIT,
    'out of stock': status_lib.ClusterStatus.INIT,
}
status_list = []
filtered = fluidstack_utils.FluidstackClient().list_instances(tag_filters)
for node in filtered:
    node_status = status_map.get(node['status'], None)
    if node_status is not None:
        status_list.append(node_status)
return status_list
This is no longer needed, as we have sky.provision.fluidstack.instance.query_instances already.
(Suggested change: remove this entire status-mapping block from query_status, shown above.)
@@ -0,0 +1,223 @@
import functools
Please move this file into sky.provision.fluidstack. We should get rid of the providers folder for clouds with the new provisioner API.
sky/templates/fluidstack-ray.yml.j2
Outdated
{% if num_nodes > 1 %}
ray_worker_default:
  min_workers: {{num_nodes - 1}}
  max_workers: {{num_nodes - 1}}
  resources: {}
  node_config:
    InstanceType: {{instance_type}}
    AuthorizedKey: |
      skypilot:ssh_public_key_content
{%- endif %}
This is no longer needed, as we will always create instances with the head config.
(Suggested change: remove the ray_worker_default block shown above.)
sky/templates/fluidstack-ray.yml.j2
Outdated
# Command to start ray on the head node. You don't need to change this.
# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH
# connection, which is expensive. Try your best to co-locate commands into fewer
# items! The same comment applies for worker_start_ray_commands.
#
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 2
head_start_ray_commands:
  # Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before skypilot is installed.)
  - ((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);
    ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;

{%- if num_nodes > 1 %}
worker_start_ray_commands:
  - ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{%- else %}
worker_start_ray_commands: []
{%- endif %}

head_node: {}
worker_nodes: {}

# These fields are required for external cloud providers.
head_setup_commands: []
worker_setup_commands: []
cluster_synced_files: []
file_mounts_sync_continuously: False
These lines are no longer needed. Please refer to skypilot/sky/templates/aws-ray.yml.j2, lines 173 to 176 in 5cccd0f:

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
tests/test_smoke.py
Outdated
@@ -784,7 +794,7 @@ def test_file_mounts(generic_cloud: str):
         'using_file_mounts',
         test_commands,
         f'sky down -y {name}',
-        timeout=20 * 60,  # 20 mins
+        (generic_cloud, 20 * 60),  # 20 mins
Suggested change:
-        (generic_cloud, 20 * 60),  # 20 mins
+        timeout=_get_timeout(generic_cloud, 20 * 60),  # 20 mins
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('catalog_dir', required=False)
`required` is not a valid argument for a positional argument. How about we keep the logic the same as our other clouds, i.e., saving to the current directory?
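For reference, argparse rejects required= on positionals; a minimal sketch of an optional positional argument (argument names here are illustrative):

import argparse

parser = argparse.ArgumentParser()
# Positional arguments cannot take required=...; make one optional with
# nargs='?' and a default (here: the current directory) instead.
parser.add_argument('catalog_dir', nargs='?', default='.')
args = parser.parse_args()
print(args.catalog_dir)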
sky/clouds/fluidstack.py
Outdated
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Suggested change:
-    'FluidStack cloud does not support'
-    ' stopping VMs. for all DCs',
+    'FluidStack cloud does not support'
+    ' stopping VMs.',
It seems we do have the `stopping`/`stopped` states for a VM. Is stopping supported by the cloud but not by this implementation? If that is the case, maybe we can rephrase the hint to 'Stopping clusters in Fluidstack is not supported by SkyPilot.'
Just a kind bump for the suggestion changes for the hint string above. : )
sky/clouds/fluidstack.py
Outdated
clouds.CloudImplementationFeatures.SPOT_INSTANCE:
    'Spot instances are'
    f' not supported in {_REPR}.',
clouds.CloudImplementationFeatures.DOCKER_IMAGE:
Suggested change:
-    clouds.CloudImplementationFeatures.DOCKER_IMAGE:
+    clouds.CloudImplementationFeatures.IMAGE_ID:
+        f'Specifying image ID is not supported for {_REPR}.',
+    clouds.CloudImplementationFeatures.DOCKER_IMAGE:
sky/clouds/fluidstack.py
Outdated
@classmethod
def _cloud_unsupported_features(
        cls) -> Dict[clouds.CloudImplementationFeatures, str]:
    return cls._CLOUD_UNSUPPORTED_FEATURES
This is no longer needed.
(Suggested change: remove the _cloud_unsupported_features classmethod shown above.)
sky/clouds/fluidstack.py
Outdated
@classmethod
def get_current_user_identity(cls) -> Optional[List[str]]:
    # TODO(ewzeng): Implement get_current_user_identity for Fluidstack
nit:
Suggested change:
-    # TODO(ewzeng): Implement get_current_user_identity for Fluidstack
+    # TODO: Implement get_current_user_identity for Fluidstack
DEFAULT_FLUIDSTACK_API_KEY_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_key')
DEFAULT_FLUIDSTACK_API_TOKEN_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_token')
Should these be ~/.fluidstack/api_key and ~/.fluidstack/api_token instead, as hinted in clouds.Fluidstack.check_credentials?
sky/templates/fluidstack-ray.yml.j2
Outdated
sudo apt update;
sudo apt-get -y install libcudnn8=8.8.0.121-1+cuda11.8 libcudnn8-dev=8.8.0.121-1+cuda11.8 libnccl2=2.16.5-1+cuda11.8 libnccl-dev=2.16.5-1+cuda11.8 cuda-toolkit-11-8 cuda-11-8 build-essential devscripts debhelper fakeroot cuda-drivers=520.61.05-1 python3-pip cuda-nvcc-11-8 linux-generic-hwe-20.04 tensorrt=8.5.3.1-1+cuda11.8 libnvinfer8=8.5.3-1+cuda11.8 libnvinfer-plugin8=8.5.3-1+cuda11.8 libnvparsers8=8.5.3-1+cuda11.8 libnvonnxparsers8=8.5.3-1+cuda11.8 libnvinfer-bin=8.5.3-1+cuda11.8 libnvinfer-dev=8.5.3-1+cuda11.8 libnvinfer-plugin-dev=8.5.3-1+cuda11.8 libnvparsers-dev=8.5.3-1+cuda11.8 libnvonnxparsers-dev=8.5.3-1+cuda11.8 libnvinfer-samples=8.5.3-1+cuda11.8;
This can take a significant amount of time. Is it possible to use an image that has those drivers installed already? Also, it would be better to install CUDA 12.2/12.1 instead of 11.8 to be aligned with the other clouds.
sky/adaptors/fluidstack.py
Outdated
global _fluidstack_sdk
if _fluidstack_sdk is None:
    try:
        import fluidstack as _fluidstack  # pylint: disable=import-outside-toplevel
We should update setup.py to have an extra for fluidstack that installs the dependency.
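A minimal sketch of what such an extra could look like (the PyPI package name and SkyPilot's actual extras layout are assumptions):

from setuptools import setup

# Hypothetical snippet; SkyPilot's real setup.py composes extras differently.
extras_require = {
    # Assumed PyPI package name for the FluidStack SDK.
    'fluidstack': ['fluidstack'],
}

setup(
    name='skypilot',
    version='0.0.0',  # Placeholder.
    extras_require=extras_require,
)

Users could then install it with: pip install "skypilot[fluidstack]"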
Hmm, actually, it seems we are not using the fluidstack package anywhere. Should we leave this file out?
sky/provision/fluidstack/instance.py
Outdated
instances[instance_id] = [
    common.InstanceInfo(
        instance_id=instance_id,
        internal_ip=instance_info['ip_address'],
We need the private IP for job scheduling to work on multi-node clusters. Is it possible to set the internal IP to the private IP instead?
If the instance_info does not return the private IP of the instance, an easy way to solve this is to ssh into the instance and get the private IP manually:
from sky import authentication as auth
from sky.utils import command_runner
from sky.utils import subprocess_utils

_GET_INTERNAL_IP_CMD = (
    'ip -4 -br addr show | grep UP | grep -Eo '
    '"(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))'
    '\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
    '\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"')


def get_internal_ip(node_info: Dict[str, Any]) -> None:
    runner = command_runner.SSHCommandRunner(
        node_info['ip_address'],
        ssh_user=node_info['capabilities']['default_user_name'],
        ssh_private_key=auth.PRIVATE_SSH_KEY_PATH)
    rc, stdout, stderr = runner.run(_GET_INTERNAL_IP_CMD,
                                    require_outputs=True,
                                    stream_logs=False)
    subprocess_utils.handle_returncode(
        rc,
        _GET_INTERNAL_IP_CMD,
        'Failed to obtain private IP from node',
        stderr=stdout + stderr)
    node_info['internal_ip'] = stdout.strip()


subprocess_utils.run_in_parallel(get_internal_ip, running_instances.values())
head_instance_id = None
for instance_id, instance_info in running_instances.items():
    instance_id = instance_info['id']
    instances[instance_id] = [
        common.InstanceInfo(
            instance_id=instance_id,
            internal_ip=instance_info['internal_ip'],
            external_ip=instance_info['ip_address'],
            ssh_port=instance_info['ssh_port'],
            tags={},
        )
    ]
    if instance_info['hostname'].endswith('-head'):
        head_instance_id = instance_id
The following is the full set of changes you may want to refer to ; )
781c630
Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.
* changes requested in PR comments
* use public ip addresses for nodes if internal ips not available
* upgrade CUDA installation commands
* use images with NVIDIA drivers pre-installed where available

Merge branch 'fluidstack-provisioner' of github.com:fluidstackio/skypilot into fluidstack-provisioner

* Removed deprecated skylet/providers/fluidstack directory
* Fix pylint warnings
* Removed deprecated code in cloud_vm_ray_backend.py
Force-pushed from 1f6412c to 85af027.
Thanks for the quick fix @mjibril! The PR looks mostly good to me with some bumps for the changes proposed in the comments, especially: #3086 (comment)
sky/provision/fluidstack/instance.py
Outdated
# subprocess_utils.handle_returncode(rc,
#                                    _GET_INTERNAL_IP_CMD,
#                                    'Failed get obtain private IP from node',
#                                    stderr=stdout + stderr)
We should probably raise here to fail early when we fail to get internal IP. Do we want to uncomment this?
In some DCs, due to the virtualisation/configuration, `ip -4 -br addr show | grep UP` only shows the external IP address, which fails the regex.
Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.
Just tried the region without the internal IP, and it seems to work correctly. Thanks for the fix here.
nit: it would be good to change the logger.error above to logger.debug as well, since whether or not the internal IP is present will not affect the end user's usage of the VM. : )
sky/clouds/fluidstack.py
Outdated
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Just a kind bump for the suggestion changes for the hint string above. : )
sky/provision/fluidstack/instance.py
Outdated
instances[instance_id] = [
    common.InstanceInfo(
        instance_id=instance_id,
        internal_ip=instance_info['ip_address'],
Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.
Force-pushed from 3a8a785 to b03de47.
sky/clouds/fluidstack.py
Outdated
@classmethod
def check_disk_tier_enabled(cls, instance_type: Optional[str],
                            disk_tier: DiskTier) -> None:
    raise exceptions.NotSupportedError(
        'Disk tier is not supported by FluidStack.')
This is not needed. Removing this will also fix the CI error. : )
(Suggested change: remove the check_disk_tier_enabled classmethod shown above.)
sky/clouds/fluidstack.py
Outdated
sudo apt-get -y install cuda-toolkit-12-3;
sudo apt-get install -y cuda-drivers;
sudo apt-get install -y python3-pip;
nvidia-smi;"""
Suggested change:
-    nvidia-smi;"""
+    nvidia-smi || sudo reboot;"""

This will fix the issue where nvidia-smi fails to be found in the user job.
Force-pushed from b03de47 to 9b11fcb.
sky/provision/fluidstack/instance.py
Outdated
# subprocess_utils.handle_returncode(rc,
#                                    _GET_INTERNAL_IP_CMD,
#                                    'Failed get obtain private IP from node',
#                                    stderr=stdout + stderr)
Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.
@@ -267,6 +274,7 @@ def test_minimal(generic_cloud: str):
         f'sky logs {name} 2 --status',  # Ensure the job succeeded.
     ],
     f'sky down -y {name}',
+    _get_timeout(generic_cloud),
Can we revert this _get_timeout? It seems this function is only used in tests that have the no_fluidstack mark.
_get_timeout is used in test_minimal, test_inline_env_file, test_multi_hostname, test_inline_env_file and test_file_mounts. These tests are applicable to fluidstack. Without extended timeouts, the smoke tests may fail due to the additional time required to install packages.
Oh, makes sense! I misread that. It looks good to me.
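For context, a plausible sketch of such a per-cloud timeout helper (the actual implementation and default values in tests/test_smoke.py are assumptions):

# Hypothetical sketch of the timeout helper discussed above.
_DEFAULT_TIMEOUT = 15 * 60  # 15 mins.
_FLUIDSTACK_TIMEOUT = 60 * 60  # Extra time for driver/package installation.


def _get_timeout(generic_cloud: str,
                 override_timeout: int = _DEFAULT_TIMEOUT) -> int:
    timeouts = {'fluidstack': _FLUIDSTACK_TIMEOUT}
    return timeouts.get(generic_cloud, override_timeout)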
Thanks for the great effort for supporting Fluidstack in SkyPilot @mjibril! The PR looks mostly good to me. Left some final comments and I think we should be ready to go after they are resolved.
Force-pushed from 6020f61 to cdb419f.
Force-pushed from cdb419f to fc2ae25.
* Reformat after pull from skypilot/master
* Implement multi-node support for FluidStack
* Use CUDA installation on plain distro image
* Removed `check_disk_tier_enabled` from `fluidstack.py`
Force-pushed from fc2ae25 to ca493e8.
Added FluidStack using the Provisioner interface
Added _get_timeout in tests/test_smoke.py for custom cloud timeouts
Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config

Smoke tests run:
* test_large_job_queue
* test_minimal
* test_huggingface_glue_imdb_app

Tested (run the relevant ones):
* Code formatting: bash format.sh
* Any manual or new tests for this PR (please specify below)
* All smoke tests: pytest tests/test_smoke.py
* Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name (as above)
* Backward compatibility tests: bash tests/backward_comaptibility_tests.sh