[FluidStack] Add FluidStack Integration (Provisioner Interface) #3086

Merged
merged 8 commits into skypilot-org:master from fluidstack-provisioner on Feb 20, 2024

Conversation

mjibril
Contributor

@mjibril mjibril commented Feb 5, 2024

Added FluidStack using the Provisioner interface

  • Added _get_timeout in tests/test_smoke.py for custom cloud timeouts

  • Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config

Smoke tests

  • test_large_job_queue
  • test_minimal
  • test_huggingface_glue_imdb_app
  • etc

Tested (run the relevant ones):

  • Code formatting: bash format.sh

  • Any manual or new tests for this PR (please specify below)

    • sky launch -c test-single-instance --cloud fluidstack echo hi
    • sky down test-single-instance
  • All smoke tests: pytest tests/test_smoke.py

  • Relevant individual smoke tests: pytest tests/test_smoke.py::<test_name> (the smoke tests listed above)

  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

mjibril and others added 2 commits February 5, 2024 16:31
* Added FluidStack using the Provisioner interface

* Added _get_timeout in tests/test_smoke.py for custom cloud timeouts

* Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for moving the fluidstack implementation to the new provisioner @mjibril! The implementation looks mostly good to me. Left several comments. Will try it out later.

Comment on lines 933 to 963
    @staticmethod
    def _fluidstack_handler(blocked_resources: Set['resources_lib.Resources'],
                            launchable_resources: 'resources_lib.Resources',
                            region: 'clouds.Region',
                            zones: Optional[List['clouds.Zone']], stdout: str,
                            stderr: str) -> None:
        del zones  # Unused.
        style = colorama.Style
        stdout_splits = stdout.split('\n')
        stderr_splits = stderr.split('\n')
        errors = [
            s.strip()
            for s in stdout_splits + stderr_splits
            if 'FluidstackAPIError:' in s.strip()
        ]
        if not errors:
            logger.info('====== stdout ======')
            for s in stdout_splits:
                print(s)
            logger.info('====== stderr ======')
            for s in stderr_splits:
                print(s)
            with ux_utils.print_exception_no_traceback():
                raise RuntimeError('Errors occurred during provision; '
                                   'check logs above.')

        logger.warning(f'Got error(s) in {region.name}:')
        messages = '\n\t'.join(errors)
        logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
        _add_to_blocked_resources(blocked_resources,
                                  launchable_resources.copy(zone=None))
Collaborator

This is not needed now, as we will use the FailoverCloudErrorHandlerV2

Suggested change: remove the entire _fluidstack_handler method above.

'IBM', 'AWS', 'Azure', 'Cloud', 'GCP', 'Lambda', 'Local', 'SCP', 'RunPod',
'OCI', 'Vsphere', 'Kubernetes', 'CloudImplementationFeatures', 'Region',
'Zone', 'CLOUD_REGISTRY', 'ProvisionerVersion', 'StatusVersion',
'Fluidstack'
Collaborator

nit: just for readability

Suggested change
'Fluidstack'
'Fluidstack',

Comment on lines 793 to 796
    fluidstack_username = 'ubuntu'
    if isinstance(cloud, clouds.Fluidstack):
        fluidstack_username = cloud.default_username(to_provision.region)

Collaborator

Let's move this into the cloud.make_deploy_variables called in L781, instead of having it in this function.
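For illustration, a rough sketch of what that could look like on the FluidStack cloud class. This is a hedged sketch only: the hook is shown here as make_deploy_resources_variables, and its exact signature and the other keys returned are assumptions, not the PR's actual code.

# Hedged sketch: return the SSH username from the cloud's own deploy-variables
# hook so backend_utils._write_cluster_config does not need a FluidStack
# special case. Signature and surrounding keys are assumptions.
def make_deploy_resources_variables(self, resources, cluster_name_on_cloud,
                                    region, zones):
    del zones  # FluidStack has no zones.
    return {
        'instance_type': resources.instance_type,
        # Per-region default user, consumed by the FluidStack ray YAML template.
        'fluidstack_username': self.default_username(region.name),
    }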

Comment on lines 196 to 213
    instances = _filter_instances(cluster_name_on_cloud, None)
    non_running_states = [
        'create', 'requesting', 'provisioning', 'customizing', 'start',
        'starting', 'rebooting', 'stopping', 'stop', 'stopped', 'reboot',
        'terminating'
    ]
    status_map = {}
    for state in non_running_states:
        status_map[state] = status_lib.ClusterStatus.INIT

    status_map['running'] = status_lib.ClusterStatus.UP
    statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
    for inst_id, inst in instances.items():
        status = status_map[inst['status']]
        if non_terminated_only and status is None:
            continue
        statuses[inst_id] = status
    return statuses
Collaborator

It seems we have a more detailed status mapping in Fluidstack.query_status. We should move that here.

Suggested change (replace the block above with):

    instances = _filter_instances(cluster_name_on_cloud, None)
    status_map = {
        'provisioning': status_lib.ClusterStatus.INIT,
        'requesting': status_lib.ClusterStatus.INIT,
        'create': status_lib.ClusterStatus.INIT,
        'customizing': status_lib.ClusterStatus.INIT,
        'stopping': status_lib.ClusterStatus.STOPPED,
        'stop': status_lib.ClusterStatus.STOPPED,
        'start': status_lib.ClusterStatus.INIT,
        'reboot': status_lib.ClusterStatus.STOPPED,
        'rebooting': status_lib.ClusterStatus.STOPPED,
        'stopped': status_lib.ClusterStatus.STOPPED,
        'starting': status_lib.ClusterStatus.INIT,
        'running': status_lib.ClusterStatus.UP,
        'failed to create': status_lib.ClusterStatus.INIT,
        'timeout error': status_lib.ClusterStatus.INIT,
        'out of stock': status_lib.ClusterStatus.INIT,
    }
    statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
    for inst_id, inst in instances.items():
        status = status_map.get(inst['status'], None)
        if non_terminated_only and status is None:
            continue
        statuses[inst_id] = status
    return statuses

Comment on lines 338 to 362
    status_map = {
        'provisioning': status_lib.ClusterStatus.INIT,
        'requesting': status_lib.ClusterStatus.INIT,
        'create': status_lib.ClusterStatus.INIT,
        'customizing': status_lib.ClusterStatus.INIT,
        'stopping': status_lib.ClusterStatus.STOPPED,
        'stop': status_lib.ClusterStatus.STOPPED,
        'start': status_lib.ClusterStatus.INIT,
        'reboot': status_lib.ClusterStatus.STOPPED,
        'rebooting': status_lib.ClusterStatus.STOPPED,
        'stopped': status_lib.ClusterStatus.STOPPED,
        'starting': status_lib.ClusterStatus.INIT,
        'running': status_lib.ClusterStatus.UP,
        'failed to create': status_lib.ClusterStatus.INIT,
        'timeout error': status_lib.ClusterStatus.INIT,
        'out of stock': status_lib.ClusterStatus.INIT,
    }
    status_list = []
    filtered = fluidstack_utils.FluidstackClient().list_instances(
        tag_filters)
    for node in filtered:
        node_status = status_map.get(node['status'], None)
        if node_status is not None:
            status_list.append(node_status)
    return status_list
Collaborator

This is no longer needed, as we have the sky.provision.fluidstack.instance.query_instances already.

Suggested change: remove the status-mapping block above entirely.

@@ -0,0 +1,223 @@
import functools
Collaborator

Please move this file into sky.provision.fluidstack. We should get rid of the providers folder for clouds with the new provisioner API.

Comment on lines 26 to 35
{% if num_nodes > 1 %}
  ray_worker_default:
    min_workers: {{num_nodes - 1}}
    max_workers: {{num_nodes - 1}}
    resources: {}
    node_config:
      InstanceType: {{instance_type}}
      AuthorizedKey: |
        skypilot:ssh_public_key_content
{%- endif %}
Collaborator

This is no longer needed, as we will always create instances with the head config.

Suggested change: remove the ray_worker_default block above entirely.

Comment on lines 92 to 120
# Command to start ray on the head node. You don't need to change this.
# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH
# connection, which is expensive. Try your best to co-locate commands into fewer
# items! The same comment applies for worker_start_ray_commands.
#
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 2
head_start_ray_commands:
  # Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before skypilot is installed.)
  - ((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);
    ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;

{%- if num_nodes > 1 %}
worker_start_ray_commands:
  - ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{%- else %}
worker_start_ray_commands: []
{%- endif %}

head_node: {}
worker_nodes: {}

# These fields are required for external cloud providers.
head_setup_commands: []
worker_setup_commands: []
cluster_synced_files: []
file_mounts_sync_continuously: False
Collaborator

These lines are no longer needed. Please refer to:

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.

@@ -784,7 +794,7 @@ def test_file_mounts(generic_cloud: str):
'using_file_mounts',
test_commands,
f'sky down -y {name}',
timeout=20 * 60, # 20 mins
(generic_cloud, 20 * 60), # 20 mins
Collaborator

Suggested change
(generic_cloud, 20 * 60), # 20 mins
timeout=_get_timeout(generic_cloud, 20 * 60), # 20 mins


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('catalog_dir', required=False)
Collaborator

`required` is not a valid argument for a positional argument. How about we keep the logic the same as our other clouds, i.e., saving to the current directory?
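For reference, a minimal sketch of the fix being suggested (defaulting to the current directory is an assumption that mirrors the comment above, not the PR's final code):

import argparse

# 'required' is only valid for optional (flag) arguments; a positional argument
# can instead be made optional with nargs='?' and given a default.
parser = argparse.ArgumentParser()
parser.add_argument('catalog_dir', nargs='?', default='.',
                    help='Directory to save the fetched catalog to.')
args = parser.parse_args()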

Comment on lines 40 to 41
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Collaborator

Suggested change
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
'FluidStack cloud does not support'
' stopping VMs.',

Collaborator

It seems we do have stopping/stopped states for a VM; is stopping supported by the cloud but not by this implementation?
If that is the case, maybe we can rephrase the message to 'Stopping clusters in Fluidstack is not supported by SkyPilot.'

Collaborator

Just a kind bump for the suggested changes to the hint string above. : )

        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
            'Spot instances are'
            f' not supported in {_REPR}.',
        clouds.CloudImplementationFeatures.DOCKER_IMAGE:
Collaborator

Suggested change
clouds.CloudImplementationFeatures.DOCKER_IMAGE:
clouds.CloudImplementationFeatures.IMAGE_ID: f'Specifying image ID is not supported for {_REPR}.',
clouds.CloudImplementationFeatures.DOCKER_IMAGE:

Comment on lines 62 to 65
    @classmethod
    def _cloud_unsupported_features(
            cls) -> Dict[clouds.CloudImplementationFeatures, str]:
        return cls._CLOUD_UNSUPPORTED_FEATURES
Collaborator

This is no longer needed.

Suggested change: remove the _cloud_unsupported_features method above.


    @classmethod
    def get_current_user_identity(cls) -> Optional[List[str]]:
        # TODO(ewzeng): Implement get_current_user_identity for Fluidstack
Collaborator

nit:

Suggested change
# TODO(ewzeng): Implement get_current_user_identity for Fluidstack
# TODO: Implement get_current_user_identity for Fluidstack

Comment on lines 22 to 25
DEFAULT_FLUIDSTACK_API_KEY_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_key')
DEFAULT_FLUIDSTACK_API_TOKEN_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_token')
Collaborator

Should these be ~/.fluidstack/api_key and ~/.fluidstack/api_token instead, as hinted in clouds.Fluidstack.check_credentials?

Comment on lines 76 to 77
sudo apt update;
sudo apt-get -y install libcudnn8=8.8.0.121-1+cuda11.8 libcudnn8-dev=8.8.0.121-1+cuda11.8 libnccl2=2.16.5-1+cuda11.8 libnccl-dev=2.16.5-1+cuda11.8 cuda-toolkit-11-8 cuda-11-8 build-essential devscripts debhelper fakeroot cuda-drivers=520.61.05-1 python3-pip cuda-nvcc-11-8 linux-generic-hwe-20.04 tensorrt=8.5.3.1-1+cuda11.8 libnvinfer8=8.5.3-1+cuda11.8 libnvinfer-plugin8=8.5.3-1+cuda11.8 libnvparsers8=8.5.3-1+cuda11.8 libnvonnxparsers8=8.5.3-1+cuda11.8 libnvinfer-bin=8.5.3-1+cuda11.8 libnvinfer-dev=8.5.3-1+cuda11.8 libnvinfer-plugin-dev=8.5.3-1+cuda11.8 libnvparsers-dev=8.5.3-1+cuda11.8 libnvonnxparsers-dev=8.5.3-1+cuda11.8 libnvinfer-samples=8.5.3-1+cuda11.8;
Collaborator

This can take a significant amount of time. Is it possible to use an image that already has those drivers installed? Also, it would be better to install CUDA 12.2/12.1 instead of 11.8 to align with the other clouds.

    global _fluidstack_sdk
    if _fluidstack_sdk is None:
        try:
            import fluidstack as _fluidstack  # pylint: disable=import-outside-toplevel
Collaborator

We should update the setup.py to have an extra for fluidstack that installs the dependency.
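Roughly, that would mean an entry like the following in the setup extras (a sketch only; the actual extras layout in SkyPilot's setup files and the exact FluidStack SDK package name/version are assumptions):

# Hypothetical sketch of a per-cloud extra; adjust the package name and any
# version pin to the real FluidStack SDK dependency before adopting.
extras_require = {
    # ... existing per-cloud extras ...
    'fluidstack': ['fluidstack'],
}
extras_require['all'] = sum(extras_require.values(), [])

With such an extra, users could pull in the dependency via pip install "skypilot[fluidstack]".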

Collaborator

Hmm, actually, it seems we are not using the fluidstack package anywhere. Should we leave this file out?

    instances[instance_id] = [
        common.InstanceInfo(
            instance_id=instance_id,
            internal_ip=instance_info['ip_address'],
Collaborator

We need the private IP for job scheduling to work on multi-node clusters. Is it possible to use the private IP as the internal IP instead?

Collaborator

@Michaelvll Michaelvll Feb 6, 2024

If the instance_info does not return the private IP of the instance, an easy way to solve this is to SSH into the instance and get the private IP manually.

from sky import authentication as auth
from sky.utils import command_runner
from sky.utils import subprocess_utils

_GET_INTERNAL_IP_CMD = 'ip -4 -br addr show | grep UP | grep -Eo "(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"'


def get_internal_ip(node_info: Dict[str, Any]) -> None:
    runner = command_runner.SSHCommandRunner(
        node_info['ip_address'],
        ssh_user=node_info['capabilities']['default_user_name'],
        ssh_private_key=auth.PRIVATE_SSH_KEY_PATH)
    rc, stdout, stderr = runner.run(_GET_INTERNAL_IP_CMD,
                                    require_outputs=True,
                                    stream_logs=False)
    subprocess_utils.handle_returncode(rc,
                                       _GET_INTERNAL_IP_CMD,
                                       'Failed get obtain private IP from node',
                                       stderr=stdout + stderr)
    node_info['internal_ip'] = stdout.strip()


subprocess_utils.run_in_parallel(get_internal_ip, running_instances.values())

head_instance_id = None
for instance_id, instance_info in running_instances.items():
    instance_id = instance_info['id']
    instances[instance_id] = [
        common.InstanceInfo(
            instance_id=instance_id,
            internal_ip=instance_info['internal_ip'],
            external_ip=instance_info['ip_address'],
            ssh_port=instance_info['ssh_port'],
            tags={},
        )
    ]
    if instance_info['hostname'].endswith('-head'):
        head_instance_id = instance_id
The full set of changes you may want to refer to is in commit 781c630. ; )

Collaborator

Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.

mjibril and others added 5 commits February 12, 2024 10:33
* Changes requested in PR comments
* use public ip addresses for nodes if internal ips not available

* upgrade CUDA installation commands

* use images with NVIDIA drivers pre-installed where available
Merge branch 'fluidstack-provisioner' of github.com:fluidstackio/skypilot into fluidstack-provisioner
* Removed deprecated skylet/providers/fluidstack directory

* Fix pylint warnings

* Removed deprecated code in cloud_vm_ray_backend.py
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the quick fix @mjibril! The PR looks mostly good to me with some bumps for the changes proposed in the comments, especially: #3086 (comment)

sky/clouds/fluidstack.py (resolved)
Comment on lines 37 to 41
# subprocess_utils.handle_returncode(rc,
# _GET_INTERNAL_IP_CMD,
# 'Failed get obtain private IP from node',
# stderr=stdout + stderr)
Collaborator

We should probably raise here to fail early when we fail to get internal IP. Do we want to uncomment this?

Contributor Author

In some DCs, due to the virtualisation/configuration, ip -4 -br addr show | grep UP only shows the external IP address, which fails the regex.

Collaborator

@Michaelvll Michaelvll Feb 16, 2024

Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.
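As a hedged illustration of that fallback (names follow the snippet earlier in this thread, and logger stands for the module's logger; the actual code in sky/provision/fluidstack/instance.py may differ):

def get_internal_ip(node_info: Dict[str, Any]) -> None:
    runner = command_runner.SSHCommandRunner(
        node_info['ip_address'],
        ssh_user=node_info['capabilities']['default_user_name'],
        ssh_private_key=auth.PRIVATE_SSH_KEY_PATH)
    rc, stdout, _ = runner.run(_GET_INTERNAL_IP_CMD,
                               require_outputs=True,
                               stream_logs=False)
    # Some DCs do not expose an internal IP and the regex matches nothing; fall
    # back to the external IP instead of raising, as described above.
    if rc != 0 or not stdout.strip():
        logger.debug('Could not detect an internal IP; using the external IP.')
        node_info['internal_ip'] = node_info['ip_address']
    else:
        node_info['internal_ip'] = stdout.strip()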

Collaborator

Just tried the region without the internal IP, and it seems to work correctly. Thanks for the fix here.

nit: it would be good to change the logger.error above to logger.debug as well, since whether or not the internal IP is present will not affect the end user's usage of the VM. : )

sky/provision/fluidstack/instance.py (resolved)
Comment on lines 40 to 41
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Collaborator

Just a kind bump for the suggested changes to the hint string above. : )

    instances[instance_id] = [
        common.InstanceInfo(
            instance_id=instance_id,
            internal_ip=instance_info['ip_address'],
Collaborator

Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.

@mjibril mjibril force-pushed the fluidstack-provisioner branch 6 times, most recently from 3a8a785 to b03de47 on February 16, 2024 16:06
Comment on lines 305 to 309
    @classmethod
    def check_disk_tier_enabled(cls, instance_type: Optional[str],
                                disk_tier: DiskTier) -> None:
        raise exceptions.NotSupportedError(
            'Disk tier is not supported by FluidStack.')
Collaborator

This is not needed. Removing this will also fix the CI error. : )

Suggested change: remove the check_disk_tier_enabled override above.

sudo apt-get -y install cuda-toolkit-12-3;
sudo apt-get install -y cuda-drivers;
sudo apt-get install -y python3-pip;
nvidia-smi;"""
Collaborator

Suggested change
nvidia-smi;"""
nvidia-smi || sudo reboot;"""

This will fix the issue where nvidia-smi fails to be found in the user job.

Comment on lines 37 to 41
# subprocess_utils.handle_returncode(rc,
# _GET_INTERNAL_IP_CMD,
# 'Failed get obtain private IP from node',
# stderr=stdout + stderr)
Collaborator

@Michaelvll Michaelvll Feb 16, 2024

Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.

@@ -267,6 +274,7 @@ def test_minimal(generic_cloud: str):
            f'sky logs {name} 2 --status',  # Ensure the job succeeded.
        ],
        f'sky down -y {name}',
        _get_timeout(generic_cloud),
Collaborator

Can we revert this _get_timeout? It seems this function is only used in tests that have the no_fluidstack mark.

Contributor Author

_get_timeout is used in test_minimal, test_inline_env_file, test_multi_hostname, and test_file_mounts. These tests are applicable to FluidStack. Without extended timeouts, the smoke tests may fail due to the additional time required to install packages.
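For context, a hedged sketch of what such a helper could look like, inferred only from the call sites quoted in this thread (_get_timeout(generic_cloud) and _get_timeout(generic_cloud, 20 * 60)); the default value, the per-cloud table, and the actual implementation in tests/test_smoke.py are assumptions.

_DEFAULT_TIMEOUT = 15 * 60  # Assumed default, in seconds.
# Clouds that need longer smoke-test timeouts, e.g. because packages are
# installed at instance launch time.
_CLOUD_TIMEOUTS = {'fluidstack': 60 * 60}


def _get_timeout(generic_cloud: str,
                 override_timeout: int = _DEFAULT_TIMEOUT) -> int:
    return _CLOUD_TIMEOUTS.get(generic_cloud.lower(), override_timeout)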

Collaborator

Oh, makes sense! I misread that. It looks good to me.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the great effort in supporting Fluidstack in SkyPilot @mjibril! The PR looks mostly good to me. Left some final comments and I think we should be ready to go after they are resolved.

sky/provision/fluidstack/instance.py (resolved)
@mjibril mjibril force-pushed the fluidstack-provisioner branch 2 times, most recently from 6020f61 to cdb419f on February 19, 2024 19:38
* Reformat after pull from skypilot/master

* Implement multi-node support for FluidStack

* Use CUDA installation on plain distro image

* Removed `check_disk_tier_enabled` from `fluidstack.py`
@Michaelvll Michaelvll merged commit 3ba7032 into skypilot-org:master Feb 20, 2024
19 checks passed