[FluidStack] Add FluidStack Integration (Provisioner Interface) #3086
Conversation
* Added FluidStack using the Provisioner interface
* Added _get_timeout in tests/test_smoke.py for custom cloud timeouts
* Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config
Thanks for moving the fluidstack implementation to the new provisioner @mjibril! The implementation looks mostly good to me. Left several comments. Will try it out later.
sky/backends/cloud_vm_ray_backend.py
Outdated
@staticmethod
def _fluidstack_handler(blocked_resources: Set['resources_lib.Resources'],
                        launchable_resources: 'resources_lib.Resources',
                        region: 'clouds.Region',
                        zones: Optional[List['clouds.Zone']], stdout: str,
                        stderr: str) -> None:
    del zones  # Unused.
    style = colorama.Style
    stdout_splits = stdout.split('\n')
    stderr_splits = stderr.split('\n')
    errors = [
        s.strip()
        for s in stdout_splits + stderr_splits
        if 'FluidstackAPIError:' in s.strip()
    ]
    if not errors:
        logger.info('====== stdout ======')
        for s in stdout_splits:
            print(s)
        logger.info('====== stderr ======')
        for s in stderr_splits:
            print(s)
        with ux_utils.print_exception_no_traceback():
            raise RuntimeError('Errors occurred during provision; '
                               'check logs above.')
    logger.warning(f'Got error(s) in {region.name}:')
    messages = '\n\t'.join(errors)
    logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
    _add_to_blocked_resources(blocked_resources,
                              launchable_resources.copy(zone=None))
This is not needed now, as we will use the FailoverCloudErrorHandlerV2
(Suggested change: remove the entire _fluidstack_handler shown above.)
sky/clouds/__init__.py
Outdated
'IBM', 'AWS', 'Azure', 'Cloud', 'GCP', 'Lambda', 'Local', 'SCP', 'RunPod',
'OCI', 'Vsphere', 'Kubernetes', 'CloudImplementationFeatures', 'Region',
'Zone', 'CLOUD_REGISTRY', 'ProvisionerVersion', 'StatusVersion',
'Fluidstack'
nit: just for readability
Suggested change:
-    'Fluidstack'
+    'Fluidstack',
sky/backends/backend_utils.py
Outdated
fluidstack_username = 'ubuntu'
if isinstance(cloud, clouds.Fluidstack):
    fluidstack_username = cloud.default_username(to_provision.region)
Let's move this into cloud.make_deploy_variables, called in L781, instead of having it in this function.
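For illustration, a rough sketch of moving the username lookup into the cloud class (the method name follows the reviewer's pointer; the exact SkyPilot signature and variable names are assumptions):

# Hypothetical sketch inside sky/clouds/fluidstack.py; the real
# make_deploy_variables / make_deploy_resources_variables signature
# in SkyPilot may differ.
def make_deploy_resources_variables(self, resources, cluster_name_on_cloud,
                                    region, zones):
    del zones  # FluidStack has no zones.
    return {
        'instance_type': resources.instance_type,
        # Resolve the SSH username per region here instead of in
        # backend_utils._write_cluster_config.
        'fluidstack_username': self.default_username(region.name),
    }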
sky/provision/fluidstack/instance.py
Outdated
instances = _filter_instances(cluster_name_on_cloud, None)
non_running_states = [
    'create', 'requesting', 'provisioning', 'customizing', 'start',
    'starting', 'rebooting', 'stopping', 'stop', 'stopped', 'reboot',
    'terminating'
]
status_map = {}
for state in non_running_states:
    status_map[state] = status_lib.ClusterStatus.INIT
status_map['running'] = status_lib.ClusterStatus.UP
statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
for inst_id, inst in instances.items():
    status = status_map[inst['status']]
    if non_terminated_only and status is None:
        continue
    statuses[inst_id] = status
return statuses
It seems we have a more detailed status mapping in Fluidstack.query_status. We should move that here.
Suggested change (replace the non_running_states logic with the detailed mapping):

instances = _filter_instances(cluster_name_on_cloud, None)
status_map = {
    'provisioning': status_lib.ClusterStatus.INIT,
    'requesting': status_lib.ClusterStatus.INIT,
    'create': status_lib.ClusterStatus.INIT,
    'customizing': status_lib.ClusterStatus.INIT,
    'stopping': status_lib.ClusterStatus.STOPPED,
    'stop': status_lib.ClusterStatus.STOPPED,
    'start': status_lib.ClusterStatus.INIT,
    'reboot': status_lib.ClusterStatus.STOPPED,
    'rebooting': status_lib.ClusterStatus.STOPPED,
    'stopped': status_lib.ClusterStatus.STOPPED,
    'starting': status_lib.ClusterStatus.INIT,
    'running': status_lib.ClusterStatus.UP,
    'failed to create': status_lib.ClusterStatus.INIT,
    'timeout error': status_lib.ClusterStatus.INIT,
    'out of stock': status_lib.ClusterStatus.INIT,
}
statuses: Dict[str, Optional[status_lib.ClusterStatus]] = {}
for inst_id, inst in instances.items():
    status = status_map.get(inst['status'], None)
    if non_terminated_only and status is None:
        continue
    statuses[inst_id] = status
return statuses
sky/clouds/fluidstack.py
Outdated
status_map = {
    'provisioning': status_lib.ClusterStatus.INIT,
    'requesting': status_lib.ClusterStatus.INIT,
    'create': status_lib.ClusterStatus.INIT,
    'customizing': status_lib.ClusterStatus.INIT,
    'stopping': status_lib.ClusterStatus.STOPPED,
    'stop': status_lib.ClusterStatus.STOPPED,
    'start': status_lib.ClusterStatus.INIT,
    'reboot': status_lib.ClusterStatus.STOPPED,
    'rebooting': status_lib.ClusterStatus.STOPPED,
    'stopped': status_lib.ClusterStatus.STOPPED,
    'starting': status_lib.ClusterStatus.INIT,
    'running': status_lib.ClusterStatus.UP,
    'failed to create': status_lib.ClusterStatus.INIT,
    'timeout error': status_lib.ClusterStatus.INIT,
    'out of stock': status_lib.ClusterStatus.INIT,
}
status_list = []
filtered = fluidstack_utils.FluidstackClient().list_instances(tag_filters)
for node in filtered:
    node_status = status_map.get(node['status'], None)
    if node_status is not None:
        status_list.append(node_status)
return status_list
This is no longer needed, as we have sky.provision.fluidstack.instance.query_instances already.
(Suggested change: remove this entire status-mapping block from query_status, shown above.)
@@ -0,0 +1,223 @@
import functools
Please move this file into sky.provision.fluidstack. We should get rid of the providers folder for clouds with the new provisioner API.
sky/templates/fluidstack-ray.yml.j2
Outdated
{% if num_nodes > 1 %}
ray_worker_default:
  min_workers: {{num_nodes - 1}}
  max_workers: {{num_nodes - 1}}
  resources: {}
  node_config:
    InstanceType: {{instance_type}}
    AuthorizedKey: |
      skypilot:ssh_public_key_content
{%- endif %}
This is no longer needed, as we will always create instances with the head config.
(Suggested change: remove the ray_worker_default block shown above.)
sky/templates/fluidstack-ray.yml.j2
Outdated
# Command to start ray on the head node. You don't need to change this.
# NOTE: these are very performance-sensitive. Each new item opens/closes an SSH
# connection, which is expensive. Try your best to co-locate commands into fewer
# items! The same comment applies for worker_start_ray_commands.
#
# Increment the following for catching performance bugs easier:
# current num items (num SSH connections): 2
head_start_ray_commands:
  # Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before skypilot is installed.)
  - ((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);
    ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;

{%- if num_nodes > 1 %}
worker_start_ray_commands:
  - ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --address=$RAY_HEAD_IP:6379 --object-manager-port=8076 {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
    which prlimit && for id in $(pgrep -f raylet/raylet); do sudo prlimit --nofile=1048576:1048576 --pid=$id || true; done;
{%- else %}
worker_start_ray_commands: []
{%- endif %}

head_node: {}
worker_nodes: {}

# These fields are required for external cloud providers.
head_setup_commands: []
worker_setup_commands: []
cluster_synced_files: []
file_mounts_sync_continuously: False
These lines are no longer needed. Please refer to skypilot/sky/templates/aws-ray.yml.j2, lines 173 to 176 in 5cccd0f:

# Command to start ray clusters are now placed in `sky.provision.instance_setup`.
# We do not need to list it here anymore.
tests/test_smoke.py
Outdated
@@ -784,7 +794,7 @@ def test_file_mounts(generic_cloud: str):
         'using_file_mounts',
         test_commands,
         f'sky down -y {name}',
-        timeout=20 * 60,  # 20 mins
+        (generic_cloud, 20 * 60),  # 20 mins
Suggested change:
-        (generic_cloud, 20 * 60),  # 20 mins
+        timeout=_get_timeout(generic_cloud, 20 * 60),  # 20 mins
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('catalog_dir', required=False)
`required` is not a valid argument for a positional argument. How about we keep the logic the same as our other clouds, i.e., saving to the current directory?
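For reference, argparse rejects required= on positionals; a minimal sketch of an optional positional argument (argument names here are illustrative):

import argparse

parser = argparse.ArgumentParser()
# Positional arguments cannot take required=...; make one optional with
# nargs='?' and a default (here: the current directory) instead.
parser.add_argument('catalog_dir', nargs='?', default='.')
args = parser.parse_args()
print(args.catalog_dir)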
sky/clouds/fluidstack.py
Outdated
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Suggested change:
-    'FluidStack cloud does not support'
-    ' stopping VMs. for all DCs',
+    'FluidStack cloud does not support'
+    ' stopping VMs.',
It seems we do have the `stopping`/`stopped` states for a VM. Is stopping supported by the cloud but not by this implementation? If that is the case, maybe we can rephrase the hint to 'Stopping clusters in Fluidstack is not supported by SkyPilot.'
Just a kind bump for the suggestion changes for the hint string above. : )
sky/clouds/fluidstack.py
Outdated
clouds.CloudImplementationFeatures.SPOT_INSTANCE:
    'Spot instances are'
    f' not supported in {_REPR}.',
clouds.CloudImplementationFeatures.DOCKER_IMAGE:
Suggested change:
-    clouds.CloudImplementationFeatures.DOCKER_IMAGE:
+    clouds.CloudImplementationFeatures.IMAGE_ID:
+        f'Specifying image ID is not supported for {_REPR}.',
+    clouds.CloudImplementationFeatures.DOCKER_IMAGE:
sky/clouds/fluidstack.py
Outdated
@classmethod
def _cloud_unsupported_features(
        cls) -> Dict[clouds.CloudImplementationFeatures, str]:
    return cls._CLOUD_UNSUPPORTED_FEATURES
This is no longer needed.
(Suggested change: remove the _cloud_unsupported_features classmethod shown above.)
sky/clouds/fluidstack.py
Outdated
@classmethod
def get_current_user_identity(cls) -> Optional[List[str]]:
    # TODO(ewzeng): Implement get_current_user_identity for Fluidstack
nit:
Suggested change:
-    # TODO(ewzeng): Implement get_current_user_identity for Fluidstack
+    # TODO: Implement get_current_user_identity for Fluidstack
DEFAULT_FLUIDSTACK_API_KEY_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_key')
DEFAULT_FLUIDSTACK_API_TOKEN_PATH = os.path.expanduser(
    '~/.fluidstack/fluidstack_api_token')
Should these be ~/.fluidstack/api_key and ~/.fluidstack/api_token instead, as hinted in clouds.Fluidstack.check_credentials?
sky/templates/fluidstack-ray.yml.j2
Outdated
sudo apt update;
sudo apt-get -y install libcudnn8=8.8.0.121-1+cuda11.8 libcudnn8-dev=8.8.0.121-1+cuda11.8 libnccl2=2.16.5-1+cuda11.8 libnccl-dev=2.16.5-1+cuda11.8 cuda-toolkit-11-8 cuda-11-8 build-essential devscripts debhelper fakeroot cuda-drivers=520.61.05-1 python3-pip cuda-nvcc-11-8 linux-generic-hwe-20.04 tensorrt=8.5.3.1-1+cuda11.8 libnvinfer8=8.5.3-1+cuda11.8 libnvinfer-plugin8=8.5.3-1+cuda11.8 libnvparsers8=8.5.3-1+cuda11.8 libnvonnxparsers8=8.5.3-1+cuda11.8 libnvinfer-bin=8.5.3-1+cuda11.8 libnvinfer-dev=8.5.3-1+cuda11.8 libnvinfer-plugin-dev=8.5.3-1+cuda11.8 libnvparsers-dev=8.5.3-1+cuda11.8 libnvonnxparsers-dev=8.5.3-1+cuda11.8 libnvinfer-samples=8.5.3-1+cuda11.8;
This can take a significant amount of time. Is it possible to use an image that has those drivers installed already? Also, it would be better to install CUDA 12.2/12.1 instead of 11.8 to be aligned with the other clouds.
sky/adaptors/fluidstack.py
Outdated
global _fluidstack_sdk
if _fluidstack_sdk is None:
    try:
        import fluidstack as _fluidstack  # pylint: disable=import-outside-toplevel
We should update setup.py to have an extra for fluidstack that installs the dependency.
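A minimal sketch of what such an extra could look like (the PyPI package name and SkyPilot's actual extras layout are assumptions):

from setuptools import setup

# Hypothetical snippet; SkyPilot's real setup.py composes extras differently.
extras_require = {
    # Assumed PyPI package name for the FluidStack SDK.
    'fluidstack': ['fluidstack'],
}

setup(
    name='skypilot',
    version='0.0.0',  # Placeholder.
    extras_require=extras_require,
)

Users could then install it with: pip install "skypilot[fluidstack]"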
Hmm, actually, it seems we are not using the fluidstack package anywhere. Should we leave this file out?
sky/provision/fluidstack/instance.py
Outdated
instances[instance_id] = [
    common.InstanceInfo(
        instance_id=instance_id,
        internal_ip=instance_info['ip_address'],
We need the private IP for job scheduling to work on multi-node clusters. Is it possible to set the internal IP to the private IP instead?
If the instance_info does not return the private IP of the instance, an easy way to solve this is to ssh into the instance and get the private IP manually:
from sky import authentication as auth
from sky.utils import command_runner
from sky.utils import subprocess_utils

_GET_INTERNAL_IP_CMD = (
    'ip -4 -br addr show | grep UP | grep -Eo '
    '"(10\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|172\.(1[6-9]|2[0-9]|3[0-1]))'
    '\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
    '\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"')


def get_internal_ip(node_info: Dict[str, Any]) -> None:
    runner = command_runner.SSHCommandRunner(
        node_info['ip_address'],
        ssh_user=node_info['capabilities']['default_user_name'],
        ssh_private_key=auth.PRIVATE_SSH_KEY_PATH)
    rc, stdout, stderr = runner.run(_GET_INTERNAL_IP_CMD,
                                    require_outputs=True,
                                    stream_logs=False)
    subprocess_utils.handle_returncode(
        rc,
        _GET_INTERNAL_IP_CMD,
        'Failed to obtain private IP from node',
        stderr=stdout + stderr)
    node_info['internal_ip'] = stdout.strip()


subprocess_utils.run_in_parallel(get_internal_ip, running_instances.values())
head_instance_id = None
for instance_id, instance_info in running_instances.items():
    instance_id = instance_info['id']
    instances[instance_id] = [
        common.InstanceInfo(
            instance_id=instance_id,
            internal_ip=instance_info['internal_ip'],
            external_ip=instance_info['ip_address'],
            ssh_port=instance_info['ssh_port'],
            tags={},
        )
    ]
    if instance_info['hostname'].endswith('-head'):
        head_instance_id = instance_id
The following is the full set of changes you may want to refer to ; )
781c630
Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.
* changes requested in PR comments
* use public ip addresses for nodes if internal ips not available
* upgrade CUDA installation commands
* use images with NVIDIA drivers pre-installed where available

Merge branch 'fluidstack-provisioner' of github.com:fluidstackio/skypilot into fluidstack-provisioner

* Removed deprecated skylet/providers/fluidstack directory
* Fix pylint warnings
* Removed deprecated code in cloud_vm_ray_backend.py
Force-pushed from 1f6412c to 85af027.
Thanks for the quick fix @mjibril! The PR looks mostly good to me with some bumps for the changes proposed in the comments, especially: #3086 (comment)
sky/provision/fluidstack/instance.py
Outdated
# subprocess_utils.handle_returncode(rc,
#                                    _GET_INTERNAL_IP_CMD,
#                                    'Failed get obtain private IP from node',
#                                    stderr=stdout + stderr)
We should probably raise here to fail early when we fail to get internal IP. Do we want to uncomment this?
In some DCs, due to the virtualisation/configuration, `ip -4 -br addr show | grep UP` only shows the external IP address, which fails the regex.
Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.
Just tried the region without the internal IP, and it seems to work correctly. Thanks for the fix here.
nit: it would be good to change the logger.error above to logger.debug as well, since whether or not the internal IP is present will not affect the end user's usage of the VM. : )
sky/clouds/fluidstack.py
Outdated
'FluidStack cloud does not support'
' stopping VMs. for all DCs',
Just a kind bump for the suggestion changes for the hint string above. : )
sky/provision/fluidstack/instance.py
Outdated
instances[instance_id] = [
    common.InstanceInfo(
        instance_id=instance_id,
        internal_ip=instance_info['ip_address'],
Just a kind bump for adopting the changes from the commit 781c630, otherwise the multi-node support will still fail.
Force-pushed from 3a8a785 to b03de47.
sky/clouds/fluidstack.py
Outdated
@classmethod
def check_disk_tier_enabled(cls, instance_type: Optional[str],
                            disk_tier: DiskTier) -> None:
    raise exceptions.NotSupportedError(
        'Disk tier is not supported by FluidStack.')
This is not needed. Removing this will also fix the CI error. : )
(Suggested change: remove the check_disk_tier_enabled classmethod shown above.)
sky/clouds/fluidstack.py
Outdated
sudo apt-get -y install cuda-toolkit-12-3;
sudo apt-get install -y cuda-drivers;
sudo apt-get install -y python3-pip;
nvidia-smi;"""
Suggested change:
-    nvidia-smi;"""
+    nvidia-smi || sudo reboot;"""

This will fix the issue where nvidia-smi fails to be found in the user job.
Force-pushed from b03de47 to 9b11fcb.
sky/provision/fluidstack/instance.py
Outdated
# subprocess_utils.handle_returncode(rc,
#                                    _GET_INTERNAL_IP_CMD,
#                                    'Failed get obtain private IP from node',
#                                    stderr=stdout + stderr)
Ahh, I see. Could we add a comment here and remove this comment for handle_returncode?

# Some DCs do not have internal IPs and can fail when getting the IP. We set the `internal_ip` to the same as
# external IP. It should be fine as the `ray cluster` will also get and use that external IP in that case.
@@ -267,6 +274,7 @@ def test_minimal(generic_cloud: str):
         f'sky logs {name} 2 --status',  # Ensure the job succeeded.
     ],
     f'sky down -y {name}',
+    _get_timeout(generic_cloud),
Can we revert this _get_timeout? It seems this function is only used in tests that have the no_fluidstack mark.
_get_timeout is used in test_minimal, test_inline_env_file, test_multi_hostname, test_inline_env_file and test_file_mounts. These tests are applicable to fluidstack. Without extended timeouts, the smoke tests may fail due to the additional time required to install packages.
Oh, makes sense! I misread that. It looks good to me.
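For context, a plausible sketch of such a per-cloud timeout helper (the actual implementation and default values in tests/test_smoke.py are assumptions):

# Hypothetical sketch of the timeout helper discussed above.
_DEFAULT_TIMEOUT = 15 * 60  # 15 mins.
_FLUIDSTACK_TIMEOUT = 60 * 60  # Extra time for driver/package installation.


def _get_timeout(generic_cloud: str,
                 override_timeout: int = _DEFAULT_TIMEOUT) -> int:
    timeouts = {'fluidstack': _FLUIDSTACK_TIMEOUT}
    return timeouts.get(generic_cloud, override_timeout)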
Thanks for the great effort for supporting Fluidstack in SkyPilot @mjibril! The PR looks mostly good to me. Left some final comments and I think we should be ready to go after they are resolved.
Force-pushed from 6020f61 to cdb419f.
Force-pushed from cdb419f to fc2ae25.
* Reformat after pull from skypilot/master
* Implement multi-node support for FluidStack
* Use CUDA installation on plain distro image
* Removed `check_disk_tier_enabled` from `fluidstack.py`
Force-pushed from fc2ae25 to ca493e8.
Added FluidStack using the Provisioner interface
Added _get_timeout in tests/test_smoke.py for custom cloud timeouts
Added new template variable containing usernames for FluidStack deployments in backend_utils.py:_write_cluster_config

Smoke tests run:
* test_large_job_queue
* test_minimal
* test_huggingface_glue_imdb_app

Tested (run the relevant ones):
* Code formatting: bash format.sh
* Any manual or new tests for this PR (please specify below)
* All smoke tests: pytest tests/test_smoke.py
* Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name (as above)
* Backward compatibility tests: bash tests/backward_comaptibility_tests.sh