
Adding support for VMware platforms such as vSphere to the Ray Autoscaler #37815

Merged: 22 commits, merged Aug 16, 2023

Commits
322a15e
Support Ray on vSphere
Shubhamurkade Jul 26, 2023
2f2ebf6
update the test-script to support test ray-on-vsphere
Aug 1, 2023
39b0fa2
Follow a consistent programming style
Aug 2, 2023
aa48046
make the code consistent with our pr to Ray and redo lint
JingChen23 Aug 4, 2023
aebaac1
Merge pull request #1 from huchen2021/support-test-script-vsphere
Shubhamurkade Aug 4, 2023
bee3086
Merge pull request #2 from JingChen23/address-pr-comments
Shubhamurkade Aug 4, 2023
8ad836b
Remove logs from non_terminated_nodes
Shubhamurkade Aug 6, 2023
06294a0
Merge upstream branch
Shubhamurkade Aug 6, 2023
6149258
Add ARCHITECTURE.md file
Shubhamurkade Aug 7, 2023
40214ea
fix issues
JingChen23 Aug 8, 2023
20b9333
Merge pull request #3 from JingChen23/fix-issues
Shubhamurkade Aug 9, 2023
f06499e
address the review comments
JingChen23 Aug 10, 2023
0454f51
Merge pull request #4 from JingChen23/address-pr-comments-2
Shubhamurkade Aug 10, 2023
4f11f74
Update architecture doc as per review comments
Shubhamurkade Aug 10, 2023
ad95eec
Optimize the unit test, add the ci, and address comments (#6)
JingChen23 Aug 14, 2023
003fed8
fix path error on the Build file
JingChen23 Aug 14, 2023
4f13ff6
add vsphere automation sdk in to the requirements for test
JingChen23 Aug 15, 2023
a8b2bc2
add vsphere automation sdk in to the requirements for test
JingChen23 Aug 15, 2023
68dff61
Bypass the windows tests and fix one little issue
JingChen23 Aug 15, 2023
4fc94a6
Merge branch 'master' into ray-vmware-support
JingChen23 Aug 16, 2023
89700de
fix the test name in BUILD to be the same in ci.sh
JingChen23 Aug 16, 2023
c80dbd6
fix the test name in ci.sh
JingChen23 Aug 16, 2023
39 changes: 39 additions & 0 deletions doc/source/cluster/vms/getting-started.rst
@@ -51,6 +51,12 @@ Before we start, you will need to install some Python dependencies as follows:

            $ pip install -U "ray[default]" google-api-python-client

    .. tab-item:: vSphere

        .. code-block:: shell

            $ pip install vsphere-automation-sdk-python

architkulkarni marked this conversation as resolved.

Comment on lines +54 to +59

Contributor, suggested change (replace the block above with):

    .. tab-item:: vSphere (Experimental)

        .. code-block:: shell

            $ pip install -U "ray[default]" vsphere-automation-sdk-python

Contributor: Would opt for labeling this as experimental for now until the feature sees more users and becomes more stable.

Next, if you're not set up to use your cloud provider from the command line, you'll have to configure your credentials:

.. tab-set::
@@ -67,6 +73,14 @@ Next, if you're not set up to use your cloud provider from the command line, you

Set the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as described in `the GCP docs <https://cloud.google.com/docs/authentication/getting-started>`_.

    .. tab-item:: vSphere

        .. code-block:: shell

            $ export VSPHERE_SERVER=192.168.0.1 # Enter your vSphere IP
            $ export VSPHERE_USER=user          # Enter your user name
            $ export VSPHERE_PASSWORD=password  # Enter your password

Contributor: Might be good to link to relevant vSphere documentation here, similar to how we link to GCP docs for credentials above.

Contributor (Author): Hi @architkulkarni, there's no standard way to provide credentials to vSphere. Our code just expects the user to provide these credentials as env variables.
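For a quick sanity check of those environment variables, here is a minimal sketch that opens a session against the configured vCenter. It is not part of this PR and assumes the client API shipped with vsphere-automation-sdk-python (create_vsphere_client and client.vcenter.VM.list()):

    import os

    import requests
    import urllib3

    # Assumed import path from vsphere-automation-sdk-python; not added by this PR.
    from vmware.vapi.vsphere.client import create_vsphere_client

    session = requests.session()
    session.verify = False  # only for lab setups with self-signed certificates
    urllib3.disable_warnings()

    # Reuse the same environment variables the autoscaler expects.
    client = create_vsphere_client(
        server=os.environ["VSPHERE_SERVER"],
        username=os.environ["VSPHERE_USER"],
        password=os.environ["VSPHERE_PASSWORD"],
        session=session,
    )

    # Listing a few VMs confirms that the credentials and server address work.
    print([vm.name for vm in client.vcenter.VM.list()][:5])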

Create a (basic) Python application
-----------------------------------

@@ -200,6 +214,31 @@ A minimal sample cluster configuration file looks as follows:
type: gcp
region: us-west1

    .. tab-item:: vSphere

        .. code-block:: yaml

            # A unique identifier for the head node and workers of this cluster.
            cluster_name: minimal

            # Cloud-provider specific configuration.
            provider:
                type: vsphere

            auth:
                ssh_user: ray # The VMs are initialised with a user called ray.

            available_node_types:
                ray.head.default:
                    node_config:
                        resource_pool: ray # Resource pool where the Ray cluster will be created
                        library_item: ray-head-debian # OVF file name from which the head node will be created

                worker:
                    node_config:
                        clone: True # If True, all the workers will be instant-cloned from a frozen VM
                        library_item: ray-frozen-debian # The OVF file from which a frozen VM will be created

architkulkarni marked this conversation as resolved.

Save this configuration file as ``config.yaml``. You can specify a lot more details in the configuration file: instance types to use, minimum and maximum number of workers to start, autoscaling strategy, files to sync, and more. For a full reference on the available configuration properties, please refer to the :ref:`cluster YAML configuration options reference <cluster-config>`.
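For readers who prefer to drive the same flow from Python rather than the CLI, a minimal sketch is shown below; it assumes the ray.autoscaler.sdk helpers create_or_update_cluster and teardown_cluster, which exist in current Ray releases and are not modified by this PR:

    # Hedged sketch: programmatic equivalent of launching and tearing down the
    # cluster described by config.yaml.
    from ray.autoscaler.sdk import create_or_update_cluster, teardown_cluster

    # Start (or update) the cluster defined above; the vSphere provider added
    # in this PR is selected via `provider.type: vsphere` in the YAML.
    create_or_update_cluster("config.yaml")

    # ... submit Ray jobs against the cluster ...

    # Tear everything down when finished.
    teardown_cluster("config.yaml")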

After defining our configuration, we will use the Ray cluster launcher to start a cluster on the cloud, creating a designated "head node" and worker nodes. To start the Ray cluster, we will use the :ref:`Ray CLI <ray-cluster-cli>`. Run the following command:
5 changes: 5 additions & 0 deletions python/ray/autoscaler/_private/command_runner.py
@@ -710,6 +710,11 @@ def run_init(
    ):
        BOOTSTRAP_MOUNTS = ["~/ray_bootstrap_config.yaml", "~/ray_bootstrap_key.pem"]

        # TODO: Maybe add another step after Setup, say Post-Setup, which will
        # help in setting up the host after the container comes up.
        if "VsphereNodeProvider" in str(type(self.ssh_command_runner.provider)):
            BOOTSTRAP_MOUNTS += ["~/ray_bootstrap_public_key.key"]

architkulkarni marked this conversation as resolved.
        specific_image = self.docker_config.get(
            f"{'head' if as_head else 'worker'}_image", self.docker_config.get("image")
        )
15 changes: 15 additions & 0 deletions python/ray/autoscaler/_private/providers.py
@@ -43,6 +43,12 @@ def _import_azure(provider_config):
    return AzureNodeProvider


def _import_vsphere(provider_config):
    from ray.autoscaler._private.vsphere.node_provider import VsphereNodeProvider

    return VsphereNodeProvider


def _import_local(provider_config):
    if "coordinator_address" in provider_config:
        from ray.autoscaler._private.local.coordinator_node_provider import (
@@ -122,6 +128,12 @@ def _load_aws_defaults_config():
    return os.path.join(os.path.dirname(ray_aws.__file__), "defaults.yaml")


def _load_vsphere_defaults_config():
    import ray.autoscaler.vsphere as ray_vsphere

    return os.path.join(os.path.dirname(ray_vsphere.__file__), "defaults.yaml")


def _load_gcp_defaults_config():
    import ray.autoscaler.gcp as ray_gcp

@@ -152,6 +164,7 @@ def _import_external(provider_config):
    "readonly": _import_readonly,
    "aws": _import_aws,
    "gcp": _import_gcp,
    "vsphere": _import_vsphere,
    "azure": _import_azure,
    "kubernetes": _import_kubernetes,
    "kuberay": _import_kuberay,
@@ -171,6 +184,7 @@ def _import_external(provider_config):
    "kuberay": "Kuberay",
    "aliyun": "Aliyun",
    "external": "External",
    "vsphere": "vSphere",
}

_DEFAULT_CONFIGS = {
@@ -181,6 +195,7 @@ def _import_external(provider_config):
    "azure": _load_azure_defaults_config,
    "aliyun": _load_aliyun_defaults_config,
    "kubernetes": _load_kubernetes_defaults_config,
    "vsphere": _load_vsphere_defaults_config,
}


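For orientation, the sketch below shows how the registrations above are typically consumed when a cluster YAML sets provider.type: vsphere. The name of the first registry dict is cut off in this view and is assumed here to be _NODE_PROVIDERS, as elsewhere in the autoscaler; the lookup flow is an illustration, not code from this PR:

    # A provider section like the one in the getting-started example:
    provider_config = {"type": "vsphere"}

    # The new "vsphere" entry maps the type to its lazy importer ...
    provider_cls = _NODE_PROVIDERS[provider_config["type"]](provider_config)
    # ... which returns VsphereNodeProvider without importing the vSphere SDK
    # for users of other providers.

    # The matching defaults loader points at ray/autoscaler/vsphere/defaults.yaml,
    # used to fill in fields the user leaves unset.
    defaults_path = _DEFAULT_CONFIGS[provider_config["type"]]()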
Empty file.
224 changes: 224 additions & 0 deletions python/ray/autoscaler/_private/vsphere/config.py
@@ -0,0 +1,224 @@
import copy
import logging
import os

from cryptography.hazmat.backends import default_backend as crypto_default_backend
from cryptography.hazmat.primitives import serialization as crypto_serialization
from cryptography.hazmat.primitives.asymmetric import rsa

from ray.autoscaler._private.event_system import CreateClusterEvent, global_event_system
from ray.autoscaler._private.util import check_legacy_fields

PRIVATE_KEY_NAME = "ray-bootstrap-key"
PRIVATE_KEY_NAME_EXTN = "{}.pem".format(PRIVATE_KEY_NAME)

PUBLIC_KEY_NAME = "ray_bootstrap_public_key"
PUBLIC_KEY_NAME_EXTN = "{}.key".format(PUBLIC_KEY_NAME)

PRIVATE_KEY_PATH = os.path.expanduser("~/{}.pem".format(PRIVATE_KEY_NAME))
PUBLIC_KEY_PATH = os.path.expanduser("~/{}.key".format(PUBLIC_KEY_NAME))

USER_DATA_FILE_PATH = os.path.join(os.path.dirname(__file__), "./data/userdata.yaml")

logger = logging.getLogger(__name__)


def bootstrap_vsphere(config):
    # Create a copy of the input config to modify.
    config = copy.deepcopy(config)

    # Fill in vSphere credentials from the environment if they are not
    # already present in the provider section.
    add_credentials_into_provider_section(config)

    # Update library item configs.
    update_vsphere_configs(config)

    # Log warnings if the user included deprecated `head_node` or `worker_nodes`
    # fields. Raise an error if there is no `available_node_types`.
    check_legacy_fields(config)

    # Create a new key pair if one doesn't exist already.
    create_key_pair()

    # Configure SSH access, using an existing key pair if possible.
    config = configure_key_pair(config)

    global_event_system.execute_callback(
        CreateClusterEvent.ssh_keypair_downloaded,
        {"ssh_key_path": config["auth"]["ssh_private_key"]},
    )

    return config


def add_credentials_into_provider_section(config):

    provider_config = config["provider"]

    # vsphere_config is an optional field, as the credentials can also be specified
    # as env variables, so this first check verifies that the field is present
    # before accessing its properties.
    if (
        "vsphere_config" in provider_config
        and "credentials" in provider_config["vsphere_config"]
    ):
        return

    env_credentials = {}
    env_credentials["server"] = os.environ["VSPHERE_SERVER"]
    env_credentials["user"] = os.environ["VSPHERE_USER"]
    env_credentials["password"] = os.environ["VSPHERE_PASSWORD"]

    provider_config["vsphere_config"] = {}
    provider_config["vsphere_config"]["credentials"] = env_credentials


def update_vsphere_configs(config):
    """Worker node_config:
    If clone is False or unspecified:
        If library_item is specified:
            Create the worker from the library_item.
        If library_item is unspecified:
            Create the worker from the head node's library_item.
    If clone is True:
        If library_item is unspecified:
            Terminate.
        If library_item is specified:
            A frozen VM is created from the library item.
            Remaining workers are created from the created frozen VM.
    """
    available_node_types = config["available_node_types"]

    # Fetch the worker: field from the YAML file.
    worker_node = available_node_types["worker"]
    worker_node_config = worker_node["node_config"]

    # Fetch the head node field name from the head_node_type field.
    head_node_type = config["head_node_type"]

    # Use the head_node_type field's value to fetch the head node field.
    head_node = available_node_types[head_node_type]
    head_node_config = head_node["node_config"]

    # A mandatory constraint enforced by Ray's YAML validator
    # is to add a resources field for both head and worker nodes.
    # For example, to specify resources for the worker the
    # user will specify it in
    #     worker:
    #         resources
    # We copy that resources field into
    #     worker:
    #         node_config:
    #             resources
    # This enables us to access the field during node creation.
    # The same happens for the head node too.
    worker_node_config["resources"] = worker_node["resources"]
    head_node_config["resources"] = head_node["resources"]

    # By default, create worker nodes in the head node's resource pool.
    worker_resource_pool = head_node_config["resource_pool"]

    # If a different resource pool is provided for worker nodes, use it.
    if "resource_pool" in worker_node_config and worker_node_config["resource_pool"]:
        worker_resource_pool = worker_node_config["resource_pool"]

    worker_node_config["resource_pool"] = worker_resource_pool

    worker_networks = None
    worker_datastore = None

    if "networks" in head_node_config and head_node_config["networks"]:
        worker_networks = head_node_config["networks"]

    if "networks" in worker_node_config and worker_node_config["networks"]:
        worker_networks = worker_node_config["networks"]

    worker_node_config["networks"] = worker_networks

    if "datastore" in head_node_config and head_node_config["datastore"]:
        worker_datastore = head_node_config["datastore"]

    if "datastore" in worker_node_config and worker_node_config["datastore"]:
        worker_datastore = worker_node_config["datastore"]

    worker_node_config["datastore"] = worker_datastore

    if "clone" in worker_node_config and worker_node_config["clone"] is True:
        if "library_item" not in worker_node_config:
            raise ValueError(
                "library_item is mandatory if clone: True is set for the worker config"
            )

        worker_library_item = worker_node_config["library_item"]

        # Create a new object with properties to be used while creating the frozen VM.
        freeze_vm = {
            "library_item": worker_library_item,
            "resource_pool": worker_resource_pool,
            "resources": worker_node_config["resources"],
            "networks": worker_networks,
            "datastore": worker_datastore,
        }

        # Add the newly created freeze_vm object to the head node config.
        head_node_config["freeze_vm"] = freeze_vm

    elif "clone" not in worker_node_config or worker_node_config["clone"] is False:
        if "library_item" not in worker_node_config:
            worker_node_config["library_item"] = head_node_config["library_item"]


def create_key_pair():

    # If the files already exist, we don't want to create new keys.
    # This if condition will currently pass even if there are invalid keys
    # at those paths. TODO: Only return if the keys are valid.
    if os.path.exists(PRIVATE_KEY_PATH) and os.path.exists(PUBLIC_KEY_PATH):
        logger.info("Key-pair already exists. Not creating new ones.")
        return

    # Generate keys
    key = rsa.generate_private_key(
        backend=crypto_default_backend(), public_exponent=65537, key_size=2048
    )

    private_key = key.private_bytes(
        crypto_serialization.Encoding.PEM,
        crypto_serialization.PrivateFormat.PKCS8,
        crypto_serialization.NoEncryption(),
    )

    public_key = key.public_key().public_bytes(
        crypto_serialization.Encoding.OpenSSH, crypto_serialization.PublicFormat.OpenSSH
    )

    with open(PRIVATE_KEY_PATH, "wb") as content_file:
        content_file.write(private_key)
        os.chmod(PRIVATE_KEY_PATH, 0o600)

    with open(PUBLIC_KEY_PATH, "wb") as content_file:
        content_file.write(public_key)

def configure_key_pair(config):

    logger.info("Configure key pairs for copying into the head node.")

    assert os.path.exists(
        PRIVATE_KEY_PATH
    ), "Private key file at path {} was not found".format(PRIVATE_KEY_PATH)

    assert os.path.exists(
        PUBLIC_KEY_PATH
    ), "Public key file at path {} was not found".format(PUBLIC_KEY_PATH)

    # updater.py uses the following config to SSH onto the head node,
    # and also copies the private key file onto the head node.
    config["auth"]["ssh_private_key"] = PRIVATE_KEY_PATH

    # The path where the public key should be copied onto the remote host.
    public_key_remote_path = "~/{}".format(PUBLIC_KEY_NAME_EXTN)

    # Copy the public key to the remote host.
    config["file_mounts"][public_key_remote_path] = PUBLIC_KEY_PATH

    return config
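As a quick end-to-end check of the clone: True branch in update_vsphere_configs, here is a small worked example (a sketch using the module path added by this PR; the node type names mirror the YAML example in getting-started.rst):

    from ray.autoscaler._private.vsphere.config import update_vsphere_configs

    config = {
        "head_node_type": "ray.head.default",
        "available_node_types": {
            "ray.head.default": {
                "resources": {},
                "node_config": {
                    "resource_pool": "ray",
                    "library_item": "ray-head-debian",
                },
            },
            "worker": {
                "resources": {},
                "node_config": {
                    "clone": True,
                    "library_item": "ray-frozen-debian",
                },
            },
        },
    }

    update_vsphere_configs(config)

    head_cfg = config["available_node_types"]["ray.head.default"]["node_config"]
    worker_cfg = config["available_node_types"]["worker"]["node_config"]

    # The worker inherits the head node's resource pool ...
    assert worker_cfg["resource_pool"] == "ray"
    # ... and the head node gains a freeze_vm spec built from the worker's
    # library item, which the provider uses to create the frozen base VM.
    assert head_cfg["freeze_vm"]["library_item"] == "ray-frozen-debian"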
10 changes: 10 additions & 0 deletions python/ray/autoscaler/_private/vsphere/data/userdata.yaml
@@ -0,0 +1,10 @@
#cloud-config
users:
  - default
  - gecos: Ray
    groups: sudo, docker
    lock_passwd: false
    name: ray
    passwd: AdminRay
    primary_group: sudo
    ssh_authorized_keys: