Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding VMware platforms support such as vSphere to Ray Autoscaler #37815

Merged
merged 22 commits into from
Aug 16, 2023

Conversation

Shubhamurkade
Copy link
Contributor

Why are these changes needed?

Add support for ray to run on vSphere

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Shubham Urkade <surkade@vmware.com>
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Shubhamurkade, thanks for the work you've put into the PR. The PR is quite large and that makes it hard to review thoroughly. It would be really helpful if you could split the PR into a series of smaller PRs. Do you see a good way to split it up?

Comment on lines +54 to +59
.. tab-item:: vSphere

.. code-block:: shell

$ pip install vsphere-automation-sdk-python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.. tab-item:: vSphere
.. code-block:: shell
$ pip install vsphere-automation-sdk-python
.. tab-item:: vSphere (Experimental)
.. code-block:: shell
$ pip install -U "ray[default]" vsphere-automation-sdk-python

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would opt for labeling this as experimental for now until the feature sees more users and becomes more stable.

$ export VSPHERE_SERVER = 192.168.0.1 # Enter your vSphere IP
$ export VSPHERE_USER = user # Enter your user name
$ export VSPHERE_PASSWORD = password # Enter your password

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to link to relevant vSphere documentation here, similar to how we link to GCP docs for credentials above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @architkulkarni There's no standard way to provide credentials to vSphere. Our code just expects the user to provide these credentials as env variables.

@Shubhamurkade
Copy link
Contributor Author

Hi @Shubhamurkade, thanks for the work you've put into the PR. The PR is quite large and that makes it hard to review thoroughly. It would be really helpful if you could split the PR into a series of smaller PRs. Do you see a good way to split it up?

Hi @architkulkarni The majority of the changes are in python/ray/autoscaler/_private/vsphere directory. These are the minimum subset required to achieve the ray on vSphere support. I could do the following:

  1. Move documentation update to another PR.
  2. Move python/ray/autoscaler/vsphere/defaults.yaml to another PR.
  3. Write a detailed documentation which can explain the code flow in python/ray/autoscaler/_private/vsphere directory.

Please let me know if you would need more support.

@architkulkarni
Copy link
Contributor

architkulkarni commented Jul 27, 2023

I see. In that case, I like your idea 3--would you mind adding some documentation, perhaps in an ARCHITECTURE.md file in the python/ray/autoscaler/_private/vsphere directory? You could emphasize the vSphere-specific parts which might not be as familiar to Ray developers. I didn't find any similar documentation for the other cloud cluster launchers, but here are some examples from other parts of the code in case they're helpful: https://github.com/ray-project/ray/blob/master/python/ray/util/client/ARCHITECTURE.md, https://github.com/ray-project/ray/blob/master/python/ray/runtime_env/ARCHITECTURE.md

No need to split the docs and sample YAML into another PR. I'll review this PR piece by piece but may not have time to review all of it at once.

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed up to create_node in node_provider.py, only minor comments so far.

doc/source/cluster/vms/getting-started.rst Show resolved Hide resolved
python/ray/autoscaler/_private/command_runner.py Outdated Show resolved Hide resolved
doc/source/cluster/vms/getting-started.rst Show resolved Hide resolved
vsphere_credentials["user"],
vsphere_credentials["password"],
VmwSdkClient.SessionType.UNVERIFIED,
VmwSdkClient.ClientType.PYVMOMI_SDK,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between automation and pyvmomi?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

automation sdk uses REST interface & Pyvmomi is supported by VMware SOAP APIs.
We are currently using APIs from both. We plan to move to either of them once we identify the necessary support.

python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
python/ray/autoscaler/_private/vsphere/node_provider.py Outdated Show resolved Hide resolved
@architkulkarni
Copy link
Contributor

Thanks for adding the doc, will continue the review shortly!

@JingChen23
Copy link
Contributor

Oh I think you need to add the test to https://github.com/ray-project/ray/blob/master/python/ray/tests/BUILD#L258 for it to run in our CI, would you mind adding it?

Added, thanks for pointing that out.

@krfricke
Copy link
Contributor

Attention: External code changed

This PR changes code that is used or cited in external sources, e.g. blog posts.

Before merging this PR, please make sure that the code in the external sources is still working, and consider updating them to reflect the changes.

The affected files and the external sources are:

* `doc/external/pytorch_tutorials_hyperparameter_tuning_tutorial.py`: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html

It looks like this got triggered from the force-push. It doesn't look like the file was actually changed, so from an external code perspective, we can go ahead.

@architkulkarni
Copy link
Contributor

architkulkarni commented Aug 14, 2023

Any ideas about this build failure? Not sure what could be causing it off the top of my head. https://buildkite.com/ray-project/oss-ci-build-pr/builds/31726#0189f389-1c82-452a-a638-1c0b7a963aec/5460-5485

Happens to several (all?) python tests: https://buildkite.com/ray-project/oss-ci-build-pr/builds/31726#0189f389-1c6b-4ef2-98e4-a1a14c9876e8/1741-1781

@architkulkarni
Copy link
Contributor

@scv119 @rickyyx do either of you plan to review this PR? It needs codeowner approval from ray-core.

py_test(
name = "test_vsphere_node_provider",
size = "small",
srcs = ["gcp/test_vsphere_node_provider.py"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
srcs = ["gcp/test_vsphere_node_provider.py"],
srcs = ["vsphere/test_vsphere_node_provider.py"],

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is causing the build failure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh gosh, will fix this right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fxied in the latest commit, thanks for helping us do the troubleshooting!

Signed-off-by: Chen Jing <jingch@vmware.com>
@JingChen23
Copy link
Contributor

Attention: External code changed

This PR changes code that is used or cited in external sources, e.g. blog posts.
Before merging this PR, please make sure that the code in the external sources is still working, and consider updating them to reflect the changes.
The affected files and the external sources are:

* `doc/external/pytorch_tutorials_hyperparameter_tuning_tutorial.py`: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html

It looks like this got triggered from the force-push. It doesn't look like the file was actually changed, so from an external code perspective, we can go ahead.

Yes, we used to have one PR tried to fixed the DCO issue, but we failed, so did a force push to revert, to avoid messing up the commits.

Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
@Shubhamurkade Shubhamurkade changed the title Support Ray on vSphere Adding VMware platforms support such as vSphere to Ray Autoscaler Aug 16, 2023
@architkulkarni
Copy link
Contributor

Failed tests:

  • windows://python/ray/tests:test_autoscaler_e2e flaky on https://flaky-tests.ray.io/ and unrelated
  • //rllib:TestLearnerGroupAsyncUpdate unrelated
  • //rllib:test_memory_leak_ppo unrelated
  • //python/ray/tune:test_trial_scheduler unrelated
  • //doc:external/pytorch_tutorials_hyperparameter_tuning_tutorial unrelated
  • //python/ray/tests:test_client_builder unrelated
  • //python/ray/tests:test_out_of_disk_space unrelated

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 16, 2023
@architkulkarni architkulkarni merged commit 4428a3f into ray-project:master Aug 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…y-project#37815)

---------

Signed-off-by: Shubham Urkade <surkade@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
architkulkarni pushed a commit that referenced this pull request Sep 7, 2023
Similar to other providers, we add example-minimal.yaml and example-full.yaml to vSphere autoscaler. And we add and refine vSphere related references in the Getting Started guide as well as the cluster configuration reference page, based on the newly added examples.

Why are these changes needed?
In PR #37815 we've added vSphere platform support to Ray Autoscaler. However, the related documents are not sufficient. This follow-up change adds related examples similar to other platforms. The related documents including the getting-started guide as well as the cluster configuration reference also need to be updated to include descriptions specific for vSphere.

We will do another follow-up PR to add a "Launching Ray Clusters on vSphere" user guide at https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/index.html


Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request Sep 7, 2023
…project#39379)

Similar to other providers, we add example-minimal.yaml and example-full.yaml to vSphere autoscaler. And we add and refine vSphere related references in the Getting Started guide as well as the cluster configuration reference page, based on the newly added examples.

Why are these changes needed?
In PR ray-project#37815 we've added vSphere platform support to Ray Autoscaler. However, the related documents are not sufficient. This follow-up change adds related examples similar to other platforms. The related documents including the getting-started guide as well as the cluster configuration reference also need to be updated to include descriptions specific for vSphere.

We will do another follow-up PR to add a "Launching Ray Clusters on vSphere" user guide at https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/index.html


Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
GeneDer pushed a commit that referenced this pull request Sep 8, 2023
…) (#39399)

Similar to other providers, we add example-minimal.yaml and example-full.yaml to vSphere autoscaler. And we add and refine vSphere related references in the Getting Started guide as well as the cluster configuration reference page, based on the newly added examples.

Why are these changes needed?
In PR #37815 we've added vSphere platform support to Ray Autoscaler. However, the related documents are not sufficient. This follow-up change adds related examples similar to other platforms. The related documents including the getting-started guide as well as the cluster configuration reference also need to be updated to include descriptions specific for vSphere.

We will do another follow-up PR to add a "Launching Ray Clusters on vSphere" user guide at https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/index.html

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Co-authored-by: Fangchi Wang <wfangchi@vmware.com>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…project#39379)

Similar to other providers, we add example-minimal.yaml and example-full.yaml to vSphere autoscaler. And we add and refine vSphere related references in the Getting Started guide as well as the cluster configuration reference page, based on the newly added examples.

Why are these changes needed?
In PR ray-project#37815 we've added vSphere platform support to Ray Autoscaler. However, the related documents are not sufficient. This follow-up change adds related examples similar to other platforms. The related documents including the getting-started guide as well as the cluster configuration reference also need to be updated to include descriptions specific for vSphere.

We will do another follow-up PR to add a "Launching Ray Clusters on vSphere" user guide at https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/index.html

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…y-project#37815)

---------

Signed-off-by: Shubham Urkade <surkade@vmware.com>
Signed-off-by: Chen Hui <huchen@vmware.com>
Signed-off-by: Chen Jing <jingch@vmware.com>
Co-authored-by: Chen Hui <huchen@vmware.com>
Co-authored-by: Chen Jing <jingch@vmware.com>
Signed-off-by: Victor <vctr.y.m@example.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…project#39379)

Similar to other providers, we add example-minimal.yaml and example-full.yaml to vSphere autoscaler. And we add and refine vSphere related references in the Getting Started guide as well as the cluster configuration reference page, based on the newly added examples.

Why are these changes needed?
In PR ray-project#37815 we've added vSphere platform support to Ray Autoscaler. However, the related documents are not sufficient. This follow-up change adds related examples similar to other platforms. The related documents including the getting-started guide as well as the cluster configuration reference also need to be updated to include descriptions specific for vSphere.

We will do another follow-up PR to add a "Launching Ray Clusters on vSphere" user guide at https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/index.html

Signed-off-by: Fangchi Wang <wfangchi@vmware.com>
Signed-off-by: Victor <vctr.y.m@example.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
external-code-affected tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants