Unable to initialize cluster #31

Open
dugar-tarun opened this issue Feb 17, 2023 · 2 comments

Comments

@dugar-tarun

I am unable to initialize a Ray cluster using ray-on-aml version 0.2.4. I'm running a notebook in the Python 3.8 AzureML environment with the following code:

from ray_on_aml.core import Ray_On_AML

ray_on_aml = Ray_On_AML(ws=ws, compute_cluster="CC-RayWorker-CPU-DS12-v2")

# May take 7 minutes or longer. Check the AML run under ray_on_aml experiment for cluster status.
ray = ray_on_aml.getRay(ci_is_head=True, num_node=2, pip_packages=["ray[air]==2.2.0", "ray[data]==2.2.0", "torch==1.13.0", "fastparquet==2022.12.0", "azureml-mlflow==1.48.0", "pyarrow==6.0.1", "dask==2022.12.0", "adlfs==2022.11.2", "fsspec==2022.11.0"])

While the compute instance initializes successfully, the ray_on_aml job fails in the cluster with the following error:

Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 0.2714250087738037 seconds
Traceback (most recent call last):
  File "source_file.py", line 175, in <module>
    startRayMaster()
  File "source_file.py", line 103, in startRayMaster
    ip = socket.gethostbyname(socket.gethostname())
socket.gaierror: [Errno -2] Name or service not known

Retrying due to transient client side error HTTPSConnectionPool(host='westus-0.in.applicationinsights.azure.com', port=443): Max retries exceeded with url: /v2.1/track (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1ee8697220>: Failed to establish a new connection: [Errno -2] Name or service not known')).
2023-02-16 13:21:17,476	INFO usage_lib.py:516 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-02-16 13:21:17,476	INFO scripts.py:702 -- Local node IP: 10.62.79.24
2023-02-16 13:21:19,380	SUCC scripts.py:739 -- --------------------
2023-02-16 13:21:19,380	SUCC scripts.py:740 -- Ray runtime started.
2023-02-16 13:21:19,380	SUCC scripts.py:741 -- --------------------
2023-02-16 13:21:19,380	INFO scripts.py:743 -- Next steps
2023-02-16 13:21:19,381	INFO scripts.py:744 -- To connect to this Ray runtime from another node, run
2023-02-16 13:21:19,381	INFO scripts.py:747 --   ray start --address='10.62.79.24:6379'
2023-02-16 13:21:19,381	INFO scripts.py:763 -- Alternatively, use the following Python code:
2023-02-16 13:21:19,381	INFO scripts.py:765 -- import ray
2023-02-16 13:21:19,381	INFO scripts.py:769 -- ray.init(address='auto')
2023-02-16 13:21:19,381	INFO scripts.py:781 -- To connect to this Ray runtime from outside of the cluster, for example to
2023-02-16 13:21:19,381	INFO scripts.py:785 -- connect to a remote cluster from your laptop directly, use the following
2023-02-16 13:21:19,381	INFO scripts.py:789 -- Python code:
2023-02-16 13:21:19,381	INFO scripts.py:791 -- import ray
2023-02-16 13:21:19,381	INFO scripts.py:792 -- ray.init(address='ray://<head_node_ip_address>:10001')
2023-02-16 13:21:19,381	INFO scripts.py:801 -- To see the status of the cluster, use
2023-02-16 13:21:19,381	INFO scripts.py:802 --   ray status
2023-02-16 13:21:19,381	INFO scripts.py:812 -- If connection fails, check your firewall settings and network configuration.
2023-02-16 13:21:19,381	INFO scripts.py:820 -- To terminate the Ray runtime, run
2023-02-16 13:21:19,381	INFO scripts.py:821 --   ray stop

The entire setup is inside a VNet, and all the compute resources were created in the same subnet. Due to certain policies, I am required to enable 'No Public IP' (npip) on my computes.

Could this be an issue due to my setup - npip or NSG? Or is it something to do with the library? Please help mitigate this.

Thank you
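
One way to narrow down whether the node's own hostname resolves at all is a small check like the one below, run on the cluster node. This is only a diagnostic sketch and not part of ray-on-aml; the function name `check_hostname_resolution` and its parameters are made up for illustration. It repeats the call from the traceback and reports the error instead of raising.

```python
import socket


def check_hostname_resolution(resolver=socket.gethostbyname, hostname=None):
    """Try to resolve this machine's hostname.

    Returns (hostname, ip, error). ip is None when resolution fails,
    error is None when it succeeds. The resolver argument is a
    hypothetical hook so the check can be exercised without DNS.
    """
    hostname = hostname or socket.gethostname()
    try:
        return hostname, resolver(hostname), None
    except socket.gaierror as exc:
        # Same exception type as in the traceback above.
        return hostname, None, str(exc)


if __name__ == "__main__":
    name, ip, err = check_hostname_resolution()
    if err is None:
        print(f"{name} resolves to {ip}")
    else:
        print(f"{name} does not resolve: {err}")
```

If the hostname does not resolve on the node, the failure is likely in the node's name resolution (DNS or /etc/hosts) rather than in Ray itself.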

@james-tn
Collaborator

Yes, I think it probably failed because of the npip policy, which may have prevented socket.gethostbyname(socket.gethostname()) from resolving the node's hostname.
We'll check the npip scenario later.
Can you try job mode in the meantime?

@dugar-tarun
Author

No luck with job mode either. It fails at the same line:
socket.gethostbyname(socket.gethostname())
with the message "Name or service not known".
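
For anyone hitting the same error, a possible workaround is to find the node's IP without resolving the hostname. The sketch below is an assumption about what might help, not the ray-on-aml implementation and not a confirmed fix for this issue; `get_local_ip` and the probe address are hypothetical. It tries the original call first, then falls back to asking the OS which local address it would use for an outbound UDP socket (no packet is sent by `connect` on a UDP socket).

```python
import socket


def get_local_ip(resolver=socket.gethostbyname, probe_addr=("10.254.254.254", 1)):
    """Return this node's IPv4 address without requiring hostname resolution.

    resolver and probe_addr are illustrative parameters; the probe address
    does not need to be reachable.
    """
    try:
        # Original approach from the traceback; works when the hostname resolves.
        return resolver(socket.gethostname())
    except socket.gaierror:
        pass

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route and local address.
        sock.connect(probe_addr)
        return sock.getsockname()[0]
    except OSError:
        # No usable route; fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(get_local_ip())
```

Whether the address returned by the fallback is the one other nodes should use depends on the VNet and subnet configuration, so treat the result as something to verify against the "Local node IP" line in the Ray log.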
