
Issues with running on Ray Client #352

Open

wctmanager opened this issue Jun 5, 2023 · 3 comments

Comments

@wctmanager

The current documentation on using RayDP with Ray Client only says: "RayDP works the same way when using ray client. However, spark driver would be on the local machine." It would be very helpful to have at least one example of how it should be configured and used, because with Ray v2.1.0 and RayDP v1.5.0 (using Azure Kubernetes Service as the backend), something as straightforward as:
ray.init(address="ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=1,
                         executor_cores=1,
                         executor_memory='500M')
df = spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look', ), ('python', )], ['word'])

Initializes Ray Client, RayDP/PySpark and creates the dataset without errors, but then

df.show()

produces an endless stream of

[Stage 0:> (0 + 0) / 1]
2023-06-04 22:18:13,383 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2023-06-04 22:18:28,382 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2023-06-04 22:18:43,382 WARN TaskSchedulerImpl [task-starvation-timer]: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

However, an approach that wraps "init_spark" in a Ray actor works fine.
Any advice or example of how to run remotely with Ray Client would be highly appreciated.
Thank you.
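For reference, the actor-based workaround mentioned above can be sketched roughly as follows. This is a minimal, hedged sketch, not a verified implementation: it assumes a reachable Ray cluster with RayDP installed, and the actor name `SparkDriver` and the connection address are illustrative.

```python
import ray
import raydp

# Sketch of the workaround: run the Spark driver inside the Ray cluster
# by wrapping init_spark in an actor, so the driver and executors share
# the cluster network instead of crossing the Ray Client boundary.
@ray.remote
class SparkDriver:
    def __init__(self):
        # The Spark driver now lives in this actor's process,
        # which runs inside the Ray cluster.
        self.spark = raydp.init_spark(app_name="RayDP Example",
                                      num_executors=1,
                                      executor_cores=1,
                                      executor_memory="500M")

    def run(self):
        df = self.spark.createDataFrame(
            [("look",), ("spark",), ("tutorial",)], ["word"])
        return df.count()

    def stop(self):
        raydp.stop_spark()

# Connect via Ray Client from the local machine (address is illustrative).
ray.init(address="ray://raycluster-kuberay-head-svc.default.svc.cluster.local:10001")
driver = SparkDriver.remote()
print(ray.get(driver.run.remote()))  # Spark work happens inside the cluster
ray.get(driver.stop.remote())
```

The key design point is that only small actor calls cross the Ray Client connection; all driver-executor traffic stays inside the cluster network.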

@kira-lin (Collaborator) commented Jun 6, 2023

Hi,
I was not able to reproduce this issue in my environment. Maybe it's due to the network. As we said in the documentation, in Ray client mode the Spark executors run in the Ray cluster, but the Spark driver runs on the local machine where the script is executed. Can that Spark driver connect to those executors? Can you inspect the java-worker-*.log files in /tmp/ray/session_latest/logs/?

#299 This issue might be related. Are you using a Mac as that local machine?

@wctmanager (Author) commented Jun 8, 2023

Thanks. Right, it is indeed a networking issue (between the local Spark driver and the remote executors). The question is rather: what are the network requirements to make driver-executor communication work (open ports, something else)? Thank you. P.S. In this particular case both the driver and the executors are in the same Kubernetes cluster, but in different pods and namespaces.

@kira-lin (Collaborator) commented Jun 9, 2023

I see. The driver node should have access to all ports on the executor nodes. I think that is enough.
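When opening all ports is not possible, standard Spark properties allow the driver-side listening ports to be pinned to fixed values, so only those need to be reachable from the executor pods. A hedged config sketch follows: `spark.driver.host`, `spark.driver.port`, and `spark.driver.blockManager.port` are standard Spark configuration keys, but the hostname and port values shown are illustrative assumptions, not values from this issue.

```python
import raydp

# Illustrative config fragment: pin the driver's listening ports so the
# executor pods only need these specific ports reachable, and advertise
# an address that is routable from inside the cluster.
spark = raydp.init_spark(
    app_name="RayDP Example",
    num_executors=1,
    executor_cores=1,
    executor_memory="500M",
    configs={
        # Hypothetical driver address, routable from the executor pods:
        "spark.driver.host": "driver-pod.default.svc.cluster.local",
        "spark.driver.port": "40000",
        "spark.driver.blockManager.port": "40001",
    },
)
```

With the ports fixed, the corresponding Kubernetes NetworkPolicy (or firewall rule) only needs to allow those two ports from the executor namespace to the driver.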
