Scheduling error in ray multi-machine cluster mode #24

Open
flymysql opened this issue Mar 4, 2025 · 3 comments
Comments


flymysql commented Mar 4, 2025

When I deploy smallpond on two machines and run a job from machine A, any task that gets scheduled onto machine B fails with an error saying the file path cannot be found.

I checked the file path. It is generated when the session is initialized on machine A, but the same path is also used when tasks execute on machine B. The initial data path on machine B should be different from the one on machine A.

Image

Collaborator

wangrunji0408 commented Mar 5, 2025

You should set a data_root which is accessible to both A and B.

sp = smallpond.init(data_root="shared/path")

In your case it is not set, and the default value is in your home path.
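
A minimal sketch of how one might sanity-check the directory before handing it to `smallpond.init`. Only the `data_root` argument comes from this thread; the helper `ensure_usable_data_root` and the `SHARED_DATA_ROOT` environment variable are hypothetical names used for illustration. The check only proves the path is writable on the machine that runs it, not that other nodes can see it.

```python
import os
import tempfile
import uuid


def ensure_usable_data_root(path: str) -> str:
    """Return the absolute path if it exists and is writable here, else raise."""
    root = os.path.abspath(path)
    if not os.path.isdir(root):
        raise FileNotFoundError(f"data_root does not exist on this machine: {root}")
    # Write and delete a small probe file to confirm write access.
    probe = os.path.join(root, f".probe-{uuid.uuid4().hex}")
    with open(probe, "w") as f:
        f.write("ok")
    os.remove(probe)
    return root


if __name__ == "__main__":
    # SHARED_DATA_ROOT is a hypothetical variable; the temp dir fallback is
    # only for a local dry run and is NOT shared between machines.
    data_root = ensure_usable_data_root(
        os.environ.get("SHARED_DATA_ROOT", tempfile.gettempdir())
    )
    print("data_root looks usable:", data_root)
    # With smallpond installed, the checked path would then be passed on:
    # import smallpond
    # sp = smallpond.init(data_root=data_root)
```

Running the same script on machine B with the same path is a quick way to see whether B has that directory at all.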

Author

flymysql commented Mar 6, 2025

> You should set a data_root which is accessible to both A and B.
>
> sp = smallpond.init(data_root="shared/path")
>
> In your case it is not set, and the default value is in your home path.

Well, I have solved this problem, but it seems that data_root needs to point to a directory on a FUSE mount of 3FS or HDFS. That way, the contents of data_root created when a session is initialized are also visible on the other machine nodes.

In fact, the other Ray nodes do not create smallpond's data_root directory themselves, so they have to rely on a distributed file system (3FS or another one) to see it.
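
One way to confirm this before starting a session is to write a marker file into data_root on the driver and ask every Ray node whether it can see it. This is a rough sketch, not part of smallpond: `write_marker`, `marker_visible`, and `check_all_nodes` are hypothetical helpers, and it assumes a Ray cluster is already running and reachable with `address="auto"`.

```python
import os
import uuid


def write_marker(data_root: str) -> str:
    """Create a uniquely named marker file under data_root and return its name."""
    name = f".visibility-check-{uuid.uuid4().hex}"
    with open(os.path.join(data_root, name), "w") as f:
        f.write("marker")
    return name


def marker_visible(data_root: str, name: str) -> bool:
    """Report whether the marker file can be seen from the calling machine."""
    return os.path.isfile(os.path.join(data_root, name))


def check_all_nodes(data_root: str) -> dict:
    """Run marker_visible once on every live Ray node; returns {node_id: bool}."""
    # Imported lazily so the helpers above work without Ray installed.
    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init(address="auto", ignore_reinit_error=True)
    name = write_marker(data_root)
    probe = ray.remote(marker_visible)
    try:
        refs = {}
        for node in ray.nodes():
            if not node["Alive"]:
                continue
            # Pin the probe task to this specific node.
            strategy = NodeAffinitySchedulingStrategy(node_id=node["NodeID"], soft=False)
            refs[node["NodeID"]] = probe.options(
                scheduling_strategy=strategy
            ).remote(data_root, name)
        return {node_id: ray.get(ref) for node_id, ref in refs.items()}
    finally:
        # Clean up the marker written by the driver.
        os.remove(os.path.join(data_root, name))
```

If any node reports `False` for the chosen path, tasks scheduled there would be expected to hit the same "file path cannot be found" error described above.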


miao404 commented Mar 6, 2025

> > You should set a data_root which is accessible to both A and B.
> >
> > sp = smallpond.init(data_root="shared/path")
> >
> > In your case it is not set, and the default value is in your home path.
>
> Well, I have solved this problem, but it seems that data_root needs to point to a directory on a FUSE mount of 3FS or HDFS. That way, the contents of data_root created when a session is initialized are also visible on the other machine nodes.
>
> In fact, the other Ray nodes do not create smallpond's data_root directory themselves, so they have to rely on a distributed file system (3FS or another one) to see it.

Hello, I have the same problem. How did you solve it?
