Workers Not Visible to Each Other on Cluster #9951
Could you also provide the commit of your build?
The parallel code has seen some changes in the recent past, so it would be good to know how old a build you are running. Also note that workers with higher pids connect to workers with lower pids (except for pid 1, which initiates connections to all workers), and the entire mesh setup may take some time to complete. Do you see the same issue if you run the print statements on the workers after some time?
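A minimal sketch of such a delayed re-check, driven from the master process, might look like the following (modern `Distributed` syntax shown; the 0.4-dev builds of that era passed the pid as the first argument to `remotecall_fetch`):

```julia
using Distributed   # stdlib import needed on Julia ≥ 0.7

sleep(120)  # illustrative delay: give the all-to-all mesh time to finish
for p in workers()
    # ask each worker which peers it currently knows about
    println(p, " => ", remotecall_fetch(() -> sort(workers()), p))
end
```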
Thanks for the quick reply. I added `sleep(300)` at the end of `createProcs2()` and collected the output again. It appears the workers on the first node are indeed affected.
I confirm that I see the issue on my laptop too. Will look into it.
Thanks for looking into it; let me know what you find.
fix bug in worker-to-worker connection setup. closes #9951
When running Julia on a cluster I encountered a problem with some workers not being visible to others. I created a function to read a list of hostnames from a file and pass them to `addprocs()`. This completes successfully, but inspecting the output of `workers()` shows that some workers do not know about others.
Here is the function code:
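The listing itself is not reproduced above; a minimal reconstruction consistent with the description (the name `createProcs2` comes from the report, but the signature and file format are assumptions) would be:

```julia
using Distributed   # stdlib import needed on Julia ≥ 0.7

# Assumed shape: one hostname per line in `hostfile`; a hostname repeated
# n times starts n workers on that node via the built-in SSH manager.
function createProcs2(hostfile::AbstractString)
    hosts = map(chomp, readlines(hostfile))
    addprocs(hosts)
end
```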
Then on each worker I run
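(The statement itself is not shown above; one plausible form, printing each worker's own pid and its list of known peers, would be:)

```julia
# each worker reports the set of workers it knows about
@everywhere println(myid(), " => ", sort(workers()))
```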
The first 16 workers (corresponding to the first node) all show the expected output:
But some (not all) of the workers on other nodes show output like this:
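Purely for illustration (the actual output is not reproduced here), with hypothetical pids 2–65 spread across four 16-worker nodes, the contrast would look roughly like:

```
# a worker on the first node: complete peer list
 5 => [2, 3, 4, ..., 64, 65]

# a worker on a later node: a block of pids missing
35 => [2, 3, ..., 17, 34, 35, ..., 65]
```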
Which workers cannot see which other workers varies from run to run, but the workers on the first node are consistently able to see all the other workers. I am running Julia version 0.4.0-dev+2698 on an Intel x86_64 cluster running Red Hat Enterprise Linux, with SLURM as the job manager.
I tried setting `max_parallel` to 1, to 10, and to the total number of workers, to no avail. Any ideas why this is happening?
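(`max_parallel` here is the `addprocs` keyword that caps how many workers are connected to in parallel on a host; illustratively:)

```julia
addprocs(hosts; max_parallel=10)   # values 1, 10, and the worker count were tried
```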