Workers Not Visible to Each Other on Cluster #9951

Closed
JaredCrean2 opened this issue Jan 29, 2015 · 4 comments · Fixed by #9953
Labels: domain:parallelism (Parallel or distributed computation), kind:bug (Indicates an unexpected problem or unintended behavior)

Comments

@JaredCrean2
Contributor

When running Julia on a cluster, I encountered a problem where some workers are not visible to others. I wrote a function that reads a list of hostnames from a file and passes them to addprocs(). The call completes successfully, but inspecting the output of workers() on each worker shows that some workers do not know about others.

Here is the function code:

function createProcs2(fname)
    # get hostnames from file and add them as workers

    # get hostnames from file
    f = open(fname, "r")
    hostnames = readlines(f)        # read all lines from file (vector)
    close(f)                        # release the file handle once the lines are read
    m = length(hostnames)

    # add workers
    addprocs(hostnames)

    # verify
    worker_list = workers()
    num_workers = length(worker_list)

    # print to log
    #printToMasterLog("worker list = $worker_list")
    #printToMasterLog("num_workers = $num_workers")

    return m

end     # end function createProcs2()

Then on each worker I run

known_workers = workers()
sort!(known_workers)
num_workers = length(known_workers)
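
A minimal sketch of how the same per-worker check could be gathered from the master process in one pass. It assumes a current Julia release, where these functions live in the `Distributed` standard library and `remotecall_fetch` takes the function as its first argument; the 0.4-era build in this report exposed them from Base with a different argument order. `collect_views` is a hypothetical helper name, not part of Julia.

```julia
using Distributed

# Hypothetical helper: ask every process in `pids` for its own view of
# workers() and return a Dict mapping pid => sorted list of known workers.
function collect_views(pids = workers())
    views = Dict{Int,Vector{Int}}()
    for p in pids
        # Runs workers() on process p and fetches the result back here.
        views[p] = sort(remotecall_fetch(workers, p))
    end
    return views
end

# Print one line per process, in pid order.
for (p, known) in sort(collect(collect_views()), by = first)
    println("myid() = $p, known workers = $known, num_workers = $(length(known))")
end
```

With no workers added, `workers()` returns `[1]`, so the sketch can be tried on a single process before running it on the cluster.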

The first 16 workers (corresponding to the first node) all show the expected output:

known workers = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]
num_workers = 32

But some (not all) of the workers on other nodes show output like this:

known workers = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,26,27,28,29,30,31,32,33]
num_workers = 29

Which workers cannot see which other workers varies from run to run, but the workers on the first node are consistently able to see all the other workers. I am running Julia version 0.4.0-dev+2698 on an Intel x86_64 cluster using Red Hat Enterprise Linux, with SLURM as the job management tool.

I tried setting max_parallel to 1, to 10, and to the number of workers, all to no avail. Any ideas why this is happening?

@amitmurthy
Contributor

Could you also provide the commit of your build?

julia> versioninfo()
Julia Version 0.4.0-dev+2914
Commit 4c3e03b* (2015-01-26 06:17 UTC)
....

The parallel code has seen some changes recently, so it would be good to know how old a build you are running.

It would also help if you could print out myid() on each of the workers.

Workers with higher pids connect to workers with lower pids (except for pid 1, which initiates connections to all workers). The entire mesh setup may take some time to complete. Do you see the same issue if you run the print statements on workers after some time?
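
To make the connection rule above concrete, here is a small sketch that enumerates who initiates each connection for a given set of worker pids. It is only a model of the rule as stated in this comment, not the code in Base, and `initiated_connections` is a hypothetical name.

```julia
# Model of the connection rule described above, not the actual Base code:
# pid 1 initiates a connection to every worker, and each worker initiates
# connections to all workers with a lower pid (pid 1 excluded).
function initiated_connections(worker_pids)
    conns = Tuple{Int,Int}[]          # (initiator, target) pairs
    for w in worker_pids
        push!(conns, (1, w))          # master -> worker
    end
    for hi in worker_pids, lo in worker_pids
        if lo < hi
            push!(conns, (hi, lo))    # higher pid -> lower pid
        end
    end
    return conns
end

conns = initiated_connections(2:33)      # the 32 workers from this report
println(length(conns))                   # 32 + 32*31/2 = 528 connections
# Worker 33 has to reach every other worker itself:
println(count(c -> c[1] == 33, conns))   # 31
```

Under this model the highest-numbered workers initiate the most connections, so they would be the last to see a complete mesh.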

@JaredCrean2
Contributor Author

Thanks for the quick reply.

I added sleep(300) at the end of createProcs2(), and then collected the following output:

Julia Version 0.4.0-dev+2698
Commit 0f9b0c6* (2015-01-14 14:36 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
nothing
        From worker 3:  myid() = 3,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 32
        From worker 33: myid() = 33,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
        From worker 21: myid() = 21,  worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
        From worker 24: myid() = 24,  worker_list = [2,3,19,20,21,22,23,24,28,29,30,31,32,33] , num_workers = 14
        From worker 25: myid() = 25,  worker_list = [2,3,19,20,21,22,25,27,28,29,30,31,32,33] , num_workers = 14
        From worker 2:  myid() = 2,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 32
        From worker 14: myid() = 14,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 16: myid() = 16,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 8:  myid() = 8,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 18: myid() = 18,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 17: myid() = 17,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 11: myid() = 11,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 5:  myid() = 5,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 4:  myid() = 4,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 13: myid() = 13,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 7:  myid() = 7,  worker_list = [2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 16
        From worker 9:  myid() = 9,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 6:  myid() = 6,  worker_list = [2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 16
        From worker 10: myid() = 10,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 28: myid() = 28,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
        From worker 26: myid() = 26,  worker_list = [2,3,19,20,21,22,26,27,28,29,30,31,32,33] , num_workers = 14
        From worker 27: myid() = 27,  worker_list = [2,3,19,20,21,22,25,26,27,28,29,30,31,32,33] , num_workers = 15
        From worker 19: myid() = 19,  worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
        From worker 22: myid() = 22,  worker_list = [2,3,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 14
        From worker 20: myid() = 20,  worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
        From worker 29: myid() = 29,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,32,33] , num_workers = 15
        From worker 23: myid() = 23,  worker_list = [2,3,19,20,21,22,23,24,28,29,30,31,32,33] , num_workers = 14
        From worker 31: myid() = 31,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,30,31,32,33] , num_workers = 16
        From worker 15: myid() = 15,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 12: myid() = 12,  worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
        From worker 32: myid() = 32,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
        From worker 30: myid() = 30,  worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,30,31,32,33] , num_workers = 16

It appears the workers on the first node are indeed affected.
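
One way to summarize output like the above is to compute, for each reporting worker, which of the expected pids are absent from its list. The sketch below uses three worker lists copied from the output (workers 3, 7, and 22); `missing_peers` is a hypothetical helper name.

```julia
# Worker lists copied from three lines of the output above.
views = Dict(
    3  => collect(2:33),
    7  => [2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18],
    22 => [2,3,22,23,24,25,26,27,28,29,30,31,32,33],
)

expected = 2:33   # the 32 workers started by addprocs()

# For each reporting worker, list the expected pids it does not know about.
missing_peers(views, expected) =
    Dict(p => setdiff(expected, known) for (p, known) in views)

gaps = missing_peers(views, expected)
for p in sort(collect(keys(gaps)))
    println("worker $p is missing $(length(gaps[p])) peers: $(gaps[p])")
end
```

For these three lists, worker 3 is missing no peers, worker 7 is missing worker 6 and workers 19 through 33, and worker 22 is missing workers 4 through 21.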

@ViralBShah added the domain:parallelism (Parallel or distributed computation) label Jan 29, 2015
@amitmurthy
Contributor

I confirm that I see the issue on my laptop too. Will look into it.

@amitmurthy added the kind:bug (Indicates an unexpected problem or unintended behavior) label Jan 29, 2015
@JaredCrean2
Contributor Author

Thanks for looking into it. Let me know what you find.
