Workers Not Visible to Each Other on Cluster #9951
Could you also provide the commit of your build?
The parallel code has seen some changes in the recent past, so it would be good to know how old a build you are running. Also note that workers with higher pids connect to workers with lower pids (except for pid 1, which initiates connections to all workers), and the entire mesh setup may take some time to complete. Do you see the same issue if you run the print statements on the workers after some time?
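A minimal sketch of such a delayed re-check, driven from the master process, might look like the following (modern `Distributed` syntax shown; the 0.4-dev builds of that era passed the pid as the first argument to `remotecall_fetch`):

```julia
using Distributed   # stdlib import needed on Julia ≥ 0.7

sleep(120)  # illustrative delay: give the all-to-all mesh time to finish
for p in workers()
    # ask each worker which peers it currently knows about
    println(p, " => ", remotecall_fetch(() -> sort(workers()), p))
end
```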
Thanks for the quick reply. I added `sleep(300)` at the end of `createProcs2()` and collected the output again. It appears the workers on the first node are indeed affected.
I confirm that I see the issue on my laptop too. Will look into it.
Thanks for looking into it; let me know what you find.
fix bug in worker-to-worker connection setup. closes #9951
When running Julia on a cluster I encountered a problem with some workers not being visible to others. I created a function to read a list of hostnames from a file and pass them to `addprocs()`. This completes successfully, but inspecting the output of `workers()` shows that some workers do not know about others.
Here is the function code:
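The listing itself is not reproduced above; a minimal reconstruction consistent with the description (the name `createProcs2` comes from the report, but the signature and file format are assumptions) would be:

```julia
using Distributed   # stdlib import needed on Julia ≥ 0.7

# Assumed shape: one hostname per line in `hostfile`; a hostname repeated
# n times starts n workers on that node via the built-in SSH manager.
function createProcs2(hostfile::AbstractString)
    hosts = map(chomp, readlines(hostfile))
    addprocs(hosts)
end
```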
Then on each worker I run
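(The statement itself is not shown above; one plausible form, printing each worker's own pid and its list of known peers, would be:)

```julia
# each worker reports the set of workers it knows about
@everywhere println(myid(), " => ", sort(workers()))
```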
The first 16 workers (corresponding to the first node) all show the expected output:
But some (not all) of the workers on other nodes show output like this:
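Purely for illustration (the actual output is not reproduced here), with hypothetical pids 2–65 spread across four 16-worker nodes, the contrast would look roughly like:

```
# a worker on the first node: complete peer list
 5 => [2, 3, 4, ..., 64, 65]

# a worker on a later node: a block of pids missing
35 => [2, 3, ..., 17, 34, 35, ..., 65]
```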
Which workers cannot see which other workers varies from run to run, but the workers on the first node are consistently able to see all the other workers. I am running Julia version 0.4.0-dev+2698 on an Intel x86_64 cluster running Red Hat Enterprise Linux, with SLURM as the job manager.
I tried setting `max_parallel` to 1, to 10, and to the total number of workers, to no avail. Any ideas why this is happening?
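(`max_parallel` here is the `addprocs` keyword that caps how many workers are connected to in parallel on a host; illustratively:)

```julia
addprocs(hosts; max_parallel=10)   # values 1, 10, and the worker count were tried
```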