rework of cluster manager. support additional launches via a worker #9309

amitmurthy · 2014-12-11T04:40:46Z

This PR does the following:

simplifies the cluster manager interface
starts additional workers at a host from the first worker launched there. This both speeds up launches and removes the need for a separate ssh connection for each worker launch.
has the framework in place to support custom transport mechanisms at a later stage
addprocs can accept an array of hosts or tuples, each tuple of the form (host, count). count can be "auto" or :auto, in which case it launches as many workers as there are cores on host. The machinefile also supports the syntax auto * foo@bar.com for a machine definition.
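As a rough illustration of the (host, count) form described above, the sketch below turns machinefile-style entries into tuples. `parse_machine_entry` is a hypothetical helper written for this example in current Julia syntax; it is not the parser used in Base, and it ignores ports and bind addresses.

```julia
# Hypothetical helper, not part of Base: split a machinefile-style entry of
# the form "[count*]host" into a (host, count) tuple. A missing count means 1,
# and the literal count "auto" is mapped to the symbol :auto.
function parse_machine_entry(entry::AbstractString)
    parts = split(entry, '*'; limit=2)
    if length(parts) == 1
        return (String(strip(parts[1])), 1)
    end
    count_str = strip(parts[1])
    host = String(strip(parts[2]))
    count = count_str == "auto" ? :auto : parse(Int, count_str)
    return (host, count)
end

# Example entries; the host names are made up.
specs = [parse_machine_entry(e) for e in ["auto * foo@bar.com", "4*node2", "node3"]]
```

Under the interface described in this PR, a vector of such tuples is the kind of value that would be passed to addprocs.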

Consequently, it supersedes #9202 and #9046

amitmurthy · 2014-12-11T04:44:02Z

base/multi.jl

@@ -79,52 +79,28 @@ end

 abstract ClusterManager

-type Worker
-    host::ByteString
-    port::UInt16


host/port information is now maintained by the respective cluster managers and the "config" dict

amitmurthy · 2014-12-11T04:48:27Z

cc @JeffBezanson

amitmurthy · 2014-12-12T07:02:22Z

I would really like to go ahead and merge this. @JeffBezanson any comments?

ViralBShah · 2014-12-12T13:23:57Z

Any idea why both travis and appveyor are failing?

ViralBShah · 2014-12-12T13:25:40Z

It appears that the test failures are related to this PR.

amitmurthy · 2014-12-12T13:42:24Z

Sorry about that. Fixed.

amitmurthy · 2014-12-12T14:22:42Z

AppVeyor error seems unrelated.

tkelman · 2014-12-12T23:50:36Z

Looks like a fluke, make.exe segfaulted.

JeffBezanson · 2014-12-13T18:04:57Z

base/multi.jl

        end
    end
-    w.bind_addr
+    w.config[:bind_addr]


The w.bind_addr field still exists. Is that intentional?

amitmurthy · 2014-12-13

No, an oversight. It should be removed; will do so.

JeffBezanson · 2014-12-13T18:38:35Z

Why the switch from passing ClusterManagers to passing types? That forces you to have basically a single global instance of each kind of ClusterManager; functions like manage don't have access to any state that might be stored in a ClusterManager.

amitmurthy · 2014-12-13T18:49:44Z

manage continues to have access to the ClusterManager object (and hence common state) via config[:manager].

The change came about while implementing #9046: the worker-to-worker connections will also be implemented by the cluster manager, and on the worker a ClusterManager object may not be appropriate.

If we support user defined transports in the future, custom ClusterManagers will implement their own connect_m2w and connect_w2w, with the default implementations (with tcp sockets) being defined as

connect_m2w{T<:ClusterManager}(::Type{T}, pid::Int, config::Dict) and
connect_w2w{T<:ClusterManager}(::Type{T}, pid::Int, config::Dict)

JeffBezanson · 2014-12-13T18:57:22Z

I don't get it; it seems awkward to have this hidden coupling where T has to be the type of config[:manager]. If config has all the information in it, you could call manage(config, command) etc.

It also might make sense to use types to separate which fields are mandatory, or common to all workers, from manager-specific custom fields.

amitmurthy · 2014-12-13T19:11:13Z

No, we need the type to dispatch to the correct implementation of manage. Each ClusterManager implements its own manage.

config does have all the information about the worker - some fields required by Base and any other worker specific key-value pairs that the ClusterManager may add.

My current thinking is to keep it this way and just document these fields in more detail. The config dict is in effect the interface between code in multi.jl for worker management and ClusterManager implementations.

For example, the launch method for SSHManager adds fields :host, :port, :bind_addr, :count to each worker's config that it launches. This information is used for the subsequent connection setup and launch of additional workers. An MPI cluster manager using MPI for transport may not add :host, :port, :bind_addr, but add a :mpi_id and use that in its own connect_m2w{T<:MPIClusterManager}(::Type{T}, ... implementation.
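A minimal sketch of the type-based dispatch described above, written in current Julia syntax (`where {T<:...}` replaces the `f{T<:...}` form used in this thread). The empty `SSHManager` stand-in, the `MPIClusterManager` type, and the string return values are assumptions made for illustration; a real implementation would open and return connections.

```julia
abstract type ClusterManager end                 # `abstract ClusterManager` in 2014 syntax

struct SSHManager <: ClusterManager end          # stand-in for the real manager
struct MPIClusterManager <: ClusterManager end   # hypothetical MPI-based manager

# Default method: any manager type falls back to a TCP-style connection that
# reads :host and :port from the worker's config dict.
connect_m2w(::Type{T}, pid::Int, config::Dict) where {T<:ClusterManager} =
    "tcp://$(config[:host]):$(config[:port])"

# More specific method: the MPI manager ignores :host/:port and uses its own key.
connect_m2w(::Type{T}, pid::Int, config::Dict) where {T<:MPIClusterManager} =
    "mpi://rank-$(config[:mpi_id])"

ssh_config = Dict{Symbol,Any}(:host => "node1", :port => 9009,
                              :bind_addr => "10.0.0.1", :count => 4)
mpi_config = Dict{Symbol,Any}(:mpi_id => 3)

ssh_target = connect_m2w(SSHManager, 2, ssh_config)
mpi_target = connect_m2w(MPIClusterManager, 3, mpi_config)
```

The config dict carries whatever keys the manager's launch step stored, and the type argument selects which method reads them.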

JeffBezanson · 2014-12-13T19:16:30Z

Ok, we can keep the Dict.

My thinking was that you could write

manage(config::Dict, command) = manage(config[:manager], config, command)

and then we would have normal dispatch on instances. It's very odd to have the object you're actually dispatching on hidden inside a dictionary.
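The forwarding method above can be tried out in isolation. The sketch below uses current Julia syntax; `CountingManager`, its `events` field, and the command symbols are hypothetical and exist only to show that dispatching on the instance gives `manage` access to per-manager state.

```julia
abstract type ClusterManager end

# Hypothetical manager that keeps state in the instance.
mutable struct CountingManager <: ClusterManager
    events::Vector{Symbol}
end

# Each manager implements its own three-argument `manage`, dispatched on the instance.
function manage(mgr::CountingManager, config::Dict, command::Symbol)
    push!(mgr.events, command)   # record the command in the manager's own state
    return length(mgr.events)
end

# Forwarding method from the comment above: look up the manager stored in the
# dict, then dispatch normally on that instance.
manage(config::Dict, command) = manage(config[:manager], config, command)

mgr = CountingManager(Symbol[])
config = Dict{Symbol,Any}(:manager => mgr, :host => "localhost")
manage(config, :register)
manage(config, :deregister)
```

After the two calls, `mgr.events` holds both commands, so the manager's state was reachable without a global instance per manager type.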

amitmurthy · 2014-12-13T20:02:52Z

OK. I'll revert it to use instances instead. It is more logical.

Also, on rethinking, your idea of replacing config with a type with Nullable fields is not bad at all. One of the fields can be userdata::Any, which managers can set with their own worker-specific information. Let me try that out.
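One way such a type could look is sketched below. This is an assumption-laden illustration, not the type that was eventually merged: `WorkerConfigSketch` and its field list are hypothetical, and since `Nullable` no longer exists in current Julia, `Union{T,Nothing}` is used in its place.

```julia
# Hypothetical config type: optional fields default to `nothing`, and
# `userdata` is free-form storage for manager-specific information.
Base.@kwdef mutable struct WorkerConfigSketch
    host::Union{String,Nothing} = nothing
    port::Union{Int,Nothing} = nothing
    bind_addr::Union{String,Nothing} = nothing
    count::Union{Int,Symbol,Nothing} = nothing
    userdata::Any = nothing
end

# An ssh-style worker fills in the common fields.
ssh_worker = WorkerConfigSketch(host="node1", port=9009, count=:auto)

# An MPI-style worker leaves them unset and keeps its own data in `userdata`.
mpi_worker = WorkerConfigSketch(userdata=Dict(:mpi_id => 3))
```

Compared with a plain Dict, the field names and types are visible in one place, while `userdata` keeps the escape hatch for custom managers.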

amitmurthy · 2014-12-16T06:47:11Z

Have made changes and rebased as discussed. Will merge this in a few days if no objections are raised.

One of the AppVeyor runs failed. Seems unrelated to these changes.

ViralBShah · 2014-12-19T06:03:22Z

The improvements deserve a mention in NEWS.md.

JeffBezanson · 2014-12-19T06:12:43Z

Looks good! Great to have steps in place towards using different transports.

habemus-papadum · 2015-02-24T00:49:18Z

Hi -- I'm interested in using the updated SSHManager with the features added by this pull request. I'm currently on 0.3.6 and from what I can tell base/managers.jl is not present in this version (correct me if I am wrong).

If I update my head node to use the current master branch but leave my ssh nodes on 0.3.6, does anyone know if that is likely to work? Basically, I am looking for an efficient way to start many instances per remote box and take advantage of:

starts additional workers at a host from the first worker. Will both speed up and remove the
need for unnecessary ssh connections for each worker launch.

thanks!
(Let me know if I should move this to the issues forum...)

amitmurthy · 2015-02-24T01:08:25Z

Unfortunately, you will need 0.4 on the ssh nodes too. Backporting this to 0.3 (which is on a bugfix/maintenance schedule) is not planned - as it involves changes to the cluster manager interface itself.

habemus-papadum · 2015-02-24T02:01:05Z

Thanks for the info!

amitmurthy mentioned this pull request Dec 11, 2014

Auto launch of additional workers depending on number of cores at a host. #9202

Closed

rework of cluster manager. support additional launches via a worker

e309b8a

amitmurthy added a commit that referenced this pull request Dec 19, 2014

Merge pull request #9309 from amitmurthy/amitm/cman_rework

cc0cbfe

amitmurthy merged commit cc0cbfe into JuliaLang:master Dec 19, 2014

ViralBShah added the parallelism Parallel or distributed computation label Dec 19, 2014

amitmurthy mentioned this pull request Dec 21, 2014

user defined transports #9434

Merged
