-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: support lazy all_to_all connection setups #22814
Conversation
base/distributed/process_messages.jl
Outdated
t = @async connect_to_peer(cluster_manager, rpid, wconfig) | ||
push!(wait_tasks, t) | ||
if lazy | ||
# The constructor register the object with a global registery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registers the object with a global registry
test/distributed_exec.jl
Outdated
i==length(wlist) && continue | ||
@async remotecall_fetch(wl -> asyncmap(q->remotecall_fetch(myid, q), wl), | ||
p, wlist[i+1:end]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary blank line
base/distributed/process_messages.jl
Outdated
@@ -317,7 +317,7 @@ function handle_msg(msg::JoinPGRPMsg, header, r_stream, w_stream, version) | |||
|
|||
let rpid=rpid, wconfig=wconfig | |||
if lazy | |||
# The constructor register the object with a global registery. | |||
# The constructor registers the object with a global registery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
registry is still misspelled
test/topology.jl
Outdated
|
||
# Test for 10 random combinations | ||
wl = workers() | ||
combinations =[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combinations = []
test/topology.jl
Outdated
@test num_conns == expected_num_conns | ||
end | ||
|
||
# With lazy=false, all connections ought to be setup initially itself |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you referring to by "itself" ?
I can try this out on a cluster later and see try the effect adding 500-1000 workers. API wise, I think it would be better to have orthogonal options. Is it the plan to make |
I was planning on doing that, but decided against it for now
Having said that, a separate keyword arg allows us to enable it across all existing and any new topology types in the future if required. |
Merging this tomorrow if there are no objections. |
base/distributed/cluster.jl
Outdated
if isnull(PGRP.lazy) || nprocs() == 1 | ||
PGRP.lazy = Nullable{Bool}(params[:lazy]) | ||
elseif isclusterlazy() != params[:lazy] | ||
throw(ErrorException(string("Active workers with lazy=", isclusterlazy(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably better as an ArgumentError if this comes from a bad keyword argument
We should backport the lazy connection setup behavior onto 0.6 without introducing the Opinions? |
We don't generally backport behavior changes or new features of this size. |
The behavior change in this case would be internal. Things would work as before, the only difference being that first time worker-worker requests will include a connection setup time. This is offset by the benefits on larger clusters in terms of initial setup time as well as resource usage if the workload does not require a complete mesh network. |
Several types are being changed here in ways that would be visible and breaking for any external code that constructs them directly. |
Is there a particularly strong case as to why we would need this change on 0.6 other than that it's a better default behavior? |
The strong case is for larger clusters, say 200 workers and above. This will help in situations where a program run does not require all workers to actually communicate with all other workers, but still requires some inter-worker communications (say between neighbors). Defining a custom topology is not straightforward in the cluster manager and a lazy all_to_all setup does away with the need to do so ahead of time. However, I also realized that this will break interop between 0.6 and 0.6.x and hence is not the right thing to do. I can make a patch available if and when folks start asking for this in 0.6. People really needing this behavior will need to use a manually patched version of Julia. |
Could the backport be implemented by ClusterManagers or some other package? We need to decouple this functionality from the release and compatibility constraints of the Base language and rest of the standard library, the sooner the better. |
No. |
Not an existing package, but this didn't touch anything outside of Distributed, so if that were a package it would be fine. |
It did change core types and the messaging protocol. Will be difficult to do it externally. |
The core types and messaging protocol should be in a package. |
I am strongly in favor of backporting this to 0.6:
|
This PR does the following:
lazy=true
keyword option toaddprocs
. Default is true. Addresses part of Setup worker-worker connections lazily Distributed.jl#42 via this new keyword arg.all_to_all
connection setups.lazy
option is valid only forall_to_all
topologies. For custom topologies, all connections are setup at the time ofaddprocs
.Todo: