
use TBON not ring network for RPCs #689

Merged
grondo merged 16 commits from garlick:tbon_routing into flux-framework:master on Jun 10, 2016

Conversation

garlick
Member

@garlick garlick commented Jun 8, 2016

RPCs that target a specific nodeid were routed via the ring network, which has latency issues on larger instances. This PR restores the ability to route requests and responses in both directions on the TBON (confusing as that is to think about). This time around, the commit cf7864b that changes the broker itself is at least fairly minimal.

Routes are calculated based on a static TBON topology, since in this phase we have disabled the "self-healing" stuff. In the future, with resilience as well as grow/shrink, routes will need to be dynamic and will probably employ routing tables, or a combination of tables and calculations.

Some additional TBON parameters were exported via attributes (current level, max levels, number of descendants). The flux_get_arity() convenience function was dropped since that parameter is now one of several TBON parameters available as attributes. While users could simply calculate things like the level and number of descendants based on a static topology, I thought it best to keep the static calculations localized to the broker so that after grow/shrink/heal we won't have to track down other users.

The reduction code was updated to obtain level and max level (used to scale timeouts) from attributes rather than from calculation.
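
For illustration, here's a minimal sketch of reading these attributes through the API. It is written against present-day flux-core signatures for flux_open()/flux_attr_get(); the prototypes at the time of this PR differed slightly, so treat it as a sketch rather than code from this branch.

#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h)
        return 1;
    /* Attributes added/renamed by this PR; values come back as strings. */
    const char *arity = flux_attr_get (h, "tbon.arity");
    const char *level = flux_attr_get (h, "tbon.level");
    const char *maxlevel = flux_attr_get (h, "tbon.maxlevel");
    const char *descendants = flux_attr_get (h, "tbon.descendants");
    printf ("arity=%s level=%s maxlevel=%s descendants=%s\n",
            arity ? arity : "?", level ? level : "?",
            maxlevel ? maxlevel : "?", descendants ? descendants : "?");
    flux_close (h);
    return 0;
}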

I haven't been able to get time on opal (it is just back up after the power failure) to see if there is any impact on startup. The worst-case latencies for single pings have improved dramatically.

The ring network is no more.

@garlick garlick added the review label Jun 8, 2016
@coveralls

Coverage Status

Coverage increased (+0.05%) to 74.807% when pulling 8e9884d on garlick:tbon_routing into 71123f5 on flux-framework:master.

@garlick
Member Author

garlick commented Jun 8, 2016

This single node test is encouraging!

master:

$ time ./flux start -o,-k8 -s512 /bin/true
[1465425401.914883] broker.crit[55]: child 448 idle for 3 heartbeats
[1465425401.916110] broker.crit[62]: child 502 idle for 3 heartbeats
[1465425401.914883] broker.crit[55]: child 448 idle for 3 heartbeats
[1465425401.923607] broker.crit[56]: child 450 idle for 3 heartbeats
snip - more of the same

real    1m53.189s
user    19m27.640s
sys 2m35.976s

This PR

$ time ./flux start -o,-k8 -s512 /bin/true
[1465425698.649156] broker.crit[61]: child 489 idle for 3 heartbeats
[1465425698.649759] broker.crit[57]: child 461 idle for 3 heartbeats
[1465425698.670401] broker.crit[49]: child 396 idle for 3 heartbeats
snip - more of the same

real    0m26.795s
user    3m20.864s
sys 1m4.736s

@grondo
Contributor

grondo commented Jun 8, 2016

Whoa! Nice work! I will try to check this out tomorrow, but it sounds like a big improvement.


@garlick
Member Author

garlick commented Jun 8, 2016

Just looking at flux module list -r all times, which should be similar to commands in rc1

master

$ ./flux start -o,-k8 -s512 
$ time flux module list -r all
Module               Size    Digest  Idle  S  Nodeset
connector-local      2892024 2E356BF    0  R  [0-511]
resource-hwloc       2888824 083D0E9    0  S  [0-511]
job                  2879088 1F1D0F0   22  S  [0-511]
wrexec               2864448 7A874C9   10  S  [0-511]
content-sqlite       2881424 BE155C8    4  S  0
cron                 2982104 B12516C    0  S  0
kvs                  3046216 099185A    0  S  [0-511]
barrier              2875184 17C9197   23  S  [0-511]
mecho                2855248 1CC5385    6  S  [0-511]

real    0m15.947s
user    0m0.040s
sys 0m0.000s

This PR

$ ./flux start -o,-k8 -s512
$ time flux module list -r all
Module               Size    Digest  Idle  S  Nodeset
content-sqlite       2885448 E41075C   17  S  0
kvs                  3050256 CDC6C43    0  S  [0-511]
wrexec               2868472 C7BB285   24  S  [0-511]
resource-hwloc       2892848 3B17F56   17  S  [0-511]
connector-local      2896040 51F0F6B    0  R  [0-511]
barrier              2883304 020D47F   28  S  [0-511]
cron                 2990224 7CC7AA6    0  S  0
job                  2883112 1DA01D5   25  S  [0-511]
mecho                2859272 24F017F   26  S  [0-511]

real    0m0.058s
user    0m0.024s
sys 0m0.008s

274 times faster, he he. I don't know why. Could be an artifact of scheduling all these brokers on a single node - maybe the ring activity creates a context switch nightmare, while the TBON pattern of propagation can get more work done per wakeup.

@garlick
Member Author

garlick commented Jun 8, 2016

I was able to sneak in between SWL jobs on opal and grab 64 nodes for a few runs. For some reason I didn't hit issue #683 even with 64 brokers per node (4096 ranks).

These times are comparable to what we saw previously on opal with rc1/rc3 disabled, albeit with a somewhat different allocation. See flux-framework/distribution#13.

Very encouraging.

$ salloc -N 64 --time 0:5 ./runit 8
salloc: Pending job allocation 1012834
salloc: job 1012834 queued and waiting for resources
salloc: job 1012834 has been allocated resources
salloc: Granted job allocation 1012834
nodes 64
tasks per node 8

real    0m2.059s
user    0m0.047s
sys 0m0.076s

salloc -N 64 --time 0:30 ./runit 16
salloc: Pending job allocation 1012836
salloc: job 1012836 queued and waiting for resources
salloc: job 1012836 has been allocated resources
salloc: Granted job allocation 1012836
nodes 64
tasks per node 16

real    0m3.409s
user    0m0.074s
sys 0m0.099s
salloc: Relinquishing job allocation 1012836

$ salloc -N 64 --time 0:30 ./runit 32
salloc: Pending job allocation 1012837
salloc: job 1012837 queued and waiting for resources
salloc: job 1012837 has been allocated resources
salloc: Granted job allocation 1012837
nodes 64
tasks per node 32

real    0m4.576s
user    0m0.140s
sys 0m0.189s
salloc: Relinquishing job allocation 1012837

$ salloc -N 64 --time 0:30 ./runit 64
salloc: Pending job allocation 1012840
salloc: job 1012840 queued and waiting for resources
salloc: job 1012840 has been allocated resources
salloc: Granted job allocation 1012840
nodes 64
tasks per node 64

real    0m9.160s
user    0m0.329s
sys 0m0.449s
salloc: Relinquishing job allocation 1012840

The "runit" script looks like this:

#!/bin/bash -e
#
TPN=$1
echo nodes $SLURM_NNODES
echo tasks per node $TPN
time srun --overcommit -N $SLURM_NNODES --ntasks-per-node $TPN flux start /bin/true

@dongahn
Member

dongahn commented Jun 8, 2016

Nice!

What are the overall hop count differences between TBON rpc and ring rpc for the same command?

I haven't looked at the implementation, but it seems to be a true scalability improvement (not an artifact of running everything on a single node). I may be wrong, but doing an RPC to each and every one of the 511 non-root brokers over the ring costs something like:

2 (round trip) x (1 + 2 + 3 + ... + 511 hops) = ~256K hops ... (1)

If we do this over an 8-ary TBON, the cost would be reduced to something like:

2 (round trip) x (1x8 + 2x64 + 3x448 hops) = 2960 hops ... (2)

(1)/(2) already comes out to be a factor of 88.

Now curious about its multi-node improvements :-)
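
For what it's worth, here is a little sketch that grinds out the same back-of-envelope numbers exactly for a 512-rank instance and a k=8 tree rooted at rank 0. It lands near the estimate above (about 90x rather than 88x) because the last TBON level isn't completely full.

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    const uint32_t size = 512;   /* instance size used in the tests above */
    const int k = 8;             /* TBON branching factor */
    uint64_t ring = 0, tbon = 0;

    for (uint32_t i = 1; i < size; i++) {
        ring += 2 * i;           /* ring: rank i is i hops from rank 0 */
        uint32_t r = i, depth = 0;
        while (r != 0) {         /* TBON: depth = hops from rank 0 */
            r = (r - 1) / k;     /* parent in a k-ary tree rooted at 0 */
            depth++;
        }
        tbon += 2 * depth;
    }
    printf ("ring=%llu hops  tbon=%llu hops  ratio=%.0fx\n",
            (unsigned long long)ring, (unsigned long long)tbon,
            (double)ring / (double)tbon);
    return 0;
}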

@garlick
Member Author

garlick commented Jun 8, 2016

Thanks! Well, it's not that clear to me why this is such a win.

Take the case of flux module load -r all on 1024 ranks. It sends all 1024 requests without waiting for any responses. Then it accepts them in the order they arrive and terminates.

On the ring, each request has to travel a varying distance to get to its destination, then that same distance in reverse to get back. The worst case is 1023 hops and each hop adds about 500 microseconds, so worst case RTT is around 1s. Because of the extreme concurrency I didn't expect to see a lot more than this for the whole batch of 1024 RPCs.

On the k=8 TBON, max depth is 4 so worst case RTT is (500us x 4 x 2) or about 4 milliseconds.

Oh, hmm, that's about a 256x speedup, not far off of what I measured (274x). Maybe the back of the envelope is working here :-)
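
The same arithmetic in code form (a sketch of the estimate only, assuming the ~500 us per-hop figure above, 1024 ranks, and a k=8 TBON with max depth 4):

#include <stdio.h>

int main (void)
{
    const double hop_ms = 0.5;                 /* assumed per-hop latency (~500 us) */
    const double ring_rtt = 1023 * 2 * hop_ms; /* 1024 ranks: 1023 hops each way */
    const double tbon_rtt = 4 * 2 * hop_ms;    /* k=8 TBON: max depth 4 each way */
    printf ("ring worst case ~%.0f ms, tbon worst case ~%.0f ms, ratio ~%.0fx\n",
            ring_rtt, tbon_rtt, ring_rtt / tbon_rtt);
    return 0;
}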

@garlick
Member Author

garlick commented Jun 8, 2016

(sorry for the noise, I had numerous typos in the last message which I just corrected)

@dongahn
Member

dongahn commented Jun 9, 2016

I agree the back of the envelope is reasonable. Plus, if you look at the overlay at the global level, you end up pumping fewer messages overall, which should reduce network contention as well.

@garlick
Member Author

garlick commented Jun 9, 2016

Just posted a change to flux-ping which allows it to take a nodeset and use flux_rpc_multi(). In this mode, (min:mean:max) and stddev of RTT values are reported.

I thought I was going to show that the default TBON k should be increased from 2 but the results are not all that conclusive.

Here are some numbers for a 480-rank (30-node) opal instance with k=2:

$ flux ping --count 8 all
all!cmb.ping pad=0 seq=0 time=(4.932:21.285:37.256) ms stddev 9.524
all!cmb.ping pad=0 seq=1 time=(6.389:19.826:35.914) ms stddev 9.305
all!cmb.ping pad=0 seq=2 time=(5.941:19.902:36.633) ms stddev 9.557
all!cmb.ping pad=0 seq=3 time=(5.771:21.693:38.835) ms stddev 10.286
all!cmb.ping pad=0 seq=4 time=(8.338:21.104:38.138) ms stddev 9.446
all!cmb.ping pad=0 seq=5 time=(9.933:23.863:40.205) ms stddev 9.864
all!cmb.ping pad=0 seq=6 time=(9.862:22.005:37.733) ms stddev 8.873
all!cmb.ping pad=0 seq=7 time=(10.683:22.049:38.287) ms stddev 8.895

and k=8

all!cmb.ping pad=0 seq=0 time=(6.627:19.649:35.445) ms stddev 8.766
all!cmb.ping pad=0 seq=1 time=(7.754:18.893:34.514) ms stddev 8.551
all!cmb.ping pad=0 seq=2 time=(11.678:21.522:37.626) ms stddev 8.350
all!cmb.ping pad=0 seq=3 time=(9.231:20.035:35.939) ms stddev 8.579
all!cmb.ping pad=0 seq=4 time=(8.669:19.551:35.513) ms stddev 8.568
all!cmb.ping pad=0 seq=5 time=(6.151:18.930:34.276) ms stddev 8.326
all!cmb.ping pad=0 seq=6 time=(7.251:18.478:34.165) ms stddev 8.613
all!cmb.ping pad=0 seq=7 time=(6.034:19.311:35.710) ms stddev 9.269

Here's the same test on a local session

$ ./flux start -s480
$ ./flux ping --count 8 all
all!cmb.ping pad=0 seq=0 time=(4.284:34.006:42.426) ms stddev 9.370
all!cmb.ping pad=0 seq=1 time=(3.228:24.455:41.730) ms stddev 11.722
all!cmb.ping pad=0 seq=2 time=(2.655:29.259:40.855) ms stddev 13.231
all!cmb.ping pad=0 seq=3 time=(4.769:26.413:39.947) ms stddev 10.797
all!cmb.ping pad=0 seq=4 time=(2.860:31.219:43.993) ms stddev 14.165
all!cmb.ping pad=0 seq=5 time=(5.585:27.135:41.134) ms stddev 12.602
all!cmb.ping pad=0 seq=6 time=(2.437:27.978:42.471) ms stddev 12.798
all!cmb.ping pad=0 seq=7 time=(3.559:27.185:39.263) ms stddev 11.012
$ exit
$ ./flux start -s480 -o,--k-ary=8
$ ./flux ping --count 8 all
all!cmb.ping pad=0 seq=0 time=(3.509:16.350:25.037) ms stddev 7.811
all!cmb.ping pad=0 seq=1 time=(3.321:14.243:23.832) ms stddev 7.030
all!cmb.ping pad=0 seq=2 time=(3.284:14.034:23.808) ms stddev 6.037
all!cmb.ping pad=0 seq=3 time=(4.359:14.753:25.901) ms stddev 6.748
all!cmb.ping pad=0 seq=4 time=(3.507:18.029:27.760) ms stddev 8.864
all!cmb.ping pad=0 seq=5 time=(3.672:18.689:26.709) ms stddev 7.013
all!cmb.ping pad=0 seq=6 time=(3.517:17.393:28.467) ms stddev 8.429
all!cmb.ping pad=0 seq=7 time=(2.778:16.134:27.885) ms stddev 8.465

@garlick
Member Author

garlick commented Jun 9, 2016

Maybe mrpc/mping ought to go, though.

local instance size=480

$ ./flux ping all
all!cmb.ping pad=0 seq=0 time=(3.825:14.199:26.649) ms stddev 6.368
all!cmb.ping pad=0 seq=1 time=(3.140:15.008:24.807) ms stddev 7.444
all!cmb.ping pad=0 seq=2 time=(3.304:14.806:26.047) ms stddev 7.594
all!cmb.ping pad=0 seq=3 time=(2.913:15.072:23.902) ms stddev 7.433
all!cmb.ping pad=0 seq=4 time=(2.864:13.511:23.033) ms stddev 6.471
$ ./flux mping 0-479
flux-mping: mecho: pad=0 seq=0 time=204.732 ms
flux-mping: mecho: pad=0 seq=1 time=208.784 ms
flux-mping: mecho: pad=0 seq=2 time=221.829 ms
flux-mping: mecho: pad=0 seq=3 time=174.639 ms
flux-mping: mecho: pad=0 seq=4 time=172.620 ms
flux-mping: mecho: pad=0 seq=5 time=202.535 ms

@grondo
Contributor

grondo commented Jun 9, 2016

by "ought to go" you mean be removed?

Possibly.. I don't think there are any current users of the interface (besides mping) except the mecho demo program?

How does the event based ping scale with larger payloads?

@garlick
Member Author

garlick commented Jun 9, 2016

Yes, I meant it should be removed.

Note that the above is not event based. It's simply flux_rpc_multi() with routing of (many) requests over the TBON instead of the ring.

Good question about payloads. I just ran some quick tests of mping versus ping in the 480-rank session on my desktop, at payload sizes of 8K, 64K, and 256K; mping was 8x, 5x, and 15x slower, respectively.

@grondo
Contributor

grondo commented Jun 9, 2016

Note that the above is not event based. It's simply flux_rpc_multi() with routing of (many) requests over the TBON instead of the ring.

Somehow I completely missed that! (I blame reading the initial comment on a cell phone.) That is awesome!

@grondo
Contributor

grondo commented Jun 9, 2016

In case it wasn't clear, I feel your new flux_rpc_multi() is vastly superior to mrpc -- since there are not really any users, I certainly vote to dump it. ping is basically an echo service as well, correct? (it copies the payload in and out) -- so there is no need to replace mecho with some other kind of demo service.

Really nice result here. Even better than I initially thought.

@garlick
Member Author

garlick commented Jun 9, 2016

Thanks! Ping is more or less the same, yes (it does add the route taken by the request to the return payload).

@grondo
Contributor

grondo commented Jun 9, 2016

My only suggestion is to maybe expand the in-code comment in cf7864b that explains why the route is popped twice. It was very clear after reading your excellent commit message, but I fear it is missing a little context for someone just stumbling across the code (could just be me, though).

garlick added a commit to garlick/flux-core that referenced this pull request Jun 9, 2016
flux_mrpc() is a prototype "multi-rpc" interface based on the KVS.

The much improved RPC API design in flux/rpc.h includes
flux_rpc_multi(), which performs the same function directly,
without using the KVS, and performs 5-15X faster depending on
payload and session size (as discussed in PR flux-framework#689).

Since the API is inferior, and the dumb design now outperforms
the "smart" scalable design, it's time to retire flux_mrpc().
Further optimizations for scalability should take place behind
the new API.

This change also deprecates
- mrpc python bindings
- mrpc lua bindings
- flux-mping
- mecho module
- t1003-mecho.t sharness test
- lua mrpc sharness test

pymod (demo of python comms module) was temporarily taken out of
the modules Makefile.am SUBDIRS pending reimplementation based on
something besides the mrpc python bindings.
@coveralls

Coverage Status

Coverage increased (+0.2%) to 74.969% when pulling 1a5473c on garlick:tbon_routing into 71123f5 on flux-framework:master.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 74.945% when pulling ae67d7b on garlick:tbon_routing into 71123f5 on flux-framework:master.

The ring overlay latency is linearly proportional to the distance
between ranks, which can be significant in a large instance.  Therefore,
instead of routing all RPCs that target a specific rank via the ring,
use the TBON.

If the target rank is a descendant of the current broker rank, send it
"down" the TBON.  Otherwise send it "up".  Since all ranks are
descendants of rank 0, eventually a route will be found.
Routes on the TBON are currently static, so exploit this property
to calculate routes in lieu of maintaining dynamic routing tables.
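
A hypothetical sketch of that decision (not the broker's actual code), assuming children of rank i are k*i+1 .. k*i+k in a k-ary tree rooted at rank 0:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if 'target' lies in the subtree rooted at
 * 'rank', found by walking the target up toward the root. */
static bool in_subtree (int k, uint32_t rank, uint32_t target)
{
    while (target > rank)
        target = (target - 1) / k;   /* step up to the parent */
    return target == rank;
}

/* Routing decision: if in_subtree (k, myrank, dst) the request goes
 * "down" toward the child branch containing dst; otherwise it goes
 * "up" toward rank 0, which is an ancestor of every rank. */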

Tricky:
RPCs accumulate a "route stack" as the request travels towards its
destination.  This stack is unwound to direct the response along the
same path in reverse.  A special direction-sensitive property of the
zeromq DEALER-ROUTER sockets used in the TBON (specifically the ROUTER
socket) is that it pushes the peer socket's identity onto the route stack when
a message is travelling "up" towards the root, and pops an identity off
the stack when a message is travelling "down" away from the root.
The popped identity selects the peer branch.
See also:  http://api.zeromq.org/4-1:zmq-socket

When responses are routed "up", the ROUTER behavior must be subverted on
the receiving end by popping two frames off the stack and discarding them.
When requests are routed "down", the ROUTER behavior must be subverted on
the sending end by pushing the identity of the sender, followed by the
identity of the peer we want to route to, onto the stack.
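
For readers unfamiliar with ROUTER sockets, here is a hypothetical czmq-flavoured sketch of the two manipulations just described; my_uuid and child_uuid are placeholder names, and this is not the broker's actual code.

#include <czmq.h>

/* Response arriving on its way "up": the receiving ROUTER socket already
 * pushed the peer's identity, so pop two frames and discard them, as
 * described above, before passing the response along. */
static void response_up (zmsg_t *msg)
{
    char *id;
    if ((id = zmsg_popstr (msg)))
        free (id);
    if ((id = zmsg_popstr (msg)))
        free (id);
}

/* Request being sent "down": push the sender's identity (for the return
 * trip), then the identity of the child peer that leads to the target;
 * the sending ROUTER socket pops the top frame to select that peer. */
static void request_down (zmsg_t *msg, const char *my_uuid, const char *child_uuid)
{
    zmsg_pushstr (msg, my_uuid);
    zmsg_pushstr (msg, child_uuid);
}
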
Add tbon.level, tbon.maxlevel, and tbon.descendants attributes.
Rename the tbon-arity attribute to tbon.arity for uniformity.

These values are potentially useful when implementing reductions.
Although they can be computed elsewhere using the kary convenience
functions, it seemed better to localize these values as broker
attributes so that they can change when TBON routes become dynamic.

Require broker --k-ary option to be > 0.

Use the kary class to compute the parent id rather than winging it in
the PMI bootstrap code.
Rather than calculating TBON parameters, ask the broker
for them.
API users should use flux_attr_get() to obtain the tbon.arity
attribute.  The flux_get_arity() convenience wrapper isn't used much and
seems less appropriate given the other TBON attributes now
available.
Add fake attributes for tbon.level and tbon.maxlevel so
flux_reduce_create() can succeed in FLUX_REDUCE_TIMEDFLUSH
mode when the handle was opened on the loop connector.
Use the RPC abstraction for flux-ping instead of message handlers.
Use flux_rpc_multi() if multiple ranks are specified, and
display statistics on the set of RTT values.
flux_mrpc() is a prototype "multi-rpc" interface based on the KVS.

The much improved RPC API design in flux/rpc.h includes
flux_rpc_multi(), which performs the same function directly,
without using the KVS, and performs 5-15X faster depending on
payload and session size (as discussed in PR flux-framework#689).

Since the API is inferior, and the dumb design now outperforms
the "smart" scalable design, it's time to retire flux_mrpc().
Further optimizations for scalability should take place behind
the new API.

This change also deprecates
- mrpc python bindings
- mrpc lua bindings
- flux-mping
- mecho module
- t1003-mecho.t sharness test
- lua mrpc sharness test

pymod (demo of python comms module) was temporarily taken out of
the modules Makefile.am SUBDIRS pending reimplementation based on
something besides the mrpc python bindings.
@garlick
Member Author

garlick commented Jun 9, 2016

Just squashed down the incremental changes.

@grondo hopefully the comments in e4cfee3 are a little more clear.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 74.989% when pulling f8163cc on garlick:tbon_routing into 71123f5 on flux-framework:master.

@garlick
Member Author

garlick commented Jun 10, 2016

This one might be ready for a merge.

@trws just a heads up - I disabled "pymod" because it uses mrpc, now deprecated. Leaving that one for you to rework later as time permits.

@grondo
Contributor

grondo commented Jun 10, 2016

@grondo hopefully the comments in e4cfee3 are a little more clear.

Yes, great! Merging...

@grondo grondo merged commit 2bde1f8 into flux-framework:master Jun 10, 2016
@grondo grondo removed the review label Jun 10, 2016
@garlick garlick deleted the tbon_routing branch June 10, 2016 14:58
@trws
Member

trws commented Jun 10, 2016

Sounds good to me. I'll update pymod to use something else, but since it's an example module more than a functional one, it may wait a bit.
