use TBON not ring network for RPCs #689
Conversation
This single node test is encouraging! master:
This PR
Whoa! Nice work! I will try to check this out tomorrow but sounds like a
Just looking at master
This PR
274 times faster, he he. I don't know why. Could be an artifact of scheduling all these brokers on a single node - maybe the ring activity creates a context switch nightmare, while the TBON pattern of propagation can get more work done per wakeup.
I was able to sneak in between SWL jobs on opal and grab 64 nodes for a few runs. For some reason I didn't hit issue #683 even with 64 brokers per node (4096 ranks). These times are comparable to what we saw previously on opal with rc1/rc3 disabled, albeit with a somewhat different allocation. See flux-framework/distribution#13. Very encouraging.
The "runit" script looks like this: #/bin/bash -e
#
TPN=$1
echo nodes $SLURM_NNODES
echo tasks per node $TPN
time srun --overcommit -N $SLURM_NNODES --ntasks-per-node $TPN flux start /bin/true |
Nice! What are the overall hop count differences between TBON rpc and ring rpc for the same command? I haven't looked at the implementation but it seems to be a true scalability improvement (not an artifact of running everything on a single node).

I may be wrong, but doing an RPC to each and every one of 511 rank brokers could cost something like: 2 (round trip) x (1 hop + 2 hops + 3 hops + ... + 511 hops) = 256K hops .... (1)

If we do this over an 8-ary TBON, the cost would be reduced to something like the sum of round trips to each rank at its tree depth .... (2)

(1)/(2) already comes out to be a factor of 88. Now curious about its multi-node improvements :-)
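To put some code behind that estimate, here is a small standalone sketch (not flux-core code) that tallies round-trip hops for an RPC from rank 0 to every other rank, once over the ring and once over a k-ary tree. It assumes the usual breadth-first numbering in which the parent of rank i is (i-1)/k:

```c
/* hopcount.c - rough hop-count comparison: ring vs. k-ary TBON.
 * Standalone sketch, not flux-core code. Assumes ranks are numbered
 * breadth-first so the parent of rank i (i > 0) is (i - 1) / k.
 */
#include <stdio.h>

/* depth of rank i in a k-ary tree rooted at rank 0 */
static int kary_depth (int k, int i)
{
    int d = 0;
    while (i > 0) {
        i = (i - 1) / k;    /* step to the parent */
        d++;
    }
    return d;
}

int main (void)
{
    const int size = 512;   /* rank 0 plus 511 RPC targets */
    const int k = 8;        /* TBON fan-out */
    long ring = 0, tbon = 0;

    for (int rank = 1; rank < size; rank++) {
        ring += 2 * rank;                   /* out and back along the ring */
        tbon += 2 * kary_depth (k, rank);   /* down the tree and back up */
    }
    printf ("ring: %ld hops, tbon (k=%d): %ld hops, ratio: %.0f\n",
            ring, k, tbon, (double)ring / tbon);
    return 0;
}
```

For these numbers it prints a ratio of roughly 90, in the same neighborhood as the factor of 88 above; the exact value depends on how the tree is laid out.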
Thanks! Well, it's not that clear to me why this is such a win. Take the case of 1024 ranks. On the ring, each request has to travel a varying distance to get to its destination, then that same distance in reverse to get back. The worst case is 1023 hops each way, and each hop adds about 500 microseconds, so the worst case RTT is around 1s. Because of the extreme concurrency I didn't expect to see a lot more than this for the whole batch of 1024 RPCs. On the k=8 TBON, max depth is 4, so the worst case RTT is (500us x 4 x 2) or about 4 milliseconds. Oh, hmm, that's about a 256x speedup, not far off of what I measured (274x). Maybe the back of the envelope is working here :-)
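The same worst-case arithmetic, written out as a sketch using the 500 microsecond per-hop figure quoted above (an illustration of the estimate, not a measurement):

```c
/* Worst-case RTT estimate for a 1024-rank session, per the figures above. */
#include <stdio.h>

int main (void)
{
    const double hop_us = 500.0;  /* rough per-hop latency */
    const int ring_hops = 1023;   /* worst-case one-way distance on the ring */
    const int tbon_hops = 4;      /* max depth of the k=8 TBON */

    double ring_rtt_ms = 2 * ring_hops * hop_us / 1000;  /* ~1023 ms */
    double tbon_rtt_ms = 2 * tbon_hops * hop_us / 1000;  /* ~4 ms */
    printf ("ring: %.0f ms, tbon: %.0f ms, ratio: %.0f\n",
            ring_rtt_ms, tbon_rtt_ms, ring_rtt_ms / tbon_rtt_ms);
    return 0;
}
```

This reproduces the ~256x figure mentioned above.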
(sorry for the noise, I had numerous typos in the last message which I just corrected)
I agree the back of the envelope is reasonable. Plus, if you look at the overlay at the global level, you end up pumping a smaller number of messages, and this should reduce network contention as well.
Just posted a change to
I thought I was going to show that the default TBON k should be increased from 2, but the results are not all that conclusive. Here are some numbers for a 480-rank (30 node) opal instance with k=2
and k=8
Here's the same test on a local session
Maybe mrpc/mping ought to go, though. Local instance, size=480:
by "ought to go" you mean be removed? Possibly.. I don't think there are any current users of the interface (besides mping) except the How does the event based ping scale with larger payloads? |
Yes, I meant be removed. Note that the above is not event-based. It's simply
Good question about payloads. I just ran some quick tests in the 480-rank session on my desktop comparing mping versus ping at payload sizes of 8K, 64K, and 256K; mping was 8x, 5x, and 15x slower, respectively.
Somehow I completely missed that! (I blame reading the initial comment on a cell phone.) That is awesome!
In case it wasn't clear, I feel your new
Really nice result here. Even better than I initially thought.
Thanks! Ping is more or less the same, yes (it does add the route taken by the request to the return payload).
Only suggestion is maybe expand the in-code comment in cf7864b that explains why the route is popped twice. It was very clear after reading your excellent commit message, but I fear it is missing a little context for someone just stumbling across the code (could just be me though).
The ring overlay latency is linearly proportional to the distance between ranks, which can be large in a big instance. Therefore, instead of routing all RPCs that target a specific rank via the ring, use the TBON. If the target rank is a descendant of the current broker rank, send it "down" the TBON. Otherwise send it "up". Since all ranks are descendants of rank 0, eventually a route will be found.

Routes on the TBON are currently static, so exploit this property to calculate routes in lieu of maintaining dynamic routing tables.

Tricky: RPCs accumulate a "route stack" as the request travels towards its destination. This stack is unwound to direct the response along the same path in reverse. A special direction-sensitive property of the zeromq DEALER-ROUTER sockets used in the TBON (specifically the ROUTER socket) is that the ROUTER socket pushes the peer socket's identity onto the route stack when a message is travelling "up" towards the root, and pops an identity off the stack when a message is travelling "down" away from the root. The popped identity selects the peer branch. See also: http://api.zeromq.org/4-1:zmq-socket

When responses are routed "up", the ROUTER behavior must be subverted on the receiving end by popping two frames off of the stack and discarding them. When requests are routed "down", the ROUTER behavior must be subverted on the sending end by pushing the identity of the sender, followed by the identity of the peer we want to route to, onto the stack.
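As an illustration of that routing decision (a standalone sketch using my own breadth-first k-ary helpers, not the broker or kary code), the next hop for a request reduces to a descendant test:

```c
/* Sketch of the "down if descendant, otherwise up" forwarding decision.
 * Assumes the conventional breadth-first k-ary numbering in which the
 * parent of rank i (i > 0) is (i - 1) / k, and that dst != src.
 * Illustrative only; not the flux-core kary API.
 */
#include <stdio.h>

static unsigned parent_of (int k, unsigned i)
{
    return (i - 1) / k;
}

/* If dst lies in the subtree rooted at src, return the child of src to
 * forward "down" to; otherwise return src's parent to forward "up".
 */
static unsigned next_hop (int k, unsigned src, unsigned dst)
{
    unsigned r = dst;
    while (r > src) {               /* descendants always have larger ranks */
        unsigned p = parent_of (k, r);
        if (p == src)
            return r;               /* dst is a descendant via child r */
        r = p;
    }
    return parent_of (k, src);      /* not a descendant; head toward rank 0 */
}

int main (void)
{
    printf ("%u\n", next_hop (2, 1, 9));  /* 9 is under rank 1: go down to 4 */
    printf ("%u\n", next_hop (2, 2, 9));  /* 9 is not under rank 2: go up to 0 */
    return 0;
}
```

Since the topology is static, each broker can make this decision locally without a routing table, which is the property the commit exploits.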
Add tbon.level, tbon.maxlevel, and tbon.descendants attributes. Rename the tbon-arity attribute to tbon.arity for uniformity. These values are potentially useful when implementing reductions. Although they can be computed elsewhere using the kary convenience functions, it seemed better to localize these values as broker attributes so that they can change when TBON routes become dynamic.

Require the broker --k-ary option to be > 0.

Use the kary class to compute the parent id rather than winging it in the PMI bootstrap code.
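For reference, here is a rough standalone sketch of how tbon.level, tbon.maxlevel, and tbon.descendants can be derived from a static k-ary topology (my own helper math for illustration, not the kary functions referenced above):

```c
/* Derive tbon.level, tbon.maxlevel, and tbon.descendants for a static
 * k-ary tree of "size" ranks numbered breadth-first from 0.
 * Illustrative only; not the flux-core kary implementation.
 */
#include <stdio.h>

static int level_of (int k, unsigned rank)
{
    int level = 0;
    while (rank > 0) {
        rank = (rank - 1) / k;      /* step to the parent */
        level++;
    }
    return level;
}

static int maxlevel_of (int k, unsigned size)
{
    return level_of (k, size - 1);  /* the last rank is the deepest */
}

static unsigned descendants_of (int k, unsigned size, unsigned rank)
{
    unsigned count = 0;
    for (unsigned i = 0; i < size; i++) {
        unsigned r = i;
        while (r > rank)            /* walk i's ancestry toward the root */
            r = (r - 1) / k;
        if (r == rank && i != rank)
            count++;
    }
    return count;
}

int main (void)
{
    const int k = 2;
    const unsigned size = 8;
    for (unsigned rank = 0; rank < size; rank++)
        printf ("rank %u: tbon.level=%d tbon.maxlevel=%d tbon.descendants=%u\n",
                rank, level_of (k, rank), maxlevel_of (k, size),
                descendants_of (k, size, rank));
    return 0;
}
```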
Rather than calculating TBON parameters, ask the broker for them.
API users should use flux_attr_get() to obtain the tbon.arity attribute. This convenience wrapper isn't used much and seems less appropriate given the other TBON attributes now available.
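A minimal sketch of what that looks like from an API user's point of view. It assumes the two-argument form of flux_attr_get(); the exact signature may differ by flux-core version, so check flux_attr_get(3):

```c
/* Sketch: read TBON attributes from a running instance via flux_attr_get().
 * Assumes the two-argument flux_attr_get() form; the signature in the
 * flux-core version at hand may differ.
 */
#include <stdio.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h = flux_open (NULL, 0);
    if (!h) {
        perror ("flux_open");
        return 1;
    }
    const char *arity = flux_attr_get (h, "tbon.arity");
    const char *level = flux_attr_get (h, "tbon.level");
    const char *maxlevel = flux_attr_get (h, "tbon.maxlevel");
    printf ("tbon.arity=%s tbon.level=%s tbon.maxlevel=%s\n",
            arity ? arity : "?", level ? level : "?",
            maxlevel ? maxlevel : "?");
    flux_close (h);
    return 0;
}
```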
Add fake attributes for tbon.level and tbon.maxlevel so flux_reduce_create() can succeed in FLUX_REDUCE_TIMEDFLUSH mode when the handle was opened on the loop connector.
Use the RPC abstraction for flux-ping instead of message handlers.
Use flux_rpc_multi() if multiple ranks are specified, and display statistics on the set of RTT values.
flux_mrpc() is a prototype "multi-rpc" interface based on the KVS. The much improved RPC API design in flux/rpc.h includes flux_rpc_multi(), which performs the same function directly, without using the KVS, and performs 5-15X faster depending on payload and session size (as discussed in pr flux-framework#689). Since the API is inferior, and the dumb design now outperforms the "smart" scalable design, it's time to retire flux_mrpc(). Further optimizations for scalability should take place behind the new API.

This change also deprecates
- mrpc python bindings
- mrpc lua bindings
- flux-mping
- mecho module
- t1003-mecho.t sharness test
- lua mrpc sharness test

pymod (demo of python comms module) was temporarily taken out of the modules Makefile.am SUBDIRS pending reimplementation based on something besides the mrpc python bindings.
This one might be ready for a merge. @trws just a heads up - I disabled "pymod" because it uses mrpc, now deprecated. Leaving that one for you to rework later as time permits.
Sounds good to me. I'll update pymod to use something else, but since it's an example module more than a functional one it may wait a bit.
RPCs that target a specific nodeid were routed via the ring network, which has latency issues on larger instances. This PR restores the ability to route requests and responses in both directions on the TBON (confusing to think about as it is). This time around, the commit (cf7864b) that changes the broker itself is at least fairly minimal.
Routes are calculated based on a static TBON topology, since in this phase we have disabled the "self-healing" stuff. In the future, with resilience as well as grow/shrink, routes will need to be dynamic and probably employ routing tables, or a combination of tables and calculations.
Some additional TBON parameters were exported via attributes (current level, max levels, number of descendants). The flux_get_arity() convenience function was dropped since that parameter is now one of several TBON parameters available as attributes. While users could simply calculate things like the level and number of descendants based on a static topology, I thought it best to keep the static calculations localized to the broker, so after grow/shrink/heal we won't have to track down other users. The reduction code was updated to obtain level and max level (used to scale timeouts) from attributes rather than from calculation.
I haven't been able to get time on opal (it is just back up after the power failure) to see if there is any impact on startup. The worst case latencies for single pings have improved dramatically.
The ring network is no more.