coll/tuned: Change the default collective algorithm selection #7730
Current iteration just uses if/else blocks. I originally had it using range switch cases, but that is non-standard so I threw it out. In the future, I want to revamp the fixed code to work more like the dynamic code, in that it checks a data structure (sketched below) to avoid as many branches and checks. It also looks like the dynamic code needs some bugfixes - there are no checks for non-commutative ops, which I would like to see added.
You can find the graphs and averaged data here: https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3
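For illustration, here is a minimal sketch of that table-driven direction; the struct, the catch-all entry, and all names are hypothetical, not the actual coll/tuned code (the two cutoff values are taken from the allreduce hunk quoted further down):

#include <stddef.h>

/* Hypothetical sketch only: each entry maps a message-size upper bound to an
 * algorithm id, so the fixed decision becomes a table walk instead of nested
 * if/else blocks. */
struct size_cutoff {
    size_t max_dsize;   /* entry applies while total_dsize < max_dsize */
    int    alg;         /* algorithm id to use below that size */
};

static const struct size_cutoff allreduce_cutoffs[] = {
    { 2048,       2 },
    { 16384,      1 },
    { (size_t)-1, 3 },  /* catch-all for larger messages (value made up) */
};

static int pick_alg(size_t total_dsize)
{
    const size_t n = sizeof(allreduce_cutoffs) / sizeof(allreduce_cutoffs[0]);
    for (size_t i = 0; i + 1 < n; ++i) {
        if (total_dsize < allreduce_cutoffs[i].max_dsize) {
            return allreduce_cutoffs[i].alg;
        }
    }
    return allreduce_cutoffs[n - 1].alg;  /* fall back to the last entry */
}

The point is only that the switch points live in one data structure, closer in spirit to how the dynamic code consults its rules, instead of being spread across branches.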
Need to update the non-commutative/commutative code. Also need to check whether any algorithms require power-of-two process counts.
Can one of the admins verify this patch?
Looks like there are no issues with non-power-of-two cases. There is a fallback available in all the algorithms currently in the tuned component.
ok to test
Rewrote the non-commutative checks, should be good to go now.
I kind of snoozed during the discussion about how this performance data was gathered. So let me ask here: how were these processes placed on the nodes when the experiments were done?
if (total_dsize < 2048) {
    alg = 2;
} else if (total_dsize < 16384) {
    alg = 1;
I understand there is data to back this up, but I just can't wrap my mind around the claim that an allreduce implemented as a reduce followed by a bcast can have better performance than a recursive doubling implementation. We are talking about at least twice the latency, so even if the data is small, when you have 256 processes this should count for something.
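For context, the "twice the latency" point follows from the usual alpha-beta cost model (a sketch, assuming binomial-tree reduce and bcast and textbook recursive doubling; bandwidth and reduction terms omitted):

$$T_{\mathrm{reduce+bcast}} \approx 2\,\lceil \log_2 p \rceil\,\alpha + \dots, \qquad T_{\mathrm{recursive\ doubling}} \approx \lceil \log_2 p \rceil\,\alpha + \dots$$

where $p$ is the number of processes and $\alpha$ the per-message latency; for small messages the $\alpha$ terms dominate, which is what makes the measured results surprising.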
There's only one size where it outperforms. From the data, it seems like recursive doubling doesn't do well until larger message sizes.
The column numbers in the table below correspond to:
0 - current defaults
1 - linear
2 - non-overlapping
3 - recursive doubling
I can't say I understand why this is the case, but I have the data for it.
Allreduce
#Nodes Message_size 0 1 2 3 4 5 6
128 4 2913.64(4878.65) 1344.74(3104.49) 1259.40(3147.54) 3230.74(5481.80) 3002.00(4954.51) 2857.97(4782.51) 2826.67(4517.93)
128 8 2899.49(4864.18) 1316.23(3038.82) 1184.93(2940.13) 3150.57(5384.97) 3071.26(5012.43) 2879.02(4781.92) 2811.67(4574.71)
128 16 2866.31(4821.69) 1312.63(3046.77) 1182.64(2946.13) 3097.51(5284.69) 2958.97(4890.04) 2872.01(4829.44) 2810.05(4509.48)
128 32 2914.33(4893.33) 1308.55(3047.86) 1188.79(2947.79) 3155.50(5369.73) 2982.11(4939.97) 2580.43(4431.11) 2748.29(4489.69)
128 64 2905.83(4882.22) 1316.48(3050.56) 1184.62(2949.98) 3047.34(5201.51) 3024.41(5004.69) 2202.25(4277.14) 2673.65(4290.02)
128 128 2915.41(4892.76) 1326.50(3045.12) 1182.40(2937.46) 2886.84(4847.97) 2928.42(4899.58) 2143.79(4226.11) 1998.89(3815.83)
128 256 2713.00(4612.54) 1334.70(3033.14) 1188.84(2936.60) 2867.93(4816.76) 3015.24(5060.37) 2184.14(4284.24) 2181.36(4051.38)
128 512 3149.83(6395.99) 1380.31(3043.26) 1187.99(2936.51) 3485.17(5908.56) 13889.51(20102.06) 10352.55(16418.76) 4567.20(9209.34)
128 1024 3152.68(6385.89) 1418.47(3044.78) 1205.74(2950.83) 3540.21(5987.08) 14348.81(21068.35) 9486.46(15085.37) 4549.55(9041.11)
128 2048 3200.28(6455.68) 1476.17(3038.72) 3643.84(11269.21) 2918.98(5652.02) 15048.90(22636.64) 9004.92(18176.30) 5113.47(10475.57)
128 4096 3295.32(6555.58) 1679.44(3024.81) 6412.64(21712.51) 2848.15(5647.26) 13569.57(22881.78) 8980.50(18174.79) 5142.45(10647.91)
128 8192 3338.42(6519.30) 2100.74(3086.32) 2568.41(6543.05) 3011.61(5837.56) 13309.64(21974.66) 9287.97(18097.78) 5775.48(11949.52)
128 16384 11371.09(19503.20) 3747.43(3197.73) 1222.98(2309.60) 3065.43(5546.04) 13403.42(21779.87) 9568.73(17933.32) 5752.29(11851.30)
128 32768 9273.18(18127.16) 8865.84(11979.62) 1805.81(2428.45) 3486.90(5929.51) 10867.88(18496.86) 9302.34(18080.09) 6026.58(12574.90)
128 65536 9094.06(18570.70) 15196.44(16023.66) 4304.25(3354.22) 4532.85(7355.35) 8893.91(17602.65) 9359.78(18103.13) 6373.04(13157.22)
128 131072 9248.30(18648.35) 21599.01(19537.89) 7298.66(4853.51) 6158.44(8748.26) 9256.05(18023.50) 10528.02(18082.28) 6604.47(13622.15)
128 262144 10061.86(18937.07) 41759.74(44516.10) 11518.40(10318.94) 10147.57(14956.15) 12548.72(21517.08) 11036.23(18010.11) 6933.89(13948.64)
128 524288 154569.99(411382.35) 66071.82(46407.64) 13104.86(16139.88) 17745.35(22907.08) 73804.77(209112.42) 169329.06(434459.12) 8801.50(17421.10)
128 1048576 77388.83(217694.92) 126609.06(95737.94) 24363.44(33186.15) 33791.94(40007.01) 30265.35(68950.84) 102546.42(366719.82) 12353.21(21203.30)
Oh, a small note: I am using similar logic to how the dynamic code works (for an unknown rank count, use the next smallest known rank count). The checks here select the previous smallest data value, so since we have data for 128 and 256, we use the 128-rank data for 128-255 ranks, the 256-rank data for 256-511, etc.
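A rough sketch of that round-down selection, with made-up measured sizes and names (not the actual code):

static const int measured_comm_sizes[] = { 16, 32, 64, 128, 256 };

/* Pick the largest communicator size we have data for that does not exceed
 * the actual communicator size, e.g. 128..255 ranks -> 128, 256..511 -> 256. */
static int round_down_to_measured(int comm_size)
{
    const int n = (int)(sizeof(measured_comm_sizes) / sizeof(measured_comm_sizes[0]));
    int chosen = measured_comm_sizes[0];
    for (int i = 0; i < n; ++i) {
        if (measured_comm_sizes[i] <= comm_size) {
            chosen = measured_comm_sizes[i];
        }
    }
    return chosen;
}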
There is probably a better way to assign the data to ranks than this method. We could do some midpoint logic, i.e. if we have data for 32, 64, and 128 nodes, the data from 64 nodes would represent [(32 + 64)/2, (128 + 64)/2), or [48, 96); a generalized sketch follows after this snippet:

if (rank < 48) {
    /* use the 32-rank data */
} else if (rank < 96) {
    /* use the 64-rank data */
} else {
    /* use the 128-rank data */
}
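A generalized sketch of that midpoint idea (again, illustrative names and sizes only):

static const int measured_sizes[] = { 32, 64, 128 };

/* Map the actual communicator size to the nearest measured size, using the
 * midpoints between neighbouring measurements as the cut lines, e.g. the
 * 64-node data covers [48, 96). */
static int closest_measured(int comm_size)
{
    const int n = (int)(sizeof(measured_sizes) / sizeof(measured_sizes[0]));
    for (int i = 0; i + 1 < n; ++i) {
        int midpoint = (measured_sizes[i] + measured_sizes[i + 1]) / 2;
        if (comm_size < midpoint) {
            return measured_sizes[i];
        }
    }
    return measured_sizes[n - 1];
}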
I didn't particularly ask for anyone to place processes with any particular pattern; people selected different ways to submit the jobs. One dataset I had used "--map-by ppr:$MAX_NUM_CORES:node --map-by core" and fully subscribed each node. However, most of the datasets I received were generated with an ordinary mpirun command.
That's the data I submitted, and the binding was by core.
In this case I am afraid the new decision functions are based on a suboptimal use of the network, totally dependent on the process placement. On the good side, we might now have an explanation for the discrepancy I noticed between linear and recursive doubling: their communication schemes differ, one going more across the inter-node network and the other more across the intra-node one. On the bad side, a different process placement will lead to drastically different collective performance results, and thus a wildly different decision logic. So, let me propose something here. The new module I PR'ed a few days ago, #7735, can provide the logic to split a collective into inter-node and intra-node parts, and could use tuned for each of these. So, if we regenerate the decision functions in tuned using one process per node, and then use shared for the intra-node collectives, we could obtain a highly efficient, process-placement-agnostic collective framework. We would however need to redo these experiments with a single process per node. We can talk about this in more detail at the next call.
Hmm, it sounds feasible, but we'd have to redo data collection. The good news is it wouldn't take very long if we're only doing 1 proc per node. We'd have to agree that the HAN component is going to be the new default component and that the tuned component will be for inter-node only; otherwise I can see some tuning issues arising if HAN isn't used and only tuned is. Two questions I have are:
1. The HAN component doesn't seem to implement all our current collectives yet - is this going to be a "work in progress" kind of situation, and what is going to be the behavior for the collectives that HAN doesn't support until we fill the gaps?
2. From a high level, can you confirm my understanding: we will use MPI_Comm_split_type(COMM_TYPE_SHARED), which will generate a local comm for each node in the cluster. Then we'll select one rank from each local comm and create an inter-node comm. For intra-node calls, it'll use coll/shared and the intra-node comm; for inter-node calls, it'll use coll/tuned and the inter-node comm (see the sketch below).
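A rough sketch of that communicator construction as described above (not HAN's or coll/shared's actual code; the function and variable names are made up):

#include <mpi.h>

/* Build one shared-memory communicator per node plus an inter-node
 * communicator containing one leader rank per node. */
static void build_hierarchy(MPI_Comm comm, MPI_Comm *local, MPI_Comm *leaders)
{
    int world_rank, local_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Ranks that share a node land in the same local communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, local);
    MPI_Comm_rank(*local, &local_rank);

    /* Rank 0 of each local communicator joins the inter-node communicator;
     * all other ranks get MPI_COMM_NULL. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, leaders);
}

Intra-node collectives would then run on the local communicator (coll/shared) and inter-node collectives on the leader communicator (coll/tuned), which is what would make the tuned cutoffs placement-agnostic.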
@wckzhang would you be able to present a short summary of the findings and new cutoffs on the Tuesday call? I would propose two or three PPT slides with a concise executive summary that is easy to digest.
Hmm, I can try writing something up and sharing it on Tuesday. We might have to redo some things if we want to go with George's idea.
@wckzhang just something super easy to understand for a broad audience; just one or two graphics that show "here is where we were without your awesome feature, here is where we are now, and it will only get better with more data collection."
@bosilca I think it's too big an ask to have institutions go back and redo these experiments after they've generously donated so many cycles already. I think we need to carefully analyze the data we have.
Yeah sure, I can do that @jladd-mlnx.
As promised during last week's call, here is the impact of process placement on collective communication algorithms that are supposed to be agnostic to it. The results were gathered by Xi on a small cluster at UT (Intel 16x8 cores, IB 100), but the outcome remains consistent across platforms and networks. The labels correspond to the OMPI map-by process placement (except random, obviously). Good news: we are not the only implementation having trouble with process placement, plus we have a solution to address it before 5.0.
Could you perhaps explain what "latency" means when talking about a collective operation? Just want to ensure I really understand the graphs. FWIW: we automatically default to by-core mapping if num procs = 2, and by-socket mapping otherwise. I know IMPI does the same. So it would appear to me that IMPI and OMPI (both ADAPT and tuned) are doing pretty well out-of-the-box, yes? Not trying to get into the argument over the choice of default collective algorithm - just noting that there is a dimension of default placement to also consider.
In this context, latency is the time to completion measured globally across all participants (or see it as the time to completion on the slowest of the participants). I would not say they are doing well; it only looks like that because the other cases are doing really badly. Being twice as slow is not, by my standards, a great indication of success. Looking at the bcast, neither IMPI nor OMPI tuned is doing a reasonable job by default. In addition to all this, we should also take into account that:
The default algorithm selections were out of date and not performing well. After gathering data from OMPI developers, new default algorithm decisions were selected for:
allgather, allgatherv, allreduce, alltoall, alltoallv, barrier, bcast, gather, reduce, reduce_scatter_block, reduce_scatter, scatter
These results were gathered using the ompi-collectives-tuning package and then averaged amongst the results gathered from multiple OMPI developers on their clusters.
You can access the graphs and averaged data here: https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3
Signed-off-by: William Zhang <wilzhang@amazon.com>
* The new default fixed decision functions were generated based off of
* results that were gathered using the ompi-collectives-tuning package.
* These results were submitted by multiple OMPI developers on their clusters
* and were subsequently averaged to generate the algorithm switch points
Is it possible to provide more information about the tunings (e.g. the node count, instance type, provider, etc.)? That would make it clear that the tunings reflect the performance of common use cases.
I'd prefer it remain relatively anonymous; the systems the tests were run on differ greatly and belong to several different companies. The averaged information is all available in the Google Drive folder linked in the commit description.
OK I see, then this PR looks good to me.