coll/tuned: Change the default collective algorithm selection #7730
Current iteration just uses if/else blocks. I originally had it using range switch cases, but that is non-standard so I threw it out. In the future, I want to revamp the fixed code to work more like the dynamic code, in that it checks a data structure (sketched below) to avoid as many branches and checks. It also looks like the dynamic code needs some bugfixes - there are no checks for non-commutative ops, which I would like to see added.
You can find the graphs and averaged data here: https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3
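For illustration, here is a minimal sketch of that table-driven direction; the struct, the catch-all entry, and all names are hypothetical, not the actual coll/tuned code (the two cutoff values are taken from the allreduce hunk quoted further down):

#include <stddef.h>

/* Hypothetical sketch only: each entry maps a message-size upper bound to an
 * algorithm id, so the fixed decision becomes a table walk instead of nested
 * if/else blocks. */
struct size_cutoff {
    size_t max_dsize;   /* entry applies while total_dsize < max_dsize */
    int    alg;         /* algorithm id to use below that size */
};

static const struct size_cutoff allreduce_cutoffs[] = {
    { 2048,       2 },
    { 16384,      1 },
    { (size_t)-1, 3 },  /* catch-all for larger messages (value made up) */
};

static int pick_alg(size_t total_dsize)
{
    const size_t n = sizeof(allreduce_cutoffs) / sizeof(allreduce_cutoffs[0]);
    for (size_t i = 0; i + 1 < n; ++i) {
        if (total_dsize < allreduce_cutoffs[i].max_dsize) {
            return allreduce_cutoffs[i].alg;
        }
    }
    return allreduce_cutoffs[n - 1].alg;  /* fall back to the last entry */
}

The point is only that the switch points live in one data structure, closer in spirit to how the dynamic code consults its rules, instead of being spread across branches.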
Need to update the non-commutative/commutative code. Also need to check whether any algorithms require power-of-two process counts.
Can one of the admins verify this patch?
Looks like there are no issues with non-power-of-two cases. There is a fallback available in all the algorithms currently in the tuned component.
ok to test
Rewrote the non-commutative checks, should be good to go now.
I kind of snoozed during the discussion about how this performance data was gathered. So let me ask here: how were these processes placed on the nodes when the experiments were done?
if (total_dsize < 2048) {
    alg = 2;
} else if (total_dsize < 16384) {
    alg = 1;
I understand there is data to back this up, but I just can't wrap my mind around the claim that an allreduce implemented as a reduce followed by a bcast can have better performance than a recursive doubling implementation. We are talking about at least twice the latency, so even if the data is small, when you have 256 processes this should count for something.
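For context, the "twice the latency" point follows from the usual alpha-beta cost model (a sketch, assuming binomial-tree reduce and bcast and textbook recursive doubling; bandwidth and reduction terms omitted):

$$T_{\mathrm{reduce+bcast}} \approx 2\,\lceil \log_2 p \rceil\,\alpha + \dots, \qquad T_{\mathrm{recursive\ doubling}} \approx \lceil \log_2 p \rceil\,\alpha + \dots$$

where $p$ is the number of processes and $\alpha$ the per-message latency; for small messages the $\alpha$ terms dominate, which is what makes the measured results surprising.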
There's only one size where it outperforms. From the data, it seems like recursive doubling doesn't do well until larger message sizes.
The column numbers in the table below correspond to:
0 - current defaults
1 - linear
2 - non-overlapping
3 - recursive doubling
I can't say I understand why this is the case, but I have the data for it.
Allreduce
#Nodes Message_size 0 1 2 3 4 5 6
128 4 2913.64(4878.65) 1344.74(3104.49) 1259.40(3147.54) 3230.74(5481.80) 3002.00(4954.51) 2857.97(4782.51) 2826.67(4517.93)
128 8 2899.49(4864.18) 1316.23(3038.82) 1184.93(2940.13) 3150.57(5384.97) 3071.26(5012.43) 2879.02(4781.92) 2811.67(4574.71)
128 16 2866.31(4821.69) 1312.63(3046.77) 1182.64(2946.13) 3097.51(5284.69) 2958.97(4890.04) 2872.01(4829.44) 2810.05(4509.48)
128 32 2914.33(4893.33) 1308.55(3047.86) 1188.79(2947.79) 3155.50(5369.73) 2982.11(4939.97) 2580.43(4431.11) 2748.29(4489.69)
128 64 2905.83(4882.22) 1316.48(3050.56) 1184.62(2949.98) 3047.34(5201.51) 3024.41(5004.69) 2202.25(4277.14) 2673.65(4290.02)
128 128 2915.41(4892.76) 1326.50(3045.12) 1182.40(2937.46) 2886.84(4847.97) 2928.42(4899.58) 2143.79(4226.11) 1998.89(3815.83)
128 256 2713.00(4612.54) 1334.70(3033.14) 1188.84(2936.60) 2867.93(4816.76) 3015.24(5060.37) 2184.14(4284.24) 2181.36(4051.38)
128 512 3149.83(6395.99) 1380.31(3043.26) 1187.99(2936.51) 3485.17(5908.56) 13889.51(20102.06) 10352.55(16418.76) 4567.20(9209.34)
128 1024 3152.68(6385.89) 1418.47(3044.78) 1205.74(2950.83) 3540.21(5987.08) 14348.81(21068.35) 9486.46(15085.37) 4549.55(9041.11)
128 2048 3200.28(6455.68) 1476.17(3038.72) 3643.84(11269.21) 2918.98(5652.02) 15048.90(22636.64) 9004.92(18176.30) 5113.47(10475.57)
128 4096 3295.32(6555.58) 1679.44(3024.81) 6412.64(21712.51) 2848.15(5647.26) 13569.57(22881.78) 8980.50(18174.79) 5142.45(10647.91)
128 8192 3338.42(6519.30) 2100.74(3086.32) 2568.41(6543.05) 3011.61(5837.56) 13309.64(21974.66) 9287.97(18097.78) 5775.48(11949.52)
128 16384 11371.09(19503.20) 3747.43(3197.73) 1222.98(2309.60) 3065.43(5546.04) 13403.42(21779.87) 9568.73(17933.32) 5752.29(11851.30)
128 32768 9273.18(18127.16) 8865.84(11979.62) 1805.81(2428.45) 3486.90(5929.51) 10867.88(18496.86) 9302.34(18080.09) 6026.58(12574.90)
128 65536 9094.06(18570.70) 15196.44(16023.66) 4304.25(3354.22) 4532.85(7355.35) 8893.91(17602.65) 9359.78(18103.13) 6373.04(13157.22)
128 131072 9248.30(18648.35) 21599.01(19537.89) 7298.66(4853.51) 6158.44(8748.26) 9256.05(18023.50) 10528.02(18082.28) 6604.47(13622.15)
128 262144 10061.86(18937.07) 41759.74(44516.10) 11518.40(10318.94) 10147.57(14956.15) 12548.72(21517.08) 11036.23(18010.11) 6933.89(13948.64)
128 524288 154569.99(411382.35) 66071.82(46407.64) 13104.86(16139.88) 17745.35(22907.08) 73804.77(209112.42) 169329.06(434459.12) 8801.50(17421.10)
128 1048576 77388.83(217694.92) 126609.06(95737.94) 24363.44(33186.15) 33791.94(40007.01) 30265.35(68950.84) 102546.42(366719.82) 12353.21(21203.30)
Oh, a small note: I am using similar logic to how the dynamic code works (for an unknown rank count, use the next smallest known rank count). The checks here select the previous smallest data value, so since we have data for 128 and 256, we use the 128-rank data for 128-255 ranks, the 256-rank data for 256-511, etc.
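A rough sketch of that round-down selection, with made-up measured sizes and names (not the actual code):

static const int measured_comm_sizes[] = { 16, 32, 64, 128, 256 };

/* Pick the largest communicator size we have data for that does not exceed
 * the actual communicator size, e.g. 128..255 ranks -> 128, 256..511 -> 256. */
static int round_down_to_measured(int comm_size)
{
    const int n = (int)(sizeof(measured_comm_sizes) / sizeof(measured_comm_sizes[0]));
    int chosen = measured_comm_sizes[0];
    for (int i = 0; i < n; ++i) {
        if (measured_comm_sizes[i] <= comm_size) {
            chosen = measured_comm_sizes[i];
        }
    }
    return chosen;
}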
There is probably a better way to assign the data to ranks than this method. We could do some midpoint logic, i.e. if we have data for 32, 64, and 128 nodes, the data from 64 nodes would represent [(32 + 64)/2, (128 + 64)/2), or [48, 96); a generalized sketch follows after this snippet:

if (rank < 48) {
    /* use the 32-rank data */
} else if (rank < 96) {
    /* use the 64-rank data */
} else {
    /* use the 128-rank data */
}
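A generalized sketch of that midpoint idea (again, illustrative names and sizes only):

static const int measured_sizes[] = { 32, 64, 128 };

/* Map the actual communicator size to the nearest measured size, using the
 * midpoints between neighbouring measurements as the cut lines, e.g. the
 * 64-node data covers [48, 96). */
static int closest_measured(int comm_size)
{
    const int n = (int)(sizeof(measured_sizes) / sizeof(measured_sizes[0]));
    for (int i = 0; i + 1 < n; ++i) {
        int midpoint = (measured_sizes[i] + measured_sizes[i + 1]) / 2;
        if (comm_size < midpoint) {
            return measured_sizes[i];
        }
    }
    return measured_sizes[n - 1];
}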
I didn't particularly ask for anyone to place processes with any particular pattern; people selected different ways to submit the jobs. One dataset I had used "--map-by ppr:$MAX_NUM_CORES:node --map-by core" and fully subscribed each node. However, most of the datasets I received were generated with an ordinary mpirun command.
That's the data I submitted, and the binding was by core.
In this case I am afraid the new decision functions are based on a suboptimal use of the network, totally dependent on the process placement. On the good side, we might now have an explanation for the discrepancy I noticed between linear and recursive doubling: their communication schemes differ, one going more across the inter-node network and the other more across the intra-node one. On the bad side, a different process placement will lead to drastically different collective performance results, and thus a wildly different decision logic. So, let me propose something here. The new module I PR'ed a few days ago, #7735, can provide the logic to split a collective into inter-node and intra-node parts, and could use tuned for each of these. So, if we regenerate the decision functions in tuned using one process per node, and then use shared for the intra-node collectives, we could obtain a highly efficient, process-placement-agnostic collective framework. We would however need to redo these experiments with a single process per node. We can talk about this in more detail at the next call.
Hmm, it sounds feasible, but we'd have to redo data collection. The good news is it wouldn't take very long if we're only doing 1 proc per node. We'd have to agree that the HAN component is going to be the new default component and that the tuned component will be for inter-node only; otherwise I can see some tuning issues arising if HAN isn't used and only tuned is. Two questions I have are:
1. The HAN component doesn't seem to implement all our current collectives yet - is this going to be a "work in progress" kind of situation, and what is going to be the behavior for the collectives that HAN doesn't support until we fill the gaps?
2. From a high level, can you confirm my understanding: we will use MPI_Comm_split_type(COMM_TYPE_SHARED), which will generate a local comm for each node in the cluster. Then we'll select one rank from each local comm and create an inter-node comm. For intra-node calls, it'll use coll/shared and the intra-node comm; for inter-node calls, it'll use coll/tuned and the inter-node comm (see the sketch below).
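A rough sketch of that communicator construction as described above (not HAN's or coll/shared's actual code; the function and variable names are made up):

#include <mpi.h>

/* Build one shared-memory communicator per node plus an inter-node
 * communicator containing one leader rank per node. */
static void build_hierarchy(MPI_Comm comm, MPI_Comm *local, MPI_Comm *leaders)
{
    int world_rank, local_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Ranks that share a node land in the same local communicator. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, local);
    MPI_Comm_rank(*local, &local_rank);

    /* Rank 0 of each local communicator joins the inter-node communicator;
     * all other ranks get MPI_COMM_NULL. */
    MPI_Comm_split(comm, local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, leaders);
}

Intra-node collectives would then run on the local communicator (coll/shared) and inter-node collectives on the leader communicator (coll/tuned), which is what would make the tuned cutoffs placement-agnostic.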
@wckzhang would you be able to present a short summary of the findings and new cutoffs on the Tuesday call? I would propose two or three PPT slides with a concise executive summary that is easy to digest.
Hmm, I can try writing something up and sharing it on Tuesday. We might have to redo some things if we want to go with George's idea.
@wckzhang just something super easy to understand for a broad audience; just one or two graphics that show "here is where we were without your awesome feature, here is where we are now, and it will only get better with more data collection."
@bosilca I think it's too big an ask to have institutions go back and redo these experiments after they've generously donated so many cycles already. I think we need to carefully analyze the data we have.
Yeah sure, I can do that @jladd-mlnx.
As promised during last week's call, here is the impact of process placement on collective communication algorithms that are supposed to be agnostic to it. The results were gathered by Xi on a small cluster at UT (Intel 16x8 cores, IB 100), but the outcome remains consistent across platforms and networks. The labels correspond to the OMPI map-by process placement (except random, obviously). Good news: we are not the only implementation having trouble with process placement, plus we have a solution to address it before 5.0.
Could you perhaps explain what "latency" means when talking about a collective operation? Just want to ensure I really understand the graphs. FWIW: we automatically default to by-core mapping if num procs = 2, and by-socket mapping otherwise. I know IMPI does the same. So it would appear to me that IMPI and OMPI (both ADAPT and tuned) are doing pretty well out-of-the-box, yes? Not trying to get into the argument over the choice of default collective algorithm - just noting that there is a dimension of default placement to also consider.
In this context, latency is the time to completion measured globally across all participants (or see it as the time to completion on the slowest of the participants). I would not say they are doing well; it only looks like that because the other cases are doing really badly. Being twice as slow is not, by my standards, a great indication of success. Looking at the bcast, neither IMPI nor OMPI tuned is doing a reasonable job by default. In addition to all this, we should also take into account that:
The default algorithm selections were out of date and not performing well. After gathering data from OMPI developers, new default algorithm decisions were selected for:
allgather, allgatherv, allreduce, alltoall, alltoallv, barrier, bcast, gather, reduce, reduce_scatter_block, reduce_scatter, scatter
These results were gathered using the ompi-collectives-tuning package and then averaged amongst the results gathered from multiple OMPI developers on their clusters.
You can access the graphs and averaged data here: https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3
Signed-off-by: William Zhang <wilzhang@amazon.com>
* The new default fixed decision functions were generated based off of
* results that were gathered using the ompi-collectives-tuning package.
* These results were submitted by multiple OMPI developers on their clusters
* and were subsequently averaged to generate the algorithm switch points
Is it possible to provide more information about the tunings (e.g. the node count, instance type, provider, etc.)? That would make it clear that the tunings reflect the performance of common use cases.
I'd prefer it remain relatively anonymous; the systems the tests were run on differ greatly and belong to several different companies. The averaged information is all available in the Google Drive folder linked in the commit description.
OK I see, then this PR looks good to me.