coll/tuned: Change the default collective algorithm selection #7730


Merged
1 commit merged on Jul 28, 2020

Conversation

wckzhang
Contributor

The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

allgather
allgatherv
allreduce
alltoall
alltoallv
barrier
bcast
gather
reduce
reduce_scatter_block
reduce_scatter
scatter

Signed-off-by: William Zhang wilzhang@amazon.com

@wckzhang
Contributor Author

The current iteration just uses if/else blocks. I originally had it using range switch cases, but that is non-standard, so I threw that out. In the future, I want to revamp the fixed code to work more like the dynamic code, in that it checks a data structure to avoid as many branches and checks.

It looks like there need to be some bug fixes in the dynamic code - there are no checks for non-commutative ops, which I would like to see.
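Something like the following is the shape I have in mind (purely an illustrative sketch, not code from this PR; the struct, function names, and cutoff values are made up for the example):

#include <stddef.h>
#include <stdint.h>

/* Illustrative table-driven fixed decision: the message-size cutoffs live in
 * a small array that is scanned, instead of a hard-coded if/else chain. */
struct fixed_rule {
    size_t msg_size_limit;   /* rule applies while total_dsize < limit */
    int    alg;              /* algorithm id to select */
};

static const struct fixed_rule example_allreduce_rules[] = {
    { 2048,     2 },
    { 16384,    1 },
    { SIZE_MAX, 3 },         /* catch-all for large messages */
};

static int example_pick_alg(const struct fixed_rule *rules, size_t nrules,
                            size_t total_dsize)
{
    for (size_t i = 0; i < nrules; i++) {
        if (total_dsize < rules[i].msg_size_limit) {
            return rules[i].alg;
        }
    }
    return rules[nrules - 1].alg;
}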

@wckzhang
Contributor Author

You can find the graphs and averaged data here:
https://drive.google.com/open?id=1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

@wckzhang
Contributor Author

Need to update the non-commutative/commutative code. Also need to check whether any algorithms must use a power of two.

@ompiteam-bot

Can one of the admins verify this patch?

@wckzhang
Contributor Author

Looks like there are no issues with non-power-of-two cases. There is a fallback available in all the algorithms currently in the tuned component.

@jjhursey
Member

ok to test

@wckzhang
Contributor Author

Rewrote the non-commutative checks; should be good to go now.

Member

@bosilca left a comment


I kind of snoozed during the discussion about how this performance data was gathered. So let me ask here: how were these processes placed on the nodes when the experiments were done?

if (total_dsize < 2048) {
alg = 2;
} else if (total_dsize < 16384) {
alg = 1;
Member


I understand there is data to back this up, but I just can't wrap my mind around the claim that an allreduce implemented as a reduce followed by a bcast can have better performance than a recursive doubling implementation. We are talking about at least twice the latency, so even if the data is small, when you have 256 processes this should count for something.
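(A rough latency-only sanity check of that intuition, assuming binomial trees for the reduce/bcast stages and the usual alpha-beta cost model; these are back-of-the-envelope estimates, not numbers from the PR data:

    T_{\mathrm{recursive\ doubling}} \approx \alpha \log_2 P
    T_{\mathrm{reduce+bcast}} \approx 2\,\alpha \log_2 P

i.e. for small messages the composed reduce + bcast pays roughly twice the latency term.)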

Contributor Author

@wckzhang May 15, 2020


There's only one size where it outperforms. From the data, it seems like recursive doubling doesn't do well until larger message sizes.

0 - current defaults
1 - linear
2 - non-overlapping
3 - recursive doubling

I can't say I understand why this is the case, but I have the data for it.

Allreduce
#Nnodes    Message_size 0                    1                    2                    3                    4                    5                    6   
128        4            2913.64(4878.65)     1344.74(3104.49)     1259.40(3147.54)     3230.74(5481.80)     3002.00(4954.51)     2857.97(4782.51)     2826.67(4517.93)     2913.64(4878.65)    
128        8            2899.49(4864.18)     1316.23(3038.82)     1184.93(2940.13)     3150.57(5384.97)     3071.26(5012.43)     2879.02(4781.92)     2811.67(4574.71)     2899.49(4864.18)    
128        16           2866.31(4821.69)     1312.63(3046.77)     1182.64(2946.13)     3097.51(5284.69)     2958.97(4890.04)     2872.01(4829.44)     2810.05(4509.48)     2866.31(4821.69)    
128        32           2914.33(4893.33)     1308.55(3047.86)     1188.79(2947.79)     3155.50(5369.73)     2982.11(4939.97)     2580.43(4431.11)     2748.29(4489.69)     2914.33(4893.33)    
128        64           2905.83(4882.22)     1316.48(3050.56)     1184.62(2949.98)     3047.34(5201.51)     3024.41(5004.69)     2202.25(4277.14)     2673.65(4290.02)     2905.83(4882.22)    
128        128          2915.41(4892.76)     1326.50(3045.12)     1182.40(2937.46)     2886.84(4847.97)     2928.42(4899.58)     2143.79(4226.11)     1998.89(3815.83)     2915.41(4892.76)    
128        256          2713.00(4612.54)     1334.70(3033.14)     1188.84(2936.60)     2867.93(4816.76)     3015.24(5060.37)     2184.14(4284.24)     2181.36(4051.38)     2713.00(4612.54)    
128        512          3149.83(6395.99)     1380.31(3043.26)     1187.99(2936.51)     3485.17(5908.56)     13889.51(20102.06)   10352.55(16418.76)   4567.20(9209.34)     3149.83(6395.99)    
128        1024         3152.68(6385.89)     1418.47(3044.78)     1205.74(2950.83)     3540.21(5987.08)     14348.81(21068.35)   9486.46(15085.37)    4549.55(9041.11)     3152.68(6385.89)    
128        2048         3200.28(6455.68)     1476.17(3038.72)     3643.84(11269.21)    2918.98(5652.02)     15048.90(22636.64)   9004.92(18176.30)    5113.47(10475.57)    3200.28(6455.68)    
128        4096         3295.32(6555.58)     1679.44(3024.81)     6412.64(21712.51)    2848.15(5647.26)     13569.57(22881.78)   8980.50(18174.79)    5142.45(10647.91)    3295.32(6555.58)    
128        8192         3338.42(6519.30)     2100.74(3086.32)     2568.41(6543.05)     3011.61(5837.56)     13309.64(21974.66)   9287.97(18097.78)    5775.48(11949.52)    3338.42(6519.30)    
128        16384        11371.09(19503.20)   3747.43(3197.73)     1222.98(2309.60)     3065.43(5546.04)     13403.42(21779.87)   9568.73(17933.32)    5752.29(11851.30)    11371.09(19503.20)  
128        32768        9273.18(18127.16)    8865.84(11979.62)    1805.81(2428.45)     3486.90(5929.51)     10867.88(18496.86)   9302.34(18080.09)    6026.58(12574.90)    9273.18(18127.16)   
128        65536        9094.06(18570.70)    15196.44(16023.66)   4304.25(3354.22)     4532.85(7355.35)     8893.91(17602.65)    9359.78(18103.13)    6373.04(13157.22)    9094.06(18570.70)   
128        131072       9248.30(18648.35)    21599.01(19537.89)   7298.66(4853.51)     6158.44(8748.26)     9256.05(18023.50)    10528.02(18082.28)   6604.47(13622.15)    9248.30(18648.35)   
128        262144       10061.86(18937.07)   41759.74(44516.10)   11518.40(10318.94)   10147.57(14956.15)   12548.72(21517.08)   11036.23(18010.11)   6933.89(13948.64)    10061.86(18937.07)  
128        524288       154569.99(411382.35) 66071.82(46407.64)   13104.86(16139.88)   17745.35(22907.08)   73804.77(209112.42)  169329.06(434459.12) 8801.50(17421.10)    154569.99(411382.35)
128        1048576      77388.83(217694.92)  126609.06(95737.94)  24363.44(33186.15)   33791.94(40007.01)   30265.35(68950.84)   102546.42(366719.82) 12353.21(21203.30)   77388.83(217694.92) 

Contributor Author


Oh, a small note: I am using logic similar to how the dynamic code works (for an unknown rank count, use the next smallest known rank count). The checks here select the previous smallest data value, so since we have data for 128 and 256, we use the 128-rank data for 128-255, the 256-rank data for 256-511, etc.

Contributor Author


There is probably a better way to assign the data to ranks than this method.

Could do some midpoint logic, i.e.:

If we have data for 32, 64, and 128 nodes, data from 64 nodes would represent [(32 + 64)/2, (128 + 64)/2), or [48, 96):

if (rank < 48) {
    use 32 rank data
} else if (rank < 96) {
    use 64 rank data
} else {
    use 128 rank data
}
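A rough C version of that midpoint lookup (illustrative only; it assumes the measured node counts are kept in a sorted array):

/* Pick the measured size whose midpoint interval contains comm_size.
 * Example: with measured = {32, 64, 128}, sizes below 48 map to 32,
 * 48-95 map to 64, and 96 and above map to 128. */
static int pick_measured_size(const int *measured, int n, int comm_size)
{
    int chosen = measured[0];
    for (int i = 0; i + 1 < n; i++) {
        int midpoint = (measured[i] + measured[i + 1]) / 2;
        if (comm_size >= midpoint) {
            chosen = measured[i + 1];
        }
    }
    return chosen;
}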

@wckzhang
Contributor Author

I kind of snoozed during the discussion about how this performance data was gathered. So let me ask here: how were these processes placed on the nodes when the experiments were done?

I didn't particularly ask for anyone to place processes with any particular pattern. People selected different ways to submit the jobs. One dataset I had utilized "--map-by ppr:$MAX_NUM_CORES:node --map-by core" and fully subscribed each node. However, most of the datasets I received just ran an mpi command like:

mpirun --np 4 --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allgather_algorithm 0 osu_allgather

@naughtont3
Contributor

One dataset I had utilized "--map-by ppr:$MAX_NUM_CORES:node --map-by core" and fully subscribed each node.

That's the data I submitted, and the binding was by core (--bind-to core); minor typo in the above comment.

@bosilca
Member

bosilca commented May 15, 2020

In this case I am afraid the new decision functions are based on a suboptimal use of the network, totally dependent on the process placement.

On the good side, we might now have an explanation for the discrepancy I noticed between linear and recursive doubling, as their communication schemes will be different: more across the inter-node network vs. more across the intra-node one. On the bad side, a different process placement will lead to drastically different collective performance results, and thus a wildly different decision logic.

So, let me propose something here. The new module I PR'ed a few days ago, #7735, can provide the logic to split a collective into inter-node and intra-node parts, and could use tuned for each of these. So, if we change the decision functions in tuned using one process per node, and then use shared for intra-node collectives, we could obtain a highly efficient, process-placement-agnostic collective framework. We will, however, need to redo these experiments with a single process per node. We can talk about this in more detail at the next call.

@wckzhang
Contributor Author

In this case I am afraid the new decision functions are based on a suboptimal use of the network, totally dependent on the process placement.

On the good side, we might now have an explanation for the discrepancy I noticed between linear and recursive doubling, as their communication schemes will be different: more across the inter-node network vs. more across the intra-node one. On the bad side, a different process placement will lead to drastically different collective performance results, and thus a wildly different decision logic.

So, let me propose something here. The new module I PR'ed a few days ago, #7735, can provide the logic to split a collective into inter-node and intra-node parts, and could use tuned for each of these. So, if we change the decision functions in tuned using one process per node, and then use shared for intra-node collectives, we could obtain a highly efficient, process-placement-agnostic collective framework. We will, however, need to redo these experiments with a single process per node. We can talk about this in more detail at the next call.

Hmm, it sounds feasible, but we'd have to redo data collection. The good news is it wouldn't take very long to do so if we're only doing 1 proc per node. We'd have to agree that the HAN component is going to be the new default component and that the tuned component will be for inter-node only. Otherwise, I can see some tuning issues arising if HAN isn't used and only tuned is.

Two questions I have are:

The HAN component doesn't seem to implement all our current collectives yet - is this going to be a "work in progress" kind of situation? What is going to be the behavior for the collectives that HAN doesn't support until we fill the gaps?

From a high level, can you confirm my understanding: we will use MPI_Comm_split_type(COMM_TYPE_SHARED), which will generate a local comm for each node in the cluster. Then we'll select one rank from each local comm and create an inter-node comm. Then, for intra-node calls, it'll use coll/shared and the intra-node comm. For inter-node calls, it'll use coll/tuned and the inter-node comm.
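For concreteness, here is a minimal sketch of that comm setup (my own illustration, not HAN's actual implementation; the variable names are made up):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Intra-node communicator: one per node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Inter-node communicator: only local rank 0 of each node joins;
     * every other process gets MPI_COMM_NULL. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* A hierarchical collective would then run its inter-node phase on
     * leader_comm (e.g. coll/tuned) and its intra-node phase on node_comm
     * (e.g. coll/shared). */

    if (MPI_COMM_NULL != leader_comm) {
        MPI_Comm_free(&leader_comm);
    }
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}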

@jladd-mlnx
Member

@wckzhang would you be able to present a short summary of the findings and new cutoffs on the Tuesday call? I would propose two or three PPT slides with a concise executive summary that is easy to digest.

@jladd-mlnx
Member

@awlauria @gpaulsen - Adding Geoff and Austen for visibility.

@wckzhang
Contributor Author

Hmm, I can try writing something up and sharing it on Tuesday. We might have to redo some things if we want to go with George's idea.

@jladd-mlnx
Member

@wckzhang just something super easy to understand for a broad audience; just one or two graphics that show where we were without your awesome feature, where we are now, and that it will only get better with more data collection.

@jladd-mlnx
Member

@bosilca I think it's too big an ask to have institutions go back and redo these experiments after they've generously donated so many cycles already. I think we need to carefully analyze the data we have.

@wckzhang
Contributor Author

Yeah sure, I can do that @jladd-mlnx

@bosilca
Member

bosilca commented May 25, 2020

As promised during last week's weekly call, here is the impact of process placement on collective communication algorithms that are agnostic to process placement. The results were gathered by Xi on a small cluster at UT (Intel, 16 nodes x 8 cores, IB 100), but the outcome remains consistent across platforms and networks. The labels correspond to the OMPI map-by process placement (except random, obviously).

Good news, we are not the only implementation having troubles with process placement, plus we have a solution to address it before the 5.0.

[attached graph: cpu_process_mapping_dancer]

@rhc54
Contributor

rhc54 commented May 26, 2020

Could you perhaps explain what "latency" means when talking about a collective operation? Just want to ensure I really understand the graphs.

FWIW: we automatically default to by-core mapping if num procs = 2, and by-socket mapping otherwise. I know IMPI does the same. So it would appear to me that IMPI and OMPI (both ADAPT and tuned) are doing pretty well out-of-the-box, yes?

Not trying to get into the argument over the choice of default collective algorithm - just noting that there is a dimension of default placement to also consider.

@bosilca
Member

bosilca commented May 26, 2020

In this context, latency is time to completion measured globally across all participants (or see it as the time to completion on the slowest of the participants).

I would not say they are doing well; it looks like that because the other cases are doing really badly. Being twice as slow is not, by my standards, a great indication of success. Looking at the bcast, neither IMPI nor OMPI tuned is doing a reasonable job by default. In addition to all this, we should also take into account that:

  • the difference you see here will only be exacerbated at scale, as the number of levels in the communication tree will keep increasing.
  • the number of cores in this machine (8) and the number of nodes (16) fit nicely into most of the communication trees (both binary and binomial). Most large-scale machines nowadays would not fit the same pattern.
  • most applications use collectives on subcommunicators, so our initial process placement is irrelevant.

@wckzhang force-pushed the newdefaults branch 2 times, most recently from a2867a3 to c39165a on July 6, 2020 17:55
@wckzhang force-pushed the newdefaults branch 2 times, most recently from a08a67c to b3d2b20 on July 28, 2020 16:05
The default algorithm selections were out of date and not performing
well. After gathering data from OMPI developers, new default algorithm
decisions were selected for:

    allgather
    allgatherv
    allreduce
    alltoall
    alltoallv
    barrier
    bcast
    gather
    reduce
    reduce_scatter_block
    reduce_scatter
    scatter

These results were gathered using the ompi-collectives-tuning package
and then averaged amongst the results gathered from multiple OMPI
developers on their clusters.

You can access the graphs and averaged data here:
https://drive.google.com/drive/folders/1MV5E9gN-5tootoWoh62aoXmN0jiWiqh3

Signed-off-by: William Zhang <wilzhang@amazon.com>
* The new default fixed decision functions were generated based off of
* results that were gathered using the ompi-collectives-tuning package.
* These results were submitted by multiple OMPI developers on their clusters
* and were subsequently averaged to generate the algorithm switch points
Contributor


Is it possible to provide more information about the tunings (e.g. the node count, instance type, provider, etc.)? That would make it clear that the tunings reflect the performance of common user cases.

Contributor Author

@wckzhang Jul 28, 2020


I'd prefer it remain relatively anonymous; the systems the tests were run on differ greatly and belong to several different companies. The averaged information is all available in the Google Drive folder linked in the commit description.

Contributor


OK I see, then this PR looks good to me.
