
coll/acoll: Add support for MPI_Alltoall() #13046

Open · wants to merge 1 commit into base: main
Conversation

MithunMohanKadavil (Contributor)
This PR adds support for MPI_Alltoall() in the acoll collective component.

For messages smaller than a few KB, the algorithm operates by dividing the n ranks (n = comm size) into f groups (based on the value of rank % f), performing alltoall within the f groups in parallel, after which data is exchanged between groups of f adjacent ranks (starting from rank 0). For example, if f=2, this algorithm splits the n ranks into 2 groups, one containing all even-ranked (rank % 2 = 0) processes and another containing all odd-ranked (rank % 2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel for 4 or 8 groups respectively, followed by data exchange among groups of 4 or 8 adjacent ranks.
The diagram below captures this algorithm for the case where f=2 and n=8:

[Diagram: parallel-split alltoall for f=2, n=8]
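To make the two phases concrete, here is a hypothetical sketch of the parallel-split scheme written against plain MPI rather than the actual acoll internals. The function name `split_alltoall`, the use of MPI_Comm_split to form the subgroups, and the byte-sized payloads are illustrative assumptions only; acoll builds its subgroups internally and delegates to a base alltoall routine.

```c
/*
 * Hypothetical sketch of the parallel-split alltoall described above.
 * Rank r is viewed as (b, c) with r = b*f + c and c = r % f:
 *   phase 1: alltoall inside each of the f groups (same c), in parallel;
 *   phase 2: exchange among the f adjacent ranks of each block (same b).
 * Assumes the communicator size n is divisible by f; the payload is
 * `count` bytes per destination rank.
 */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void split_alltoall(const char *sbuf, char *rbuf, int count,
                           MPI_Comm comm, int f)
{
    int rank, n;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &n);

    const int c  = rank % f;  /* which of the f parallel groups     */
    const int b  = rank / f;  /* which block of f adjacent ranks    */
    const int nb = n / f;     /* ranks per group = number of blocks */

    MPI_Comm group_comm, block_comm;
    MPI_Comm_split(comm, c, b, &group_comm); /* f groups, run in parallel  */
    MPI_Comm_split(comm, b, c, &block_comm); /* blocks of f adjacent ranks */

    char *tmp  = malloc((size_t)n * count);
    char *pack = malloc((size_t)n * count);

    /* Phase 1: the data for the f contiguous ranks of block b' is already
     * contiguous in sbuf, so a plain alltoall over the group suffices. */
    MPI_Alltoall(sbuf, f * count, MPI_CHAR,
                 tmp,  f * count, MPI_CHAR, group_comm);

    /* Transpose so all items for block member j become contiguous:
     * tmp is [src_block][dest_in_block], pack is [dest_in_block][src_block]. */
    for (int bb = 0; bb < nb; bb++)
        for (int j = 0; j < f; j++)
            memcpy(pack + ((size_t)j * nb + bb) * count,
                   tmp  + ((size_t)bb * f + j) * count, count);

    /* Phase 2: exchange among the f adjacent ranks of this block. */
    MPI_Alltoall(pack, nb * count, MPI_CHAR,
                 tmp,  nb * count, MPI_CHAR, block_comm);

    /* Final reorder into source-rank order: source rank = bb*f + j. */
    for (int j = 0; j < f; j++)
        for (int bb = 0; bb < nb; bb++)
            memcpy(rbuf + ((size_t)bb * f + j) * count,
                   tmp  + ((size_t)j * nb + bb) * count, count);

    free(tmp);
    free(pack);
    MPI_Comm_free(&group_comm);
    MPI_Comm_free(&block_comm);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    const int f = 2, count = 8;  /* demo values; assumes n % f == 0 */
    char *sbuf = malloc((size_t)n * count);
    char *rbuf = malloc((size_t)n * count);
    memset(sbuf, rank, (size_t)n * count); /* tag outgoing data by sender */

    split_alltoall(sbuf, rbuf, count, MPI_COMM_WORLD, f);
    /* rbuf chunk i now holds bytes of value i, as MPI_Alltoall would give */

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```

With f=2 this reproduces the even/odd example above: phase 1 runs two parallel alltoalls over the even and odd groups, and phase 2 is the pairwise [0,1], [2,3], ... exchange. Creating the sub-communicators per call is only for illustration; a real implementation would set them up once and reuse them.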
For the larger message size range, a direct xpmem-based copy is used in a linear fashion across all ranks.
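A rough sketch of what such a linear xpmem path can look like is below; this is not the acoll code itself. The helper name, publishing segment ids with MPI_Allgather, and the per-peer attach/detach are assumptions for illustration; a production path would page-align the exposed regions, cache attachments across calls, and check every return value.

```c
/*
 * Hypothetical sketch of a linear xpmem-based alltoall for ranks on the
 * same node. Each rank exposes its send buffer via xpmem, and every rank
 * then directly copies, from each peer's sbuf, the slice addressed to it.
 * Real code must page-align the exposed region and cache attachments.
 */
#include <mpi.h>
#include <xpmem.h>
#include <stdlib.h>
#include <string.h>

void xpmem_linear_alltoall(char *sbuf, char *rbuf, size_t count,
                           MPI_Comm comm)
{
    int rank, n;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &n);

    /* Expose the local send buffer to the other ranks on the node. */
    xpmem_segid_t my_seg = xpmem_make(sbuf, (size_t)n * count,
                                      XPMEM_PERMIT_MODE, (void *)0666);

    /* Publish segment ids so each rank can attach every peer's sbuf. */
    xpmem_segid_t *segs = malloc(n * sizeof(*segs));
    MPI_Allgather(&my_seg, sizeof(my_seg), MPI_BYTE,
                  segs,    sizeof(my_seg), MPI_BYTE, comm);

    for (int src = 0; src < n; src++) {
        if (src == rank) {                 /* local part: plain memcpy */
            memcpy(rbuf + (size_t)rank * count,
                   sbuf + (size_t)rank * count, count);
            continue;
        }
        xpmem_apid_t apid = xpmem_get(segs[src], XPMEM_RDONLY,
                                      XPMEM_PERMIT_MODE, (void *)0666);
        struct xpmem_addr addr = { .apid = apid, .offset = 0 };
        char *peer_sbuf = xpmem_attach(addr, (size_t)n * count, NULL);

        /* Linear direct copy: the slice of src's sbuf addressed to me. */
        memcpy(rbuf + (size_t)src * count,
               peer_sbuf + (size_t)rank * count, count);

        xpmem_detach(peer_sbuf);
        xpmem_release(apid);
    }

    MPI_Barrier(comm);  /* peers may still be reading my segment */
    xpmem_remove(my_seg);
    free(segs);
}
```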

The graphs below show the variation in latency with osu-micro-benchmarks-7.3 for 96 and 192 ranks, comparing tuned and acoll:

[Graph: alltoall latency, 96 ranks, tuned vs. acoll]
[Graph: alltoall latency, 192 ranks, tuned vs. acoll]

- A new parallel-split algorithm for MPI_Alltoall is introduced as part
of the acoll collective component, primarily targeting smaller message
sizes (<= 4KB). The algorithm, at a high level, operates by dividing the
ranks into n groups, performing alltoall (using a base alltoall routine)
within the n groups in parallel, after which data is exchanged between
groups of n adjacent ranks (starting from rank 0). For example, if n=2,
this algorithm splits the ranks into 2 groups, one containing all
even-ranked processes and another containing all odd-ranked processes.
Alltoall is performed within these 2 groups in parallel, after which
each adjacent even-odd pair (pairs being [0,1], [2,3], ...) exchanges
data to complete the Alltoall operation. If n=4 or n=8, alltoall is
performed within 4 or 8 groups in parallel. Following this step, groups
of 4 or 8 adjacent ranks (starting from rank 0) exchange data among
themselves to complete the alltoall operation.
- Additionally, for intra-node cases, an xpmem-based linear algorithm
for MPI_Alltoall is added as part of acoll. When sbuf and rbuf can be
exposed via xpmem, the alltoall algorithm can be implemented as a linear
direct copy from the sbuf of all other ranks to the rbuf of a given
rank.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
@MithunMohanKadavil (Contributor, Author)

@edgargabriel @lrbison Please review.
