
coll/acoll: Add support for MPI_Alltoall() #13046

Merged
1 commit merged into open-mpi:main on Apr 9, 2025

Conversation

MithunMohanKadavil (Contributor)

This PR adds support for MPI_Alltoall() in the acoll collective component.

For messages smaller than a few KB, the algorithm divides the n ranks (n = comm size) into f groups based on the value of rank % f, performs alltoall within the f groups in parallel, and then exchanges data between groups of f adjacent ranks (starting from rank 0). For example, if f=2, the algorithm splits the n ranks into 2 groups, one containing all even-ranked (rank%2 = 0) processes and the other containing all odd-ranked (rank%2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel across 4 or 8 groups respectively, followed by data exchange among groups of 4 or 8 adjacent ranks.

[Diagram: the parallel-split algorithm for the case f=2, n=8]

For the larger message size range, direct XPMEM-based copy is used in a linear fashion across all ranks.
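A minimal standalone sketch of the small-message scheme described above, for f=2 and MPI_INT payloads, is shown below. It is written against plain MPI calls (MPI_Comm_split, MPI_Alltoall, MPI_Sendrecv) rather than the acoll internals; the function name, buffer layout, and packing scheme are illustrative assumptions, not the component's actual code, and the communicator size is assumed to be even.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parallel-split alltoall sketch for split factor f = 2 and MPI_INT payloads.
 * 'count' is the number of ints each rank sends to every other rank; the
 * communicator size is assumed to be even. */
static void split_alltoall_f2(const int *sendbuf, int *recvbuf, int count,
                              MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parity  = rank % 2;              /* which parity group I belong to */
    int half    = size / 2;              /* ranks per parity group         */
    int partner = (parity == 0) ? rank + 1 : rank - 1;

    MPI_Comm group;
    MPI_Comm_split(comm, parity, rank, &group);

    /* Phase 1: alltoall inside the parity group.  Group member h ends up
     * with the blocks destined for the adjacent pair [2h, 2h+1]; because
     * sendbuf is ordered by destination rank, those two blocks are already
     * contiguous, so the intra-group send count is simply 2*count. */
    int *mid = malloc((size_t)size * count * sizeof(int));
    MPI_Alltoall(sendbuf, 2 * count, MPI_INT, mid, 2 * count, MPI_INT, group);

    /* mid now holds 'half' chunks of 2*count ints: in chunk h (from group
     * member h), the first count ints are destined for the even rank of my
     * pair and the second count ints for the odd rank. */

    /* Phase 2: hand my adjacent partner the halves that belong to it and
     * collect the halves that belong to me. */
    int keep_off = (parity == 0) ? 0 : count;  /* my half of each chunk        */
    int give_off = (parity == 0) ? count : 0;  /* partner's half of each chunk */
    int *pack_out = malloc((size_t)half * count * sizeof(int));
    int *pack_in  = malloc((size_t)half * count * sizeof(int));
    for (int h = 0; h < half; h++) {
        memcpy(pack_out + h * count, mid + h * 2 * count + give_off,
               (size_t)count * sizeof(int));
    }
    MPI_Sendrecv(pack_out, half * count, MPI_INT, partner, 0,
                 pack_in,  half * count, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Assemble recvbuf so that block s holds the data sent by rank s:
     * sources in my own parity group arrived in phase 1, sources in the
     * other group arrived via my partner in phase 2. */
    for (int h = 0; h < half; h++) {
        int src_same  = 2 * h + parity;        /* source in my group        */
        int src_other = 2 * h + (1 - parity);  /* source in the other group */
        memcpy(recvbuf + src_same * count, mid + h * 2 * count + keep_off,
               (size_t)count * sizeof(int));
        memcpy(recvbuf + src_other * count, pack_in + h * count,
               (size_t)count * sizeof(int));
    }

    free(pack_in);
    free(pack_out);
    free(mid);
    MPI_Comm_free(&group);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with an even -n */

    const int count = 4;                    /* ints per destination */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int d = 0; d < size; d++)
        for (int i = 0; i < count; i++)
            sendbuf[d * count + i] = rank * 1000 + d * count + i;

    split_alltoall_f2(sendbuf, recvbuf, count, MPI_COMM_WORLD);

    /* Block s of recvbuf should equal what rank s packed for this rank. */
    int errors = 0;
    for (int s = 0; s < size; s++)
        for (int i = 0; i < count; i++)
            if (recvbuf[s * count + i] != s * 1000 + rank * count + i)
                errors++;
    printf("rank %d: %s\n", rank, errors ? "MISMATCH" : "ok");

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Phase 1 can reuse the caller's send buffer directly because the two blocks destined for an adjacent [even, odd] pair are already contiguous, and phase 2 is a single pairwise exchange; generalizing to f=4 or f=8 replaces that pairwise step with an exchange among the f adjacent ranks, as described above.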

The graphs below show the variation in latencies with osu-micro-benchmarks-7.3 for 96 and 192 ranks, comparing tuned and acoll:

[Latency graphs: 96 ranks and 192 ranks, tuned vs. acoll]

@MithunMohanKadavil (Contributor, Author)

@edgargabriel @lrbison Please review.

@janjust requested a review from lrbison on February 11, 2025 at 16:11.
@lrbison (Contributor) left a comment:

My biggest concern is that I don't think this will be correct for derived data types, and I didn't see a check for whether the buffers are on a GPU (maybe that's handled at the acoll module level?).

I struggled with these issues myself in my alltoallv implementation, and made what I consider to be a very exhaustive validation tester. It was trivial to extend it from alltoallv to also test alltoall. Would you mind giving it a try? open-mpi/ompi-tests-public#32

Let me also do some testing; in particular, I'm curious what would happen on a Graviton layout, which can have as many as 96 cores at the same L3 distance.

Comment on lines +48 to +55
if (comm_size <= 8) {
    if (total_dsize <= 128) {
        (*sync_enable) = true;
    } else {
        (*sync_enable) = false;
    }
    (*split_factor) = 2;
} else if (comm_size <= 16) {
A Contributor commented:

Ripe for a tuning file in the future...

@lrbison (Contributor) commented on Feb 14, 2025:

I tried my validator as follows and I see a validation error:

mpirun --mca coll_acoll_priority 100 -n 4 ./src/alltoallv_ddt -A alltoall -v 2

Looks like I can trigger it on specific tests: --only 2,2, --only 2,3, and --only 2,4 all show failures.

If you dig into the verbose prints you can see:

--- Starting test 2,2.  Crossing 0 x 1
Created span from 0:4.  Data from 0:4
Created span from 0:48.  Data from 0:48
Datatype      (send,recv) extents (4,48), size (4,48), and lb (0,0)
Datatype TRUE (send,recv) extents (4,48), size (4,48), and lb (0,0)
<nasty assertion fail abort ... >

The "Crossing 0 x 1" refers to send type and receive type respectively in this switch statement
So test 2,2 sends 12 4-byte MPI_INTs, and receives 1 DDT which is a MPI_Type_contiguous vector of 12 integers. The following tests make more and more strange DDTs for checking.
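For concreteness, here is a small self-contained reproduction of that send/recv type combination, written independently of the validator; the buffer contents and verification loop are illustrative assumptions, not the tester's actual code. Each rank sends 12 MPI_INTs per peer and receives one committed MPI_Type_contiguous datatype of 12 ints per peer, so the type signatures match even though the handles differ.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Receive side: one derived datatype covering 12 contiguous ints
     * (size and extent 48 bytes with 4-byte ints, as in the verbose
     * output above). */
    MPI_Datatype recv_ddt;
    MPI_Type_contiguous(12, MPI_INT, &recv_ddt);
    MPI_Type_commit(&recv_ddt);

    int *sendbuf = malloc((size_t)size * 12 * sizeof(int));
    int *recvbuf = malloc((size_t)size * 12 * sizeof(int));
    for (int i = 0; i < size * 12; i++) {
        sendbuf[i] = rank * 1000 + i;
        recvbuf[i] = -1;
    }

    /* Different type handles, identical type signatures: 12 MPI_INTs go
     * out per peer, one recv_ddt (12 ints) comes in per peer. */
    MPI_Alltoall(sendbuf, 12, MPI_INT, recvbuf, 1, recv_ddt, MPI_COMM_WORLD);

    /* Block j of recvbuf should hold the 12 ints that rank j sent to us. */
    int errors = 0;
    for (int j = 0; j < size; j++)
        for (int i = 0; i < 12; i++)
            if (recvbuf[j * 12 + i] != j * 1000 + rank * 12 + i)
                errors++;
    printf("rank %d: %s\n", rank, errors ? "MISMATCH" : "ok");

    MPI_Type_free(&recv_ddt);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return errors != 0;
}

Run with a handful of ranks (for example, mpirun -n 4 ./a.out); every rank should print ok when the underlying alltoall implementation handles derived receive datatypes correctly.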

@MithunMohanKadavil (Contributor, Author)

Hi @lrbison, thanks for sharing the tests. We are validating this code using the shared tests and will soon push an update fixing the issues and addressing the comments.

    (*sync_enable) = false;
    (*split_factor) = 2;
} else {
    (*sync_enable) = true;
A Member commented:

I would have assumed that the ring algorithm would be the best choice for large communicators and most data sizes. How did you build this decision function?

The Contributor (Author) replied:

On our hardware, we observed the linear variants performing better at lower node counts. For higher node counts it still needs to be tuned, and at that point this decision function will be updated.

@lrbison (Contributor) commented on Feb 27, 2025:

@MithunMohanKadavil Rather than requiring a code update to change your customers' tuning, would you take a look at this PR and see whether it has the features required to support a tuning file format for acoll? I haven't looked at exactly what tuning tree you use, but if you could sketch out how you would lay out the file structure, maybe we can make it compatible with tuned and move towards a generic tuning file format, with at least the outer structure shared across multiple collective components.

The Contributor (Author) replied:

Hi @lrbison, the framework in the PR does have the features required to support the tuning file. However, we have yet to decide on the file structure, that is, which parameters will be tunable and how it can be applied across the algorithms in acoll. Additionally, more tunable parameters may come up for us, so the tuning file structure still needs to be vetted on our side.

The Contributor replied:

> Additionally more tunable parameters

That is something we expect to address with the new JSON-format tuning. It should be easier to add a new parameter without breaking all the parsers.

@MithunMohanKadavil (Contributor, Author)

bot:retest

@jsquyres (Member)

@MithunMohanKadavil Can you rebase instead of merge? We tend to prefer that. Thanks!

@MithunMohanKadavil (Contributor, Author) commented on Mar 19, 2025:

@jsquyres Sorry, my bad. I will correct this from next time onward.

@jsquyres (Member)

@MithunMohanKadavil No worries. No need to close this PR -- you can just remove the merge commit and rebase.

@MithunMohanKadavil (Contributor, Author) commented on Mar 20, 2025:

@jsquyres Removed merge commit and updated the PR, thanks.

@jsquyres (Member)

> @jsquyres Removed merge commit and updated the PR, thanks.

Thank you!

Hello! The Git Commit Checker CI bot found a few problems with this PR:

ac77b0c: Using ompi_datatype_create_vector for sending non-...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Commit message:

A new parallel-split algorithm for MPI_Alltoall is introduced as part
of the acoll collective component, primarily targeting smaller message
sizes (<= 4KB). At a high level, the algorithm operates by dividing the
ranks into n groups, performing alltoall (using a base alltoall routine)
within the n groups in parallel, after which data is exchanged between
groups of n adjacent ranks (starting from rank 0). For example, if n=2,
this algorithm splits the ranks into 2 groups, one containing all
even-ranked processes and the other containing all odd-ranked processes.
Alltoall is performed within these 2 groups in parallel, after which
each adjacent even-odd pair (pairs being [0,1], [2,3], ...) exchanges
data to complete the Alltoall operation. If n=4 or n=8, alltoall is
performed within 4 or 8 groups in parallel. Following this step, groups
of 4 or 8 adjacent ranks (starting from 0) exchange data among
themselves to complete the alltoall operation.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
MithunMohanKadavil force-pushed the acoll_psplit_alltoall branch 2 times, most recently from ab41ae7 to 8686080, on April 9, 2025 at 11:22.
mshanthagit merged commit 42e302f into open-mpi:main on Apr 9, 2025.
15 checks passed