
coll/acoll: Add support for MPI_Alltoall() #13046

Merged
1 commit merged into open-mpi:main on Apr 9, 2025

Conversation

MithunMohanKadavil (Contributor)

This PR adds support for MPI_Alltoall() in the acoll collective component.

For messages smaller than a few KB, the algorithm divides the n ranks (n = comm size) into f groups based on the value of rank % f, performs alltoall within the f groups in parallel, and then exchanges data between groups of f adjacent ranks (starting from rank 0). For example, if f=2, the algorithm splits the n ranks into 2 groups, one containing all even-ranked (rank%2 = 0) processes and the other containing all odd-ranked (rank%2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel across 4 or 8 groups respectively, followed by data exchange among groups of 4 or 8 adjacent ranks.

[Diagram: the parallel-split algorithm for the case f=2, n=8]

For the larger message size range, direct XPMEM-based copy is used in a linear fashion across all ranks.
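A minimal standalone sketch of the small-message scheme described above, for f=2 and MPI_INT payloads, is shown below. It is written against plain MPI calls (MPI_Comm_split, MPI_Alltoall, MPI_Sendrecv) rather than the acoll internals; the function name, buffer layout, and packing scheme are illustrative assumptions, not the component's actual code, and the communicator size is assumed to be even.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parallel-split alltoall sketch for split factor f = 2 and MPI_INT payloads.
 * 'count' is the number of ints each rank sends to every other rank; the
 * communicator size is assumed to be even. */
static void split_alltoall_f2(const int *sendbuf, int *recvbuf, int count,
                              MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parity  = rank % 2;              /* which parity group I belong to */
    int half    = size / 2;              /* ranks per parity group         */
    int partner = (parity == 0) ? rank + 1 : rank - 1;

    MPI_Comm group;
    MPI_Comm_split(comm, parity, rank, &group);

    /* Phase 1: alltoall inside the parity group.  Group member h ends up
     * with the blocks destined for the adjacent pair [2h, 2h+1]; because
     * sendbuf is ordered by destination rank, those two blocks are already
     * contiguous, so the intra-group send count is simply 2*count. */
    int *mid = malloc((size_t)size * count * sizeof(int));
    MPI_Alltoall(sendbuf, 2 * count, MPI_INT, mid, 2 * count, MPI_INT, group);

    /* mid now holds 'half' chunks of 2*count ints: in chunk h (from group
     * member h), the first count ints are destined for the even rank of my
     * pair and the second count ints for the odd rank. */

    /* Phase 2: hand my adjacent partner the halves that belong to it and
     * collect the halves that belong to me. */
    int keep_off = (parity == 0) ? 0 : count;  /* my half of each chunk        */
    int give_off = (parity == 0) ? count : 0;  /* partner's half of each chunk */
    int *pack_out = malloc((size_t)half * count * sizeof(int));
    int *pack_in  = malloc((size_t)half * count * sizeof(int));
    for (int h = 0; h < half; h++) {
        memcpy(pack_out + h * count, mid + h * 2 * count + give_off,
               (size_t)count * sizeof(int));
    }
    MPI_Sendrecv(pack_out, half * count, MPI_INT, partner, 0,
                 pack_in,  half * count, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Assemble recvbuf so that block s holds the data sent by rank s:
     * sources in my own parity group arrived in phase 1, sources in the
     * other group arrived via my partner in phase 2. */
    for (int h = 0; h < half; h++) {
        int src_same  = 2 * h + parity;        /* source in my group        */
        int src_other = 2 * h + (1 - parity);  /* source in the other group */
        memcpy(recvbuf + src_same * count, mid + h * 2 * count + keep_off,
               (size_t)count * sizeof(int));
        memcpy(recvbuf + src_other * count, pack_in + h * count,
               (size_t)count * sizeof(int));
    }

    free(pack_in);
    free(pack_out);
    free(mid);
    MPI_Comm_free(&group);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with an even -n */

    const int count = 4;                    /* ints per destination */
    int *sendbuf = malloc((size_t)size * count * sizeof(int));
    int *recvbuf = malloc((size_t)size * count * sizeof(int));
    for (int d = 0; d < size; d++)
        for (int i = 0; i < count; i++)
            sendbuf[d * count + i] = rank * 1000 + d * count + i;

    split_alltoall_f2(sendbuf, recvbuf, count, MPI_COMM_WORLD);

    /* Block s of recvbuf should equal what rank s packed for this rank. */
    int errors = 0;
    for (int s = 0; s < size; s++)
        for (int i = 0; i < count; i++)
            if (recvbuf[s * count + i] != s * 1000 + rank * count + i)
                errors++;
    printf("rank %d: %s\n", rank, errors ? "MISMATCH" : "ok");

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Phase 1 can reuse the caller's send buffer directly because the two blocks destined for an adjacent [even, odd] pair are already contiguous, and phase 2 is a single pairwise exchange; generalizing to f=4 or f=8 replaces that pairwise step with an exchange among the f adjacent ranks, as described above.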

The graphs below show the variation in latencies with osu-micro-benchmarks-7.3 for 96 and 192 ranks, comparing tuned and acoll:

[Latency graphs: 96 ranks and 192 ranks, tuned vs. acoll]

@MithunMohanKadavil (Contributor, Author)

@edgargabriel @lrbison Please review.

@janjust requested a review from lrbison on February 11, 2025 at 16:11.
@lrbison (Contributor) left a comment:

My biggest concern is that I don't think this will be correct for derived data types, and I didn't see a check for whether the buffers are on a GPU (maybe that's handled at the acoll module level?).

I struggled with these issues myself in my alltoallv implementation, and made what I consider to be a very exhaustive validation tester. It was trivial to extend it from alltoallv to also test alltoall. Would you mind giving it a try? open-mpi/ompi-tests-public#32

Let me also do some testing; in particular, I'm curious what would happen on a Graviton layout, which can have as many as 96 cores at the same L3 distance.

Comment on lines +48 to +55
if (comm_size <= 8) {
    if (total_dsize <= 128) {
        (*sync_enable) = true;
    } else {
        (*sync_enable) = false;
    }
    (*split_factor) = 2;
} else if (comm_size <= 16) {
A Contributor commented:

Ripe for a tuning file in the future...

@lrbison (Contributor) commented on Feb 14, 2025:

I tried my validator as follows and I see a validation error:

mpirun --mca coll_acoll_priority 100 -n 4 ./src/alltoallv_ddt -A alltoall -v 2

Looks like I can trigger it on specific tests: --only 2,2, --only 2,3, and --only 2,4 all show failures.

If you dig into the verbose prints you can see:

--- Starting test 2,2.  Crossing 0 x 1
Created span from 0:4.  Data from 0:4
Created span from 0:48.  Data from 0:48
Datatype      (send,recv) extents (4,48), size (4,48), and lb (0,0)
Datatype TRUE (send,recv) extents (4,48), size (4,48), and lb (0,0)
<nasty assertion fail abort ... >

The "Crossing 0 x 1" refers to send type and receive type respectively in this switch statement
So test 2,2 sends 12 4-byte MPI_INTs, and receives 1 DDT which is a MPI_Type_contiguous vector of 12 integers. The following tests make more and more strange DDTs for checking.
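For concreteness, here is a small self-contained reproduction of that send/recv type combination, written independently of the validator; the buffer contents and verification loop are illustrative assumptions, not the tester's actual code. Each rank sends 12 MPI_INTs per peer and receives one committed MPI_Type_contiguous datatype of 12 ints per peer, so the type signatures match even though the handles differ.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Receive side: one derived datatype covering 12 contiguous ints
     * (size and extent 48 bytes with 4-byte ints, as in the verbose
     * output above). */
    MPI_Datatype recv_ddt;
    MPI_Type_contiguous(12, MPI_INT, &recv_ddt);
    MPI_Type_commit(&recv_ddt);

    int *sendbuf = malloc((size_t)size * 12 * sizeof(int));
    int *recvbuf = malloc((size_t)size * 12 * sizeof(int));
    for (int i = 0; i < size * 12; i++) {
        sendbuf[i] = rank * 1000 + i;
        recvbuf[i] = -1;
    }

    /* Different type handles, identical type signatures: 12 MPI_INTs go
     * out per peer, one recv_ddt (12 ints) comes in per peer. */
    MPI_Alltoall(sendbuf, 12, MPI_INT, recvbuf, 1, recv_ddt, MPI_COMM_WORLD);

    /* Block j of recvbuf should hold the 12 ints that rank j sent to us. */
    int errors = 0;
    for (int j = 0; j < size; j++)
        for (int i = 0; i < 12; i++)
            if (recvbuf[j * 12 + i] != j * 1000 + rank * 12 + i)
                errors++;
    printf("rank %d: %s\n", rank, errors ? "MISMATCH" : "ok");

    MPI_Type_free(&recv_ddt);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return errors != 0;
}

Run with a handful of ranks (for example, mpirun -n 4 ./a.out); every rank should print ok when the underlying alltoall implementation handles derived receive datatypes correctly.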

@MithunMohanKadavil (Contributor, Author)

Hi @lrbison, thanks for sharing the tests. We are validating this code using the shared tests and will soon push an update fixing the issues and addressing the comments.

    (*sync_enable) = false;
    (*split_factor) = 2;
} else {
    (*sync_enable) = true;
A Member commented:

I would have assumed that the ring algorithm would be the best choice for large communicators and most data sizes. How did you build this decision function?

The Contributor (Author) replied:

On our hardware, we observed the linear variants performing better at lower node counts. For higher node counts it still needs to be tuned, and at that point this decision function will be updated.

@lrbison (Contributor) commented on Feb 27, 2025:

@MithunMohanKadavil Rather than requiring a code update to change your customers' tuning, would you take a look at this PR and see whether it has the features required to support a tuning file format for acoll? I haven't looked at exactly what tuning tree you use, but if you could sketch out how you would lay out the file structure, maybe we can make it compatible with tuned and move towards a generic tuning file format, with at least the outer structure shared across multiple collective components.

The Contributor (Author) replied:

Hi @lrbison, the framework in the PR does have the features required to support the tuning file. However, we have yet to decide on the file structure, that is, which parameters will be tunable and how it can be applied across the algorithms in acoll. Additionally, more tunable parameters may come up for us, so the tuning file structure still needs to be vetted on our side.

The Contributor replied:

> Additionally more tunable parameters

That is something we expect to address with the new JSON-format tuning. It should be easier to add a new parameter without breaking all the parsers.

@MithunMohanKadavil (Contributor, Author)

bot:retest

@jsquyres (Member)

@MithunMohanKadavil Can you rebase instead of merge? We tend to prefer that. Thanks!

@MithunMohanKadavil (Contributor, Author) commented on Mar 19, 2025:

@jsquyres Sorry, my bad. I will correct this from next time onward.

@jsquyres (Member)

@MithunMohanKadavil No worries. No need to close this PR -- you can just remove the merge commit and rebase.

@MithunMohanKadavil (Contributor, Author) commented on Mar 20, 2025:

@jsquyres Removed merge commit and updated the PR, thanks.

@jsquyres (Member)

> @jsquyres Removed merge commit and updated the PR, thanks.

Thank you!

Hello! The Git Commit Checker CI bot found a few problems with this PR:

ac77b0c: Using ompi_datatype_create_vector for sending non-...

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

Commit message:

A new parallel-split algorithm for MPI_Alltoall is introduced as part
of the acoll collective component, primarily targeting smaller message
sizes (<= 4KB). At a high level, the algorithm operates by dividing the
ranks into n groups, performing alltoall (using a base alltoall routine)
within the n groups in parallel, after which data is exchanged between
groups of n adjacent ranks (starting from rank 0). For example, if n=2,
this algorithm splits the ranks into 2 groups, one containing all
even-ranked processes and the other containing all odd-ranked processes.
Alltoall is performed within these 2 groups in parallel, after which
each adjacent even-odd pair (pairs being [0,1], [2,3], ...) exchanges
data to complete the Alltoall operation. If n=4 or n=8, alltoall is
performed within 4 or 8 groups in parallel. Following this step, groups
of 4 or 8 adjacent ranks (starting from 0) exchange data among
themselves to complete the alltoall operation.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
MithunMohanKadavil force-pushed the acoll_psplit_alltoall branch 2 times, most recently from ab41ae7 to 8686080, on April 9, 2025 at 11:22.
mshanthagit merged commit 42e302f into open-mpi:main on Apr 9, 2025.
15 checks passed