coll/acoll: Add support for MPI_Alltoall() #13046
Force-pushed from 81c368e to c117925.
@edgargabriel @lrbison Please review.
My biggest concern is that I don't think this will be correct for derived data types, and I didn't see a catch to check if the buffers are on a GPU (maybe that's at the acoll module level?)
I struggled with these issues myself in my alltoallv implementation, and made what I consider to be a very exhaustive validation tester. It was trivial to extend it from alltoallv to also test alltoall. Would you mind giving it a try? open-mpi/ompi-tests-public#32
Let me also do some testing; in particular, I'm curious what would happen on a Graviton layout, which can have as many as 96 cores with the same L3 distance.
```c
if (comm_size <= 8) {
    if (total_dsize <= 128) {
        (*sync_enable) = true;
    } else {
        (*sync_enable) = false;
    }
    (*split_factor) = 2;
} else if (comm_size <= 16) {
```
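For reference, a self-contained sketch of what a threshold-based decision function of this shape looks like. Only the `comm_size <= 8` branch mirrors the diff hunk above; the remaining thresholds, values, and the function name are illustrative placeholders, not the actual acoll tuning:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a threshold-based decision function in the spirit of the
 * hunk above. The comm_size <= 8 branch matches the diff; the other
 * branches are hypothetical placeholders pending tuning. */
static void choose_alltoall_params(int comm_size, size_t total_dsize,
                                   bool *sync_enable, int *split_factor)
{
    if (comm_size <= 8) {
        /* Values taken from the diff hunk under review. */
        *sync_enable = (total_dsize <= 128);
        *split_factor = 2;
    } else if (comm_size <= 16) {
        /* Hypothetical: larger communicators get a larger split factor. */
        *sync_enable = (total_dsize <= 64);
        *split_factor = 4;
    } else {
        /* Hypothetical default for large communicators. */
        *sync_enable = false;
        *split_factor = 8;
    }
}
```

Encoding the cutoffs in one small pure function keeps them easy to lift into a tuning file later, since the (comm_size, message size) -> (parameters) mapping is explicit.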
ripe for a tuning file in the future...
I tried my validator as follows and I see a validation error:
Looks like I can trigger it on specific tests. If you dig into the verbose prints you can see:
The "Crossing 0 x 1" refers to the send type and receive type, respectively, in this switch statement.
Hi @lrbison, thanks for sharing the tests. We are testing/validating this code using the shared tests and will soon push an update fixing the issues and addressing the comments.
```c
    (*sync_enable) = false;
    (*split_factor) = 2;
} else {
    (*sync_enable) = true;
```
I would have assumed that the ring algorithm would be the best choice for large communicators and most of the data sizes. How did you build this decision function?
On our hardware, we observed the linear variants to perform better for lower node counts. For higher node counts, however, the decision function still needs to be tuned, and it will be updated at that point.
@MithunMohanKadavil Rather than requiring a code update to change your customers' tuning, would you take a look at this PR and see whether it has the required features to support a tuning file format for acoll? I haven't looked at exactly what tuning tree you use, but if you could sketch out how you would lay out the file structure, maybe we can make it compatible with tuned and move toward a generic tuning file format, with at least the outer structure shared across multiple collective components.
Hi @lrbison, the framework in the PR does have the features required to support the tuning file. However, we have yet to decide on the file structure, i.e., which parameters will be tunable and how they can be applied across the algorithms in acoll. Additional tunable parameters may also come up for us, so the tuning file structure still needs to be vetted on our side.
> Additionally more tunable parameters
That is something we expect to address with the new json-format tuning. Should be easier to add a new parameter without breaking all the parsers.
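For concreteness, here is one hypothetical sketch of what a JSON tuning entry for acoll's alltoall could look like. The schema is explicitly undecided per the discussion above; every key name, threshold, and value here is illustrative, not an agreed format:

```json
{
  "component": "acoll",
  "collective": "alltoall",
  "rules": [
    { "comm_size_max": 8,  "msg_size_max": 128, "sync_enable": true,  "split_factor": 2 },
    { "comm_size_max": 8,  "msg_size_max": -1,  "sync_enable": false, "split_factor": 2 },
    { "comm_size_max": -1, "msg_size_max": -1,  "algorithm": "linear_xpmem" }
  ]
}
```

A rule list ordered by (comm_size, message size) cutoffs, with -1 as "no limit", would map naturally onto threshold-style decision functions like the one in this PR, and a JSON container makes it possible to add new per-algorithm keys later without breaking existing parsers.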
Force-pushed from c117925 to d25450f.
bot:retest
@MithunMohanKadavil Can you rebase instead of merge? We tend to prefer that. Thanks!
@jsquyres Sorry, my bad. Will correct next time onwards.
@MithunMohanKadavil No worries. No need to close this PR -- you can just remove the merge commit and rebase.
Force-pushed from 9e00186 to 9aafefa.
@jsquyres Removed merge commit and updated the PR, thanks.
Thank you!
Hello! The Git Commit Checker CI bot found a few problems with this PR: ac77b0c: Using ompi_datatype_create_vector for sending non-...
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from ac77b0c to fb20578.
A new parallel-split algorithm for MPI_Alltoall is introduced as part of the acoll collective component, primarily targeting smaller message sizes (<= 4KB). The algorithm, at a high level, operates by dividing the ranks into n groups, performing alltoall (using a base alltoall routine) within the n groups in parallel, after which data is exchanged between groups of n adjacent ranks (starting from rank 0). For example, if n=2, this algorithm splits the ranks into 2 groups, one containing all even-ranked processes and another containing all odd-ranked processes. Alltoall is performed within these 2 groups in parallel, after which adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the Alltoall operation. If n=4 or n=8, alltoall is performed within 4 or 8 groups in parallel. Following this step, groups of 4 or 8 adjacent ranks (starting from 0) exchange data among themselves to complete the alltoall operation.

Signed-off-by: Mithun Mohan <MithunMohan.KadavilMadanaMohanan@amd.com>
Force-pushed from ab41ae7 to 8686080.
This PR adds support for MPI_Alltoall() in the acoll collective component.
For messages smaller than a few KBs, the algorithm operates by dividing the n (n = comm size) ranks into f groups (based on the value of rank % f), performing alltoall within the f groups in parallel, after which data is exchanged between groups of f adjacent ranks (starting from rank 0). For example, if f=2, this algorithm splits the n ranks into 2 groups, one containing all even-ranked (rank%2 = 0) processes and another containing all odd-ranked (rank%2 = 1) processes. After alltoall is done within these 2 groups (in parallel), adjacent even-odd pairs (pairs being [0,1], [2,3], ...) exchange data to complete the MPI_Alltoall operation. If f=4 or f=8, alltoall is performed in parallel for 4 or 8 groups respectively, followed by data exchange among groups of 4 or 8 adjacent ranks.

The diagram above captures this algorithm for the case where f=2 and n=8.
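As a sanity check on the description above, the two phases can be simulated in a few lines of Python (an illustrative model of the data movement, not the C implementation; it assumes f divides the communicator size):

```python
def parallel_split_alltoall(send, f):
    """Simulate the parallel-split alltoall sketched above.

    send[p][t] is the block rank p wants delivered to rank t. Returns
    recv, where recv[r][p] is the block rank r ends up holding from
    rank p. Illustrative only; assumes f divides the comm size n.
    """
    n = len(send)
    assert n % f == 0

    # Phase 1: alltoall within each group (ranks sharing rank % f).
    # Each group peer p hands rank q the f blocks destined for q's
    # adjacent block of ranks [base, base + f), base = (q // f) * f.
    phase1 = [dict() for _ in range(n)]   # phase1[q][(p, t)] = block p -> t
    for q in range(n):
        base = (q // f) * f
        for p in range(q % f, n, f):      # p is in q's group
            for t in range(base, base + f):
                phase1[q][(p, t)] = send[p][t]

    # Phase 2: exchange among each set of f adjacent ranks, so rank r
    # collects, from each adjacent q, the blocks addressed to r that
    # arrived from q's group in phase 1.
    recv = [[None] * n for _ in range(n)]
    for r in range(n):
        base = (r // f) * f
        for q in range(base, base + f):   # q is adjacent to r
            for p in range(q % f, n, f):
                recv[r][p] = phase1[q][(p, r)]
    return recv
```

Running this for n=8 with f in {2, 4} and comparing against the direct transpose (recv[r][p] == send[p][r]) confirms the two-phase exchange reproduces a full alltoall.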
For the larger message size range, a direct xpmem-based copy is used in a linear fashion across all ranks.
The graphs below show the variation in latencies with osu-micro-benchmarks-7.3 for 96 and 192 ranks, for tuned and acoll:

