Rework the MPI_Op support. #9719
Conversation
To be honest, I use the 3-buffer ops in my non-published code. A memcpy plus a 2-buffer op might potentially offer the same performance -- I don't know, I haven't checked. In any case they are a nice addition; if they are not too much trouble to maintain, could we leave them in? (Edit: I could run some tests to see whether they do offer a performance benefit.)
optimizations, as discussed in open-mpi#9717. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Add the missing parameter in the help text. Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
I split the PR into 3 commits to keep the removal of the 3-buffer ops independent. Personally, I do not mind leaving them in; it is just that, from the OMPI perspective, it's dead code: we stopped using them a while back and nobody stepped up to maintain them. Let me know if they offer any performance benefit. If they do, we will have a second reason to keep them around (in addition to having a user for them).
I ran some tests:

Option 1

Option 2

Measuring total Allreduce latency, option 1 was better by about

Edit: If I run the Allreduce with only 4 ranks/cores instead of the entire node, the total run-time again drops by
From reading the code, it looks OK to me, but I have not tested it. Could someone give a 👍 after testing?
What about the support for the 3-buffer internal ops?
I think unless we're actively using them in OMPI (which it appears we stopped doing, because the tradeoffs favored simplicity over the small performance gain), we should remove them. They'll always be here in the history, and our license allows others to pull them into their code base if that's the right thing for them.
Well, how much of a hassle is it to keep them in, given that they are already implemented? Do they have any associated bugs? Future code could potentially make use of them. In my case the performance difference didn't affect my bottom line, but under different circumstances the ~1.5x improvement might. From a pure "reduction operations API" perspective the 3-buffer operations are a nice addition, and in situations where their use is warranted they are noticeably more performant.
As they are never used, they were never tested, so you really got lucky that they do what they are supposed to do.
While we were talking about the 3-buffer reductions, someone proposed an extension to the MPI_Reduce_local API to allow the operations to be applied in an order different than
Is this preferred over #9717? I.e., do we take one or the other?
This should not be merged as is; the commit removing the 3-buffer ops should be removed. Let me split this PR in 2.
The 3-buffer MPI_Op removal is now in #9867. This PR is ready to be discussed.
bot:ompi:retest |
Remove all ops with 3 buffers, we ended up not using them anywhere in
the code.
Change the loop order in the base MPI_Op to allow for more
optimizations, as discussed in #9717.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>