Rework the MPI_Op support. #9719

Merged: 2 commits merged into open-mpi:master on Jan 23, 2022
Conversation

@bosilca (Member) commented Dec 2, 2021

Remove all ops with 3 buffers; we ended up not using them anywhere in the code.
Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in #9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
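For context, a minimal sketch of the kind of loop rewrite under discussion (illustrative only, not the actual OMPI macro code; the function names are made up): a pointer-chasing reduction loop is hard for compilers to auto-vectorize, while the index-based form exposes the independent iterations.

/* Before (illustrative): pointer-chasing accumulate loop, opaque to
 * the auto-vectorizer. */
static void sum_float_old(const float *in, float *out, int count)
{
    while (count > 0) {
        *(out++) += *(in++);
        count--;
    }
}

/* After (illustrative): explicit indexing, each iteration independent,
 * so the compiler can emit SIMD code (adding `restrict` to the
 * pointers helps it prove there is no aliasing). */
static void sum_float_new(const float *in, float *out, int count)
{
    for (int i = 0; i < count; i++) {
        out[i] += in[i];
    }
}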

@gkatev (Contributor) commented Dec 2, 2021

To be honest, I use the 3-buff ops in my unpublished code. A memcpy + 2-buf might offer the same performance -- I don't know, I haven't checked. In any case they are a nice addition; if they are not too much trouble to maintain, could we leave them in?

(Edit: I could run some tests to see if they do offer a performance benefit.)

Commits:

Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Add the missing parameter in the help text.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca (Member, Author) commented Dec 2, 2021

I split the PR into 3 commits to keep the removal of the 3-buffer ops independent. Personally, I do not mind leaving them in; it is just that from the OMPI perspective it's dead code: we stopped using them a while back and nobody stepped up to maintain them.

Let me know if they offer any performance benefits. If they do, we will have a second reason to keep them around (in addition to having a user for them).

@gkatev (Contributor) commented Dec 2, 2021

I ran some tests:

Option 1

ompi_3buff_op_reduce(op, src2, src, dst, elements, dtype);

Option 2

memcpy(dst, src2, elements * dtype_size);
ompi_op_reduce(op, src, dst, elements, dtype);

Measuring total Allreduce latency, option 1 was faster by about 10-20 us (!) out of ~1000 us of total run time (stddev 5-10 us). Measuring the time it took for a specific rank to reduce 1 MB, option 1 took ~220 us, while option 2 took ~320 us.

Edit: If I run the Allreduce with only 4 ranks/cores instead of the entire node, the total run time again drops by ~20 us, from ~170 us to ~150 us, which is ~10% :-)
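As a rough way to reproduce the effect outside OMPI, here is a self-contained sketch (all names and sizes are my own; plain loops stand in for the ompi_op_reduce/ompi_3buff_op_reduce calls) contrasting the memory traffic of the two options on 1 MB of doubles. Option 2 writes dst twice and reads it once on top of the input reads, which plausibly accounts for the ~220 us vs ~320 us gap measured above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (1 << 17)   /* 131072 doubles = 1 MB */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double *src  = malloc(N * sizeof *src);
    double *src2 = malloc(N * sizeof *src2);
    double *dst  = malloc(N * sizeof *dst);
    for (int i = 0; i < N; i++) { src[i] = i; src2[i] = 2.0 * i; }

    /* Option 1 stand-in: fused three-buffer pass; each input is read
     * once and dst is written once. */
    double t = now_sec();
    for (int i = 0; i < N; i++) dst[i] = src[i] + src2[i];
    printf("3buff-style:        %8.1f us (check %g)\n",
           1e6 * (now_sec() - t), dst[N - 1]);

    /* Option 2 stand-in: seed dst with memcpy, then reduce in place;
     * dst is written twice and read once -- the extra pass over memory. */
    t = now_sec();
    memcpy(dst, src2, N * sizeof *dst);
    for (int i = 0; i < N; i++) dst[i] += src[i];
    printf("memcpy+2buff-style: %8.1f us (check %g)\n",
           1e6 * (now_sec() - t), dst[N - 1]);

    free(src); free(src2); free(dst);
    return 0;
}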

@jsquyres (Member) commented Dec 2, 2021

From reading the code, it looks ok to me, but I have not tested it. Could someone give a 👍 after testing?

@bosilca (Member, Author) commented Dec 2, 2021

What about the support for 3-buffer internal ops?

@bwbarrett (Member) commented

> What about the support for 3-buffer internal ops?

I think unless we're actively using them in OMPI (which it appears we stopped doing, because the tradeoffs favored simplicity over the small performance gain), we should remove them. They'll always be here in the history, and our license allows others to suck them into their code base, if that's the right thing for them.

@gkatev (Contributor) commented Dec 3, 2021

Well, how much of a hassle is it to keep them in, given that they are already implemented? Do they have any associated bugs?

Future code could potentially make use of them. In my case the performance difference didn't affect my bottom line, but perhaps under different circumstances, the ~1.5x improvement might.

From a pure "reduction operations API" perspective the 3-buff operations are a nice addition, and in situations where their use is warranted they are noticeably more performant.
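For readers following along, the shape of the two internal interfaces as used in the measurements above (argument names mine; operand-order semantics as I understand them, so treat this as a sketch rather than authoritative documentation):

/* 2-buffer form: accumulates in place, roughly dst = src <op> dst,
 * so combining two inputs requires seeding dst with an extra memcpy. */
ompi_op_reduce(op, src, dst, count, dtype);

/* 3-buffer form: separate output, roughly dst = src1 <op> src2;
 * both inputs are left untouched and no seeding pass is needed. */
ompi_3buff_op_reduce(op, src1, src2, dst, count, dtype);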

@bosilca (Member, Author) commented Dec 3, 2021

As they are never used, they were never tested, so you really got lucky that they do what they are supposed to do.

@bosilca (Member, Author) commented Dec 10, 2021

While we were talking about the 3-buffer reductions, someone proposed an extension to the MPI_Reduce_local API to allow the operation to be applied in an order different from inout <op> in.
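For reference, a minimal usage sketch of the standard call being discussed (buffer contents are arbitrary). The operand order MPI_Reduce_local applies is fixed by the standard; with a non-commutative user-defined op, that fixed order is exactly what the proposed extension would relax.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double in[4]    = {1, 2, 3, 4};
    double inout[4] = {10, 20, 30, 40};

    /* Applies the op element-wise between in and inout, storing the
     * result in inout. With MPI_SUM the operand order is irrelevant;
     * with a non-commutative user op it is not. */
    MPI_Reduce_local(in, inout, 4, MPI_DOUBLE, MPI_SUM);

    printf("%g %g %g %g\n", inout[0], inout[1], inout[2], inout[3]);

    MPI_Finalize();
    return 0;
}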

@awlauria (Contributor) commented Jan 12, 2022

Is this preferred over #9717? I.e., do we take one or the other?

@bosilca (Member, Author) commented Jan 12, 2022

This should not be merged as is; the commit removing the 3-buffer ops should be removed. Let me split this PR in 2.

@bosilca (Member, Author) commented Jan 12, 2022

The 3-buffer MPI_Op removal is now in #9867. This PR is ready to be discussed.

@jsquyres (Member) commented

bot:ompi:retest

@jsquyres merged commit 82746f2 into open-mpi:master on Jan 23, 2022
@bosilca deleted the fix/9717 branch on January 23, 2022 17:35