Rework the MPI_Op support. #9719

Merged: 2 commits merged into open-mpi:master on Jan 23, 2022
Conversation

@bosilca (Member) commented Dec 2, 2021

Remove all ops with 3 buffers; we ended up not using them anywhere in the code.
Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in #9717.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
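For context, a minimal sketch of the kind of loop rewrite under discussion (illustrative only, not the actual OMPI macro code; the function names are made up): a pointer-chasing reduction loop is hard for compilers to auto-vectorize, while the index-based form exposes the independent iterations.

/* Before (illustrative): pointer-chasing accumulate loop, opaque to
 * the auto-vectorizer. */
static void sum_float_old(const float *in, float *out, int count)
{
    while (count > 0) {
        *(out++) += *(in++);
        count--;
    }
}

/* After (illustrative): explicit indexing, each iteration independent,
 * so the compiler can emit SIMD code (adding `restrict` to the
 * pointers helps it prove there is no aliasing). */
static void sum_float_new(const float *in, float *out, int count)
{
    for (int i = 0; i < count; i++) {
        out[i] += in[i];
    }
}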

@gkatev (Contributor) commented Dec 2, 2021

To be honest, I use the 3-buff ops in my unpublished code. A memcpy + 2-buf might offer the same performance -- I don't know, I haven't checked. In any case they are a nice addition; if they are not too much trouble to maintain, could we leave them in?

(Edit: I could run some tests to see if they do offer a performance benefit.)

Commits:

Change the loop order in the base MPI_Op to allow for more optimizations, as discussed in open-mpi#9717.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>

Add the missing parameter in the help text.
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bosilca (Member, Author) commented Dec 2, 2021

I split the PR into 3 commits to keep the removal of the 3-buffer ops independent. Personally, I do not mind leaving them in; it is just that from the OMPI perspective it's dead code: we stopped using them a while back and nobody stepped up to maintain them.

Let me know if they offer any performance benefits. If they do, we will have a second reason to keep them around (in addition to having a user for them).

@gkatev (Contributor) commented Dec 2, 2021

I ran some tests:

Option 1

ompi_3buff_op_reduce(op, src2, src, dst, elements, dtype);

Option 2

memcpy(dst, src2, elements * dtype_size);
ompi_op_reduce(op, src, dst, elements, dtype);

Measuring total Allreduce latency, option 1 was faster by about 10-20 us (!) out of ~1000 us of total run time (stddev 5-10 us). Measuring the time it took for a specific rank to reduce 1 MB, option 1 took ~220 us, while option 2 took ~320 us.

Edit: If I run the Allreduce with only 4 ranks/cores instead of the entire node, the total run time again drops by ~20 us, from ~170 us to ~150 us, which is ~10% :-)
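As a rough way to reproduce the effect outside OMPI, here is a self-contained sketch (all names and sizes are my own; plain loops stand in for the ompi_op_reduce/ompi_3buff_op_reduce calls) contrasting the memory traffic of the two options on 1 MB of doubles. Option 2 writes dst twice and reads it once on top of the input reads, which plausibly accounts for the ~220 us vs ~320 us gap measured above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (1 << 17)   /* 131072 doubles = 1 MB */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double *src  = malloc(N * sizeof *src);
    double *src2 = malloc(N * sizeof *src2);
    double *dst  = malloc(N * sizeof *dst);
    for (int i = 0; i < N; i++) { src[i] = i; src2[i] = 2.0 * i; }

    /* Option 1 stand-in: fused three-buffer pass; each input is read
     * once and dst is written once. */
    double t = now_sec();
    for (int i = 0; i < N; i++) dst[i] = src[i] + src2[i];
    printf("3buff-style:        %8.1f us (check %g)\n",
           1e6 * (now_sec() - t), dst[N - 1]);

    /* Option 2 stand-in: seed dst with memcpy, then reduce in place;
     * dst is written twice and read once -- the extra pass over memory. */
    t = now_sec();
    memcpy(dst, src2, N * sizeof *dst);
    for (int i = 0; i < N; i++) dst[i] += src[i];
    printf("memcpy+2buff-style: %8.1f us (check %g)\n",
           1e6 * (now_sec() - t), dst[N - 1]);

    free(src); free(src2); free(dst);
    return 0;
}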

@jsquyres (Member) commented Dec 2, 2021

From reading the code, it looks ok to me, but I have not tested it. Could someone give a 👍 after testing?

@bosilca (Member, Author) commented Dec 2, 2021

What about the support for 3-buffer internal ops?

@bwbarrett (Member) commented

> What about the support for 3-buffer internal ops?

I think unless we're actively using them in OMPI (which it appears we stopped doing, because the tradeoffs favored simplicity over the small performance gain), we should remove them. They'll always be here in the history, and our license allows others to suck them into their code base, if that's the right thing for them.

@gkatev (Contributor) commented Dec 3, 2021

Well, how much of a hassle is it to keep them in, given that they are already implemented? Do they have any associated bugs?

Future code could potentially make use of them. In my case the performance difference didn't affect my bottom line, but perhaps under different circumstances, the ~1.5x improvement might.

From a pure "reduction operations API" perspective the 3-buff operations are a nice addition, and in situations where their use is warranted they are noticeably more performant.
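For readers following along, the shape of the two internal interfaces as used in the measurements above (argument names mine; operand-order semantics as I understand them, so treat this as a sketch rather than authoritative documentation):

/* 2-buffer form: accumulates in place, roughly dst = src <op> dst,
 * so combining two inputs requires seeding dst with an extra memcpy. */
ompi_op_reduce(op, src, dst, count, dtype);

/* 3-buffer form: separate output, roughly dst = src1 <op> src2;
 * both inputs are left untouched and no seeding pass is needed. */
ompi_3buff_op_reduce(op, src1, src2, dst, count, dtype);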

@bosilca (Member, Author) commented Dec 3, 2021

As they are never used, they were never tested, so you really got lucky that they do what they are supposed to do.

@bosilca (Member, Author) commented Dec 10, 2021

While we were talking about the 3-buffer reductions, someone proposed an extension to the MPI_Reduce_local API to allow the operation to be applied in an order different from inout <op> in.
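For reference, a minimal usage sketch of the standard call being discussed (buffer contents are arbitrary). The operand order MPI_Reduce_local applies is fixed by the standard; with a non-commutative user-defined op, that fixed order is exactly what the proposed extension would relax.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double in[4]    = {1, 2, 3, 4};
    double inout[4] = {10, 20, 30, 40};

    /* Applies the op element-wise between in and inout, storing the
     * result in inout. With MPI_SUM the operand order is irrelevant;
     * with a non-commutative user op it is not. */
    MPI_Reduce_local(in, inout, 4, MPI_DOUBLE, MPI_SUM);

    printf("%g %g %g %g\n", inout[0], inout[1], inout[2], inout[3]);

    MPI_Finalize();
    return 0;
}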

@awlauria (Contributor) commented Jan 12, 2022

Is this preferred over #9717? I.e., do we take one or the other?

@bosilca (Member, Author) commented Jan 12, 2022

This should not be merged as is; the commit removing the 3-buffer ops should be removed. Let me split this PR in 2.

@bosilca (Member, Author) commented Jan 12, 2022

The 3-buffer MPI_Op removal is now in #9867. This PR is ready to be discussed.

@jsquyres (Member) commented

bot:ompi:retest

@jsquyres merged commit 82746f2 into open-mpi:master on Jan 23, 2022
@bosilca deleted the fix/9717 branch on January 23, 2022 17:35