reduce_scatter and reduce_scatter_block algorithms incorrectly handling noncommutative ops #8010
@bosilca Do you know if any of these algorithms are incorrectly labelled?
wckzhang added a commit to wckzhang/ompi that referenced this issue on Aug 25, 2020
Reduce scatter block and reduce scatter algorithms were hitting correctness issues for non commutative strided tests. We will revert to the original default algorithms for those two collectives (basic linear and non overlapping respectively) in the non commutative op case. See open-mpi#8010 Signed-off-by: William Zhang <wilzhang@amazon.com>
wckzhang added a commit to wckzhang/ompi that referenced this issue on Aug 25, 2020
Reduce scatter block and reduce scatter algorithms were hitting correctness issues for non commutative strided tests. We will revert to the original default algorithms for those two collectives (basic linear and non overlapping respectively) in the non commutative op case. See open-mpi#8010 Signed-off-by: William Zhang <wilzhang@amazon.com> (cherry picked from commit 57b95bc)
@wckzhang is there a reason this issue is still open?
We still need to investigate these algorithms and either relabel them as commutative-only or determine whether they have bugs in their behavior.
mdosanjh pushed a commit to mdosanjh/ompi that referenced this issue on Mar 16, 2021
Reduce scatter block and reduce scatter algorithms were hitting correctness issues for non commutative strided tests. We will revert to the original default algorithms for those two collectives (basic linear and non overlapping respectively) in the non commutative op case. See open-mpi#8010 Signed-off-by: William Zhang <wilzhang@amazon.com>
OMPI v4.1.x installed from git clone
Ran ompi_tests repository tests:
ibm/collective/reduce_scatter_block_nocommute_stride
ibm/collective/reduce_scatter_block_nocommute_stride_in_place
ibm/collective/reduce_scatter_nocommute_stride
ibm/collective/reduce_scatter_nocommute_stride_in_place
These tests fail with the new default tuned algorithms. Selecting algorithms manually, I found that for reduce_scatter (RS), algorithms 2, 3, and 4 (recursive_halving, ring, butterfly) fail these tests, and for reduce_scatter_block (RSB), algorithms 2 and 4 (recursive_doubling, butterfly) fail them.
The new fixed code only marks algorithm 3 (recursive halving) for RSB and algorithms 2 and 3 (recursive halving, ring) for RS as not supporting non-commutative ops. This doesn't match the test results: these algorithms need to be labelled as commutative-only, or fixed if they are supposed to handle non-commutative ops.
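For anyone following along, here is a toy illustration of why the combine order matters for these tests. It is only a sketch, not Open MPI code: `concat`, `reduce_in_rank_order`, and `reduce_in_ring_order` are hypothetical names, string concatenation stands in for a non-commutative (but associative) user op, and the ring schedule is an assumed simplification rather than the actual tuned ring implementation.

```python
from functools import reduce


def concat(left, right):
    """Toy reduction op: associative but not commutative."""
    return left + right


def reduce_in_rank_order(contribs, op):
    """Fold contributions as ((c0 op c1) op c2) ..., i.e. in ascending
    rank order, which MPI requires for non-commutative ops."""
    return reduce(op, contribs)


def reduce_in_ring_order(contribs, op, dest):
    """Hypothetical ring schedule (an assumption, not Open MPI's code):
    the partial result for `dest` starts at rank dest + 1 and visits
    every rank once, ending at `dest`."""
    p = len(contribs)
    order = [(dest + 1 + step) % p for step in range(p)]
    return reduce(op, (contribs[r] for r in order))


contribs = ["a", "b", "c", "d"]  # one contribution per rank, ranks 0..3
expected = reduce_in_rank_order(contribs, concat)
ring = [reduce_in_ring_order(contribs, concat, dest) for dest in range(4)]

print(expected)  # abcd
print(ring)      # ['bcda', 'cdab', 'dabc', 'abcd']
```

In this model only the last destination happens to see the rank-ordered result; every other block is combined in a rotated order, which a commutative op would hide and a non-commutative test would catch.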