reduce_scatter and reduce_scatter_block algorithms incorrectly handling noncommutative ops #8010

Open
wckzhang opened this issue Aug 18, 2020 · 3 comments
@wckzhang
Contributor

Open MPI v4.1.x, installed from a git clone.

I ran the following tests from the ompi_tests repository:

ibm/collective/reduce_scatter_block_nocommute_stride
ibm/collective/reduce_scatter_block_nocommute_stride_in_place
ibm/collective/reduce_scatter_nocommute_stride
ibm/collective/reduce_scatter_nocommute_stride_in_place

These tests fail with the new default tuned algorithms. I manually selected each algorithm and found that for reduce_scatter (RS), algorithms 2, 3, and 4 (recursive_halving, ring, butterfly) fail these tests, and for reduce_scatter_block (RSB), algorithms 2 and 4 (recursive_doubling, butterfly) fail them.

The new fixed code only marks algorithm 3 (recursive halving) for RSB and algorithms 2 and 3 (recursive halving, ring) for RS as not supporting non-commutative ops. This doesn't match the test results: the failing algorithms need to be labelled as commutative-only, or fixed if they are supposed to handle non-commutative ops.
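
To illustrate why the operand order matters here, the following is a minimal sketch in plain C (no MPI required). It is a simplified model, not Open MPI's actual implementation: `affine_t`, `compose`, `reduce_rank_order`, and `reduce_rotated` are hypothetical names made up for this example. MPI requires that a non-commutative op be applied in ascending rank order; an algorithm that starts its accumulation at a different rank (as a ring-style schedule might) can produce a different result even though the op is associative.

```c
#include <assert.h>

/* Affine map x -> a*x + b. Composition is associative but NOT commutative,
 * so it stands in for a user-defined non-commutative MPI op. */
typedef struct { long a, b; } affine_t;

/* compose(f, g) means "apply f, then g": x -> g.a*(f.a*x + f.b) + g.b */
static affine_t compose(affine_t f, affine_t g) {
    affine_t r = { g.a * f.a, g.a * f.b + g.b };
    return r;
}

/* Order required for non-commutative ops: v[0] op v[1] op ... op v[n-1]. */
static affine_t reduce_rank_order(const affine_t *v, int n) {
    assert(n > 0);
    affine_t acc = v[0];
    for (int i = 1; i < n; i++) {
        acc = compose(acc, v[i]);
    }
    return acc;
}

/* Hypothetical rotated order: begin at rank `start` and wrap around.
 * This models an algorithm that ignores the rank-order requirement. */
static affine_t reduce_rotated(const affine_t *v, int n, int start) {
    assert(n > 0);
    affine_t acc = v[start % n];
    for (int i = 1; i < n; i++) {
        acc = compose(acc, v[(start + i) % n]);
    }
    return acc;
}

static int affine_equal(affine_t x, affine_t y) {
    return x.a == y.a && x.b == y.b;
}
```

With per-rank contributions `{2,1}`, `{3,0}`, `{1,5}`, `reduce_rank_order` yields `(6, 8)` while `reduce_rotated` starting at rank 1 yields `(6, 11)`. A mismatch of this kind is what the nocommute tests above are designed to catch.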

@wckzhang
Contributor Author

@bosilca Do you know if any of these algorithms are incorrectly labelled?

wckzhang added a commit to wckzhang/ompi that referenced this issue Aug 25, 2020
Reduce scatter block and reduce scatter algorithms were hitting
correctness issues for non commutative strided tests. We will revert to
the original default algorithms for those two collectives (basic linear
and non overlapping respectively) in the non commutative op case.

See open-mpi#8010

Signed-off-by: William Zhang <wilzhang@amazon.com>
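
The fallback described in this commit message can be sketched roughly as below. This is an illustrative model only, not the code from the commit: the function and enum names are hypothetical, IDs 2 to 4 follow the numbering quoted earlier in this issue, and the value 1 for the two fallback algorithms is an assumption.

```c
#include <stdbool.h>

/* IDs 2-4 follow the numbering quoted in this issue;
 * the value 1 for each fallback algorithm is an assumption. */
enum rs_alg {
    RS_NONOVERLAPPING = 1,
    RS_RECURSIVE_HALVING = 2,
    RS_RING = 3,
    RS_BUTTERFLY = 4
};

enum rsb_alg {
    RSB_BASIC_LINEAR = 1,
    RSB_RECURSIVE_DOUBLING = 2,
    RSB_RECURSIVE_HALVING = 3,
    RSB_BUTTERFLY = 4
};

/* reduce_scatter: keep the tuned choice only when the op commutes. */
static enum rs_alg pick_rs(bool op_commutes, enum rs_alg tuned_choice) {
    return op_commutes ? tuned_choice : RS_NONOVERLAPPING;
}

/* reduce_scatter_block: same idea, falling back to basic linear. */
static enum rsb_alg pick_rsb(bool op_commutes, enum rsb_alg tuned_choice) {
    return op_commutes ? tuned_choice : RSB_BASIC_LINEAR;
}
```

In a real MPI program, whether an op commutes can be queried with `MPI_Op_commutative`; the sketch takes it as a plain `bool` so it compiles without MPI.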
wckzhang added a commit to wckzhang/ompi that referenced this issue Aug 25, 2020
(cherry picked from commit 57b95bc)
@bwbarrett
Member

@wckzhang is there a reason this issue is still open?

@wckzhang
Contributor Author

I still need to investigate these algorithms and either relabel them as commutative-only or determine whether they have bugs in their behavior.

mdosanjh pushed a commit to mdosanjh/ompi that referenced this issue Mar 16, 2021