Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alltoall{,v,w} IN_PLACE fixes #9668

Merged
merged 3 commits into from
Nov 16, 2021

Conversation

bwbarrett
Copy link
Member

b9012a3 used the alltoallw interpretation of rdisps instead of the
alltoall/alltoallv interpretation. According to the MPI standard,
the byte displacement is recvbuf + rdispls[i] * extent(recvtype) for
alltoall and alltoallv, but is recvbuf + rdispls[i] for alltoallw.

The tag used for alltoallv_in_place was also accidently changed to
TAG_ALLTOALLW in the same commit, so change that back to
TAG_ALLTOALLV.

Signed-off-by: Brian Barrett bbarrett@amazon.com

@bwbarrett
Copy link
Member Author

This should fix #9501, allowing us to put #9329 to bed.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I must have switched to something else in the middle of the patch, because I correctly use the extent for the Alltoall, but not for alltoallv. Nice catch.

I went back to my commit and realized that I used the wrong tag for the Alltoall. Would you mind making that change as well ?

@bwbarrett
Copy link
Member Author

I went back to my commit and realized that I used the wrong tag for the Alltoall. Would you mind making that change as well ?

Doh! I should have checked that. Will update.

@bwbarrett bwbarrett marked this pull request as draft November 16, 2021 04:45
@bwbarrett
Copy link
Member Author

Moving to draft. This fixes the IBM alltoallv_in_place test, but alltoallv_somezeros now hangs. So this needs a bit more work.

@bosilca bosilca force-pushed the bugfix/alltoallv_in_place branch from 8c1691b to ab201da Compare November 16, 2021 08:28
@bosilca
Copy link
Member

bosilca commented Nov 16, 2021

I found the issue. We are issuing 2 receives per iteration (depending on the counts for the 2 neighbors), and we need to wait for both of them to complete before moving to the next iteration. Thus, if 0 == rcounts[right] but 0 != rcounts[left] we will skip the last wait in the loop. There are two outcomes possible:

  1. We might overwrite the send buffer of the next neighbors leading to incorrect result in some corner cases (basically when the left at one iteration becomes the right at the next).
  2. We might deadlock because some requests are not waited on, so MPI_Finalize might block.

@bwbarrett bwbarrett force-pushed the bugfix/alltoallv_in_place branch from ab201da to 7e83c94 Compare November 16, 2021 16:11
@bwbarrett bwbarrett marked this pull request as ready for review November 16, 2021 16:24
bwbarrett and others added 3 commits November 16, 2021 16:25
b9012a3 accidently set the message tags for the alltoall and
alltoallv in place algorithms to the alltoallw tag.  Undo that
change.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
b9012a3 used the alltoallw interpretation of rdisps instead of the
alltoall/alltoallv interpretation.  According to the MPI standard,
the byte displacement is recvbuf + rdispls[i] * extent(recvtype) for
alltoall and alltoallv, but is recvbuf + rdispls[i] for alltoallw.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Without waiting for the last receive before going to the next iteration we
might overwrite data if the current left neighbor become the right at the next
iteration.

Start with an MPI_REQUEST_NULL request.
If the request is not NULL and the first rcounts is 0, the ompi_request_wait
will segfault.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
@bwbarrett bwbarrett force-pushed the bugfix/alltoallv_in_place branch from 7e83c94 to 6802702 Compare November 16, 2021 16:25
@bwbarrett bwbarrett changed the title Fix disp. calculation in IN_PLACE alltoallv Alltoall{,v,w} IN_PLACE fixes Nov 16, 2021
@bwbarrett
Copy link
Member Author

@bosilca can you re-review after the cleanups?

@bwbarrett bwbarrett merged commit 7c9ab0a into open-mpi:master Nov 16, 2021
@bwbarrett bwbarrett deleted the bugfix/alltoallv_in_place branch November 16, 2021 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants