Skip to content

Conversation

hjelmn
Copy link
Member

@hjelmn hjelmn commented Oct 9, 2019

This commit updates the coll/basic component to correctly order sends
and receives for cartesian communicators with cyclic boundaries. This
addresses an issue identified by mpi-forum/mpi-issues#153. This issue
occurs when the size in any dimension is 1. This gives the same
neighbor in the positive and negative directions. The old code was
sending and receiving in the same order so the -1 buffer contained
the +1 result and vise-versa. The problem is addressed by using
unique tags for each send. This should cover both the case where
overtaking is allowed and is not allowed. The former case will be
possible if a MPI_Cart_create_with_info() call is added to the
standard.

Signed-off-by: Nathan Hjelm hjelmn@google.com

This commit updates the coll/basic component to correctly order sends
and receives for cartesian communicators with cyclic boundaries. This
addresses an issue identified by mpi-forum/mpi-issues#153. This issue
occurs when the size in any dimension is 1. This gives the same
neighbor in the positive and negative directions. The old code was
sending and receiving in the same order so the -1 buffer contained
the +1 result and vise-versa. The problem is addressed by using
unique tags for each send. This should cover both the case where
overtaking is allowed and is not allowed. The former case will be
possible is a MPI_Cart_create_with_info() call is added to the
standard.

Signed-off-by: Nathan Hjelm <hjelmn@google.com>
@hjelmn
Copy link
Member Author

hjelmn commented Oct 9, 2019

This doesn't fix the non-blocking variant. There may be additional issues with the other neighbor collectives.

@hjelmn
Copy link
Member Author

hjelmn commented Oct 10, 2019

@bosilca Can you please review.

@hjelmn hjelmn requested a review from bosilca October 10, 2019 01:12
@RolfRabenseifner
Copy link

First, the problem exists also for 2 processes in a direction (not only for 1).
Second, I would be very surprised if MPI_Cart_create_with_info() should allow this very wrong behavior.

@hjelmn
Copy link
Member Author

hjelmn commented Oct 10, 2019

@RolfRabenseifner Indeed. It can happen with 1-2 processes in a direction. Will update the commit message.

As for MPI_Cart_create_with_info, the way the allow overtaking rule is implemented we would run into the same issue if we fixed the issue just by re-ordering the receives. Using tags covers that case as well.

@RolfRabenseifner
Copy link

Be very careful: Two nonblocking MPI_NEIGHBOR_ALLTOALL still need for the same tags that the messages from process to process must not overtake.
In general, when implementing collective communication internally with MPI point-to-point then the overtaking must be switched off, which may be not a problem if the communicator is duplicated without duplicationg this info about overtaking is allowed.

Copy link
Member

@bosilca bosilca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also check the rest of the neighborhood collectives, they might all have the same issue.

nreqs++;
rc = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, srank,
MCA_COLL_BASE_TAG_ALLTOALL,
MCA_COLL_BASE_TAG_NONBLOCKING_BASE - 2 * dim,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tag range was reserved (as the name indicates) for nonblocking collectives. If we are using it for something else (as we should in this particular case) I would suggest we rename it to something more meaningful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. We can add another reserved range to be safe. I can add that to this PR.

@bosilca
Copy link
Member

bosilca commented Oct 10, 2019

@RolfRabenseifner, as indicated in Nathan's comment this patch only applies to blocking collectives.

@hjelmn
Copy link
Member Author

hjelmn commented Oct 10, 2019

@bosilca Yup. The non-blocking fix is much more complicated. This is low hanging fruit.

@bosilca
Copy link
Member

bosilca commented Oct 10, 2019

@hjelmn we might need to have more well-defined ranges for collectives, and have a sliding window of tags to prevent the issues with concurrent collectives.

@bosilca bosilca mentioned this pull request Dec 11, 2019
@jsquyres
Copy link
Member

This PR was replaced by #7232. Closing without merging.

@jsquyres jsquyres closed this Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants