Improve MPI_Comm_split_type scalability #1873
Conversation
@bosilca The idea behind this change is that most programs are not going to either reorder or drop ranks (with MPI_UNDEFINED). We can optimize the most common case and leave out the global allgather operation.
There is a slight slowdown for small communicators when reordering (dropping ranks should behave similarly), but overall the performance looks a lot better than with the old algorithm. This benchmark runs MPI_Comm_split_type in a loop 1000 times and takes the average time to complete. The lines without reordering all give …
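For reference, a minimal sketch of this kind of micro-benchmark, assuming MPI_COMM_TYPE_SHARED as the split type and key = rank for the no-reordering case (the actual benchmark setup may differ):

```c
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    const int iterations = 1000;   /* assumed iteration count from the comment */
    MPI_Comm newcomm;
    int rank;
    double start, avg;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Barrier (MPI_COMM_WORLD);
    start = MPI_Wtime ();

    for (int i = 0 ; i < iterations ; ++i) {
        /* key = rank keeps the original order (the no-reordering case) */
        MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                             MPI_INFO_NULL, &newcomm);
        MPI_Comm_free (&newcomm);
    }

    avg = (MPI_Wtime () - start) / iterations;
    if (0 == rank) {
        printf ("average MPI_Comm_split_type time: %g s\n", avg);
    }

    MPI_Finalize ();
    return 0;
}
```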
I'm not sure why there is concern over the performance of comm_split, but @hjelmn did raise a question with me about the time lost in looking up hostnames when that operation is tagged as "optional". Looking into it, the time appears to be spent performing several hash table lookups because we don't have general rules about naming conventions. Specifically, we could improve this algorithm by:
I cannot see a way to eliminate more than this, but the combination should reduce the time by nearly 70%, and I don't think these rules would be particularly onerous. Is this worth pursuing?
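For illustration, a hedged sketch of one possible mitigation: cache the hostname per peer so the repeated hash-table lookups happen only once. PMIx_Get and PMIX_HOSTNAME are the real client API; the peer_cache structure and peer_hostname helper are hypothetical, and error handling is elided.

```c
#include <pmix.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical per-peer cache entry kept alongside the proc structure */
struct peer_cache {
    char *hostname;   /* NULL until the first lookup */
};

static const char *peer_hostname (struct peer_cache *cache,
                                  const pmix_proc_t *peer)
{
    pmix_value_t *val = NULL;

    if (NULL == cache->hostname) {
        /* pay the hash-table lookup cost once, then reuse the result */
        if (PMIX_SUCCESS == PMIx_Get (peer, PMIX_HOSTNAME, NULL, 0, &val)) {
            cache->hostname = strdup (val->data.string);
            PMIX_VALUE_RELEASE (val);
        }
    }

    return cache->hostname;
}
```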
This is comm_split_type, so it is a little more important to optimize it. At a minimum, it is not a great idea to do an allgather on a large communicator when any rank only needs to know the result on a small subset of ranks. BTW, in order to do a fair comparison of old vs. new I removed the modex receive for the old algorithm as well. On master, no ompi_proc_t means not node-local.
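A minimal sketch of that locality shortcut, assuming the behavior stated above (a peer with no ompi_proc_t cannot be node-local); the lookup and flag helpers below are hypothetical stand-ins for the Open MPI internals:

```c
#include <stdbool.h>
#include <stddef.h>

/* assumed stand-ins for Open MPI internals; real names/signatures differ */
struct ompi_proc_t;
extern struct ompi_proc_t *proc_lookup_no_allocate (int peer_rank);
extern bool proc_is_on_local_node (struct ompi_proc_t *proc);

static bool peer_is_node_local (int peer_rank)
{
    /* look up the proc without triggering a modex receive or allocation */
    struct ompi_proc_t *proc = proc_lookup_no_allocate (peer_rank);

    /* no ompi_proc_t yet means the peer was never discovered locally,
     * which on master implies it is not on this node */
    if (NULL == proc) {
        return false;
    }

    return proc_is_on_local_node (proc);
}
```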
Okay, I'll propose adding those requirements to the PMIx specification; if nobody objects, I can update the algorithm.
force-pushed from ce77d0f to 80d1856
This commit simplifies the communicator context ID generation by removing the blocking code. The high-level calls ompi_comm_nextcid and ompi_comm_activate remain but now call the non-blocking variants and wait on the resulting request. This was done to remove the parallel paths for context ID generation in preparation for further improvements to the CID generation code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
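A hedged sketch of the pattern this commit describes: the blocking entry point now just starts the non-blocking variant and waits on the resulting request. The function names follow the commit message, but the signatures here are simplified assumptions, not the real Open MPI prototypes.

```c
struct ompi_communicator_t;
struct ompi_request_t;

/* assumed simplified prototypes; the real ones carry more arguments */
extern int ompi_comm_nextcid_nb (struct ompi_communicator_t *newcomm,
                                 struct ompi_communicator_t *comm,
                                 struct ompi_request_t **req);
extern int wait_on_request (struct ompi_request_t *req);

/* blocking wrapper: start the non-blocking CID negotiation, then wait */
int ompi_comm_nextcid (struct ompi_communicator_t *newcomm,
                       struct ompi_communicator_t *comm)
{
    struct ompi_request_t *req;
    int rc;

    rc = ompi_comm_nextcid_nb (newcomm, comm, &req);
    if (0 != rc) {   /* OMPI_SUCCESS == 0 */
        return rc;
    }

    return wait_on_request (req);
}
```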
This commit introduces a new algorithm for MPI_Comm_split_type. The old algorithm performed an allgather on the communicator to decide which processes were part of the new communicators. This does not scale well in either time or memory.

The new algorithm performs a couple of allreduces to determine the global parameters of the MPI_Comm_split_type call. If any rank gives an inconsistent split_type (as defined by the standard) an error is returned without proceeding further. The algorithm then creates a communicator with all the ranks that match the split_type (no communication required) in the same order as the original communicator. It then does an allgather on the new communicator (which should be much smaller) to determine 1) whether the new communicator is in the correct order, and 2) whether any ranks in the new communicator supplied MPI_UNDEFINED as the split_type. If either condition is detected, the new communicator is split using ompi_comm_split and the intermediate communicator is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
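To make the control flow concrete, here is a hedged sketch written against the public MPI interface rather than the Open MPI internals. rank_matches_split_type() is a hypothetical stand-in for the purely local membership test (e.g. node locality from the proc flags), MPI_Comm_create_group stands in for the internal no-communication communicator creation, and error handling is elided; this is not the PR's actual code.

```c
#include <mpi.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* hypothetical local membership test (e.g. node locality from proc flags) */
extern bool rank_matches_split_type (MPI_Comm comm, int split_type, int rank);

int comm_split_type_sketch (MPI_Comm comm, int split_type, int key,
                            MPI_Comm *newcomm)
{
    int rank, size, contrib[2], global[2];

    MPI_Comm_rank (comm, &rank);
    MPI_Comm_size (comm, &size);

    /* step 1: one allreduce yields the max of split_type and (negated) the
     * min over all ranks that did not pass MPI_UNDEFINED; a mismatch means
     * the types are inconsistent as defined by the standard */
    if (MPI_UNDEFINED == split_type) {
        contrib[0] = contrib[1] = INT_MIN;          /* neutral under MPI_MAX */
    } else {
        contrib[0] = split_type;
        contrib[1] = -split_type;
    }
    MPI_Allreduce (contrib, global, 2, MPI_INT, MPI_MAX, comm);
    if (INT_MIN == global[0]) {                     /* everyone undefined */
        *newcomm = MPI_COMM_NULL;
        return MPI_SUCCESS;
    }
    if (global[0] != -global[1]) {
        return MPI_ERR_ARG;
    }

    /* non-members return MPI_COMM_NULL and take no further part */
    if (!rank_matches_split_type (comm, global[0], rank)) {
        *newcomm = MPI_COMM_NULL;
        return MPI_SUCCESS;
    }

    /* step 2: membership is known locally, so the intermediate communicator
     * is built in the original order without an allgather on comm */
    int *ranks = malloc (size * sizeof (int)), nmatch = 0;
    for (int r = 0 ; r < size ; ++r) {
        if (rank_matches_split_type (comm, global[0], r)) {
            ranks[nmatch++] = r;
        }
    }

    MPI_Group group, newgroup;
    MPI_Comm intermediate;
    MPI_Comm_group (comm, &group);
    MPI_Group_incl (group, nmatch, ranks, &newgroup);
    MPI_Comm_create_group (comm, newgroup, 0, &intermediate);

    /* step 3: a small allgather on the intermediate communicator checks
     * (1) the ordering of keys and (2) whether anyone passed MPI_UNDEFINED */
    int isize, mine[2] = { key, MPI_UNDEFINED == split_type };
    MPI_Comm_size (intermediate, &isize);
    int *all = malloc (2 * isize * sizeof (int));
    MPI_Allgather (mine, 2, MPI_INT, all, 2, MPI_INT, intermediate);

    bool need_split = false;
    for (int i = 0 ; i < isize && !need_split ; ++i) {
        need_split = (i > 0 && all[2 * i] < all[2 * (i - 1)]) || all[2 * i + 1];
    }

    if (need_split) {
        /* fall back to a regular split of the (much smaller) communicator */
        int color = (MPI_UNDEFINED == split_type) ? MPI_UNDEFINED : 0;
        MPI_Comm_split (intermediate, color, key, newcomm);
        MPI_Comm_free (&intermediate);
    } else {
        *newcomm = intermediate;
    }

    MPI_Group_free (&group);
    MPI_Group_free (&newgroup);
    free (ranks);
    free (all);
    return MPI_SUCCESS;
}
```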
force-pushed from 80d1856 to 4c49c42
Working its way through the PMIx approval process: openpmix/openpmix#114
@hjelmn nice catch, this patch indeed improves the case where no processes are using MPI_UNDEFINED. 👍
This is built on top of the refactor work in #1855.