Improve MPI_Comm_split_type scalability #1873
Conversation
@bosilca The idea behind this change is that most programs are not going to either reorder or drop ranks (with MPI_UNDEFINED). We can optimize the most common case and leave out the global allgather operation.
There is a slight slowdown for small communicators when reordering (dropping ranks should behave similarly), but overall the performance looks a lot better than with the old algorithm. This benchmark runs MPI_Comm_split_type in a loop 1000 times and takes the average time to complete. The lines without reordering all give …
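For reference, a minimal sketch of this kind of micro-benchmark, assuming MPI_COMM_TYPE_SHARED as the split type and key = rank for the no-reordering case (the actual benchmark setup may differ):

```c
#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
    const int iterations = 1000;   /* assumed iteration count from the comment */
    MPI_Comm newcomm;
    int rank;
    double start, avg;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

    MPI_Barrier (MPI_COMM_WORLD);
    start = MPI_Wtime ();

    for (int i = 0 ; i < iterations ; ++i) {
        /* key = rank keeps the original order (the no-reordering case) */
        MPI_Comm_split_type (MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                             MPI_INFO_NULL, &newcomm);
        MPI_Comm_free (&newcomm);
    }

    avg = (MPI_Wtime () - start) / iterations;
    if (0 == rank) {
        printf ("average MPI_Comm_split_type time: %g s\n", avg);
    }

    MPI_Finalize ();
    return 0;
}
```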
I'm not sure why there is concern over the performance of comm_split, but @hjelmn did raise a question with me about the time lost in looking up hostnames when that operation is tagged as "optional". Looking into it, the time appears to be spent performing several hash table lookups because we don't have general rules about naming conventions. Specifically, we could improve this algorithm by:
I cannot see a way to eliminate more than this, but the combination should reduce the time by nearly 70%, and I don't think these rules would be particularly onerous. Is this worth pursuing?
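For illustration, a hedged sketch of one possible mitigation: cache the hostname per peer so the repeated hash-table lookups happen only once. PMIx_Get and PMIX_HOSTNAME are the real client API; the peer_cache structure and peer_hostname helper are hypothetical, and error handling is elided.

```c
#include <pmix.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical per-peer cache entry kept alongside the proc structure */
struct peer_cache {
    char *hostname;   /* NULL until the first lookup */
};

static const char *peer_hostname (struct peer_cache *cache,
                                  const pmix_proc_t *peer)
{
    pmix_value_t *val = NULL;

    if (NULL == cache->hostname) {
        /* pay the hash-table lookup cost once, then reuse the result */
        if (PMIX_SUCCESS == PMIx_Get (peer, PMIX_HOSTNAME, NULL, 0, &val)) {
            cache->hostname = strdup (val->data.string);
            PMIX_VALUE_RELEASE (val);
        }
    }

    return cache->hostname;
}
```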
This is comm_split_type, so it is a little more important to optimize it. At a minimum, it is not a great idea to do an allgather on a large communicator when any rank only needs to know the result on a small subset of ranks. BTW, in order to do a fair comparison of old vs. new I removed the modex receive for the old algorithm as well. On master, no ompi_proc_t means not node-local.
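A minimal sketch of that locality shortcut, assuming the behavior stated above (a peer with no ompi_proc_t cannot be node-local); the lookup and flag helpers below are hypothetical stand-ins for the Open MPI internals:

```c
#include <stdbool.h>
#include <stddef.h>

/* assumed stand-ins for Open MPI internals; real names/signatures differ */
struct ompi_proc_t;
extern struct ompi_proc_t *proc_lookup_no_allocate (int peer_rank);
extern bool proc_is_on_local_node (struct ompi_proc_t *proc);

static bool peer_is_node_local (int peer_rank)
{
    /* look up the proc without triggering a modex receive or allocation */
    struct ompi_proc_t *proc = proc_lookup_no_allocate (peer_rank);

    /* no ompi_proc_t yet means the peer was never discovered locally,
     * which on master implies it is not on this node */
    if (NULL == proc) {
        return false;
    }

    return proc_is_on_local_node (proc);
}
```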
Okay, I'll propose adding those requirements to the PMIx specification; if nobody objects, I can update the algorithm.
force-pushed from ce77d0f to 80d1856
This commit simplifies the communicator context ID generation by removing the blocking code. The high-level calls ompi_comm_nextcid and ompi_comm_activate remain but now call the non-blocking variants and wait on the resulting request. This was done to remove the parallel paths for context ID generation in preparation for further improvements to the CID generation code.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
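A hedged sketch of the pattern this commit describes: the blocking entry point now just starts the non-blocking variant and waits on the resulting request. The function names follow the commit message, but the signatures here are simplified assumptions, not the real Open MPI prototypes.

```c
struct ompi_communicator_t;
struct ompi_request_t;

/* assumed simplified prototypes; the real ones carry more arguments */
extern int ompi_comm_nextcid_nb (struct ompi_communicator_t *newcomm,
                                 struct ompi_communicator_t *comm,
                                 struct ompi_request_t **req);
extern int wait_on_request (struct ompi_request_t *req);

/* blocking wrapper: start the non-blocking CID negotiation, then wait */
int ompi_comm_nextcid (struct ompi_communicator_t *newcomm,
                       struct ompi_communicator_t *comm)
{
    struct ompi_request_t *req;
    int rc;

    rc = ompi_comm_nextcid_nb (newcomm, comm, &req);
    if (0 != rc) {   /* OMPI_SUCCESS == 0 */
        return rc;
    }

    return wait_on_request (req);
}
```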
This commit introduces a new algorithm for MPI_Comm_split_type. The old algorithm performed an allgather on the communicator to decide which processes were part of the new communicators. This does not scale well in either time or memory.

The new algorithm performs a couple of allreduces to determine the global parameters of the MPI_Comm_split_type call. If any rank gives an inconsistent split_type (as defined by the standard) an error is returned without proceeding further. The algorithm then creates a communicator with all the ranks that match the split_type (no communication required) in the same order as the original communicator. It then does an allgather on the new communicator (which should be much smaller) to determine 1) whether the new communicator is in the correct order, and 2) whether any ranks in the new communicator supplied MPI_UNDEFINED as the split_type. If either condition is detected, the new communicator is split using ompi_comm_split and the intermediate communicator is freed.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
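To make the control flow concrete, here is a hedged sketch written against the public MPI interface rather than the Open MPI internals. rank_matches_split_type() is a hypothetical stand-in for the purely local membership test (e.g. node locality from the proc flags), MPI_Comm_create_group stands in for the internal no-communication communicator creation, and error handling is elided; this is not the PR's actual code.

```c
#include <mpi.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* hypothetical local membership test (e.g. node locality from proc flags) */
extern bool rank_matches_split_type (MPI_Comm comm, int split_type, int rank);

int comm_split_type_sketch (MPI_Comm comm, int split_type, int key,
                            MPI_Comm *newcomm)
{
    int rank, size, contrib[2], global[2];

    MPI_Comm_rank (comm, &rank);
    MPI_Comm_size (comm, &size);

    /* step 1: one allreduce yields the max of split_type and (negated) the
     * min over all ranks that did not pass MPI_UNDEFINED; a mismatch means
     * the types are inconsistent as defined by the standard */
    if (MPI_UNDEFINED == split_type) {
        contrib[0] = contrib[1] = INT_MIN;          /* neutral under MPI_MAX */
    } else {
        contrib[0] = split_type;
        contrib[1] = -split_type;
    }
    MPI_Allreduce (contrib, global, 2, MPI_INT, MPI_MAX, comm);
    if (INT_MIN == global[0]) {                     /* everyone undefined */
        *newcomm = MPI_COMM_NULL;
        return MPI_SUCCESS;
    }
    if (global[0] != -global[1]) {
        return MPI_ERR_ARG;
    }

    /* non-members return MPI_COMM_NULL and take no further part */
    if (!rank_matches_split_type (comm, global[0], rank)) {
        *newcomm = MPI_COMM_NULL;
        return MPI_SUCCESS;
    }

    /* step 2: membership is known locally, so the intermediate communicator
     * is built in the original order without an allgather on comm */
    int *ranks = malloc (size * sizeof (int)), nmatch = 0;
    for (int r = 0 ; r < size ; ++r) {
        if (rank_matches_split_type (comm, global[0], r)) {
            ranks[nmatch++] = r;
        }
    }

    MPI_Group group, newgroup;
    MPI_Comm intermediate;
    MPI_Comm_group (comm, &group);
    MPI_Group_incl (group, nmatch, ranks, &newgroup);
    MPI_Comm_create_group (comm, newgroup, 0, &intermediate);

    /* step 3: a small allgather on the intermediate communicator checks
     * (1) the ordering of keys and (2) whether anyone passed MPI_UNDEFINED */
    int isize, mine[2] = { key, MPI_UNDEFINED == split_type };
    MPI_Comm_size (intermediate, &isize);
    int *all = malloc (2 * isize * sizeof (int));
    MPI_Allgather (mine, 2, MPI_INT, all, 2, MPI_INT, intermediate);

    bool need_split = false;
    for (int i = 0 ; i < isize && !need_split ; ++i) {
        need_split = (i > 0 && all[2 * i] < all[2 * (i - 1)]) || all[2 * i + 1];
    }

    if (need_split) {
        /* fall back to a regular split of the (much smaller) communicator */
        int color = (MPI_UNDEFINED == split_type) ? MPI_UNDEFINED : 0;
        MPI_Comm_split (intermediate, color, key, newcomm);
        MPI_Comm_free (&intermediate);
    } else {
        *newcomm = intermediate;
    }

    MPI_Group_free (&group);
    MPI_Group_free (&newgroup);
    free (ranks);
    free (all);
    return MPI_SUCCESS;
}
```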
force-pushed from 80d1856 to 4c49c42
Working its way through the PMIx approval process: openpmix/openpmix#114
@hjelmn nice catch, this patch indeed improves the case where no processes are using MPI_UNDEFINED. 👍
This is built on top of the refactor work in #1855.