MPI_Comm_create performance

## Background information
We are observing performance problems related to `MPI_Comm_create()` in the Amber application (http://ambermd.org/).

### Open MPI version
v1.10, v2.x, and I believe others as well

## Reproducer
I've created a simple reproducer (https://github.com/artpol84/poc/tree/master/benchmarks/comm_create) to ease problem resolution.
I ran my evaluations on 1K procs. In my case the Amber pattern was the creation of many small communicators (3 procs per comm); see https://github.com/artpol84/poc/blob/master/benchmarks/comm_create/pattern.h for details on which ranks are in each communicator.
**NOTE**: you need to run this reproducer on at least 1024 procs unless you modify pattern.h (as written, it tries to create communicators using ranks up to 1K).

## Root cause
While investigating this problem I found that the algorithm used in OMPI handles this pattern poorly: OMPI took ~100 times longer than MPICH to execute my benchmark.
Currently OMPI uses the following algorithm for `MPI_Comm_create` (I'm listing only the relevant steps):
```
0. start = lowest_local_CID; // lowest locally available Comm ID (CID)
1. local_next_cid = next_free_CID(start); 
2. nextcid = Allreduce(old_comm, local_next_cid, MPI_MAX)
3. if( !LOCALLY_USED(nextcid) ){
       local_response = 1;
   } else {
       local_response = 0;
   }
4. response = Allreduce(COMM, local_response, MPI_MIN)
5. if( 1 == response ) {
       goto 6;
   } else {
       start++;
       goto 1;
   }
6. new_comm->CID = nextcid;
7. Done
```
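To make the retry behaviour easier to see, here is a minimal Python sketch that simulates the steps above without MPI. It is not OMPI code: the helper names (`next_free_cid`, `agree_on_cid`, `simulate`) are made up for illustration, the two Allreduce calls are modelled as plain `max`/`min` over per-process values, and CID 0 stands in for the parent communicator. It also assumes that ranks which do not end up in the new communicator release the agreed CID, which is what lets the local CID tables diverge between processes.

```python
def next_free_cid(used, start):
    """Step 1: first CID >= start that is not in this process's used set."""
    cid = start
    while cid in used:
        cid += 1
    return cid


def agree_on_cid(used_per_proc):
    """Run steps 0-6 for one communicator creation.

    used_per_proc: list of sets, one set of locally used CIDs per process.
    Returns (agreed_cid, number_of_iterations).
    """
    # Step 0: every process starts from its lowest locally available CID.
    start = [next_free_cid(used, 0) for used in used_per_proc]
    iterations = 0
    while True:
        iterations += 1
        # Step 1: each process proposes its next free CID.
        local_next = [next_free_cid(u, s) for u, s in zip(used_per_proc, start)]
        # Step 2: Allreduce(MPI_MAX), modelled as max().
        nextcid = max(local_next)
        # Steps 3-4: Allreduce(MPI_MIN) over "is nextcid free locally?".
        response = min(0 if nextcid in u else 1 for u in used_per_proc)
        if response == 1:
            return nextcid, iterations  # Step 6
        # Step 5, failure branch: start++ on every process, then retry.
        start = [s + 1 for s in start]


def simulate(num_procs, groups):
    """Create one communicator per group; return [(cid, iterations), ...]."""
    used_per_proc = [{0} for _ in range(num_procs)]  # CID 0: parent comm
    results = []
    for members in groups:
        cid, iterations = agree_on_cid(used_per_proc)
        # Assumption: only members keep the CID; other ranks release it.
        for rank in members:
            used_per_proc[rank].add(cid)
        results.append((cid, iterations))
    return results


if __name__ == "__main__":
    # Three disjoint 3-rank communicators on 9 simulated processes.
    for cid, iterations in simulate(9, [(0, 1, 2), (3, 4, 5), (6, 7, 8)]):
        print(f"cid={cid} iterations={iterations}")
```

In this small example the third creation already needs a second round, because ranks 3-5 hold a CID that ranks 0-2 propose as their lowest free one. The group layout here is only a toy; it is not the pattern from pattern.h.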
I profiled this and found that, for the provided pattern, the number of iterations of this algorithm keeps growing. It ended up doing **479 iterations for the 1079th** `MPI_Comm_create()` call.
To estimate the best performance we can achieve, I hacked OMPI v1.10 as follows:
https://github.com/artpol84/poc/blob/master/benchmarks/comm_create/ompi_hack.patch

## Measurements
Here are the relative numbers that I've collected with this benchmark:
* MPICH - 1 (MPICH time is taken as 1)
* OMPI/orig - 106 ( `time(OMPI/orig) / time(MPICH)` )
* OMPI/hack - 1.65 ( `time(OMPI/hack) / time(MPICH)` )

So although the hacked OMPI is still slower than MPICH, the numbers are at least comparable.
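
As a quick sanity check on these ratios, the snippet below derives how much faster the hacked build is than the original one. It uses only the three relative numbers listed above; the dictionary and the `speedup` helper are illustrative names, not part of the benchmark.

```python
# Relative run times from the measurements above (MPICH normalised to 1).
relative_time = {"MPICH": 1.0, "OMPI/orig": 106.0, "OMPI/hack": 1.65}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow`, from relative times."""
    return relative_time[slow] / relative_time[fast]


if __name__ == "__main__":
    print(f"hack vs orig: {speedup('OMPI/orig', 'OMPI/hack'):.1f}x")
```

So the hack is roughly 64 times faster than the original code on this benchmark, while remaining 1.65 times slower than MPICH.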

## Future work
I've created an experimental PR https://github.com/open-mpi/ompi/pull/3367 to study ways of improving the algorithm.

## v1.10 vs v2.x
I also noticed that starting from v2.x we use a non-blocking version of nextcid. It is not clear how this can improve things, as `MPI_Comm_create()` is blocking and there is a wait anyway.
From the performance perspective, v2.x is about 1.2 times slower than v1.10.
