
MPI_Comm_create performance #3368

Closed
@artpol84

Background information

We are observing performance problems related to MPI_Comm_create() in the Amber application (http://ambermd.org/).

Open MPI version

v1.10, v2.x, and I believe others as well

Reproducer

To ease problem resolution I've created a simple reproducer that can be found here: https://github.com/artpol84/poc/tree/master/benchmarks/comm_create.
I ran my evaluations on 1K procs. In my case the Amber pattern was the creation of lots of small communicators (3 procs per comm); see https://github.com/artpol84/poc/blob/master/benchmarks/comm_create/pattern.h for detailed info about which ranks are in each communicator.
NOTE: you need to run this reproducer on at least 1024 procs unless you modify pattern.h (as it stands, it will try to create communicators using ranks up to 1K).

Root cause

While investigating this problem I found that the algorithm used in OMPI copes badly with this pattern: OMPI took ~100 times longer than MPICH to execute my benchmark.
Currently OMPI uses the following algorithm for MPI_Comm_create (I'm listing only the relevant steps):

0. start = lowest_local_CID; // lowest locally available Comm ID (CID)
1. local_next_cid = next_free_CID(start);
2. nextcid = Allreduce(old_comm, local_next_cid, MPI_MAX)
3. if( !LOCALLY_USED(nextcid) ){
       local_response = 1;
   } else {
       local_response = 0;
   }
4. response = Allreduce(COMM, local_response, MPI_MIN)
5. if( 1 == response ) {
       goto 6;
   } else {
       start++;
       goto 1;
   }
6. new_comm->CID = nextcid;
7. Done

I profiled this and found that, for the provided pattern, the number of iterations of this algorithm (steps 1-5) keeps growing with each call. It ended up doing 479 iterations for the 1079th MPI_Comm_create().
To verify the best perf that we can achieve I hacked OMPI v1.10 as follows:
https://github.com/artpol84/poc/blob/master/benchmarks/comm_create/ompi_hack.patch

Measurements

Here are the relative numbers that I've collected with this benchmark (execution time normalized to MPICH):

  • MPICH - 1 (MPICH time taken as 1)
  • OMPI/orig - 106 ( time(OMPI/orig) / time(MPICH) )
  • OMPI/hack - 1.65 ( time(OMPI/hack) / time(MPICH) )

So although OMPI with the hack is still slower than MPICH, the numbers are at least comparable.

Future work

I've created an experimental PR #3367 to study ways of improving the algorithm.

v1.10 vs v2.x

I also noticed that starting from v2.x we use a non-blocking version of nextcid. It is not clear how this can improve things, as MPI_Comm_create() is blocking and there is a wait anyway.
From the perf perspective, v2.x is about 1.2 times slower than v1.10.
