Show MPI connectivity map during MPI_INIT #30

Closed

Description

@ompiteam

It has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.


OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.

But some users want a "positive" acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.

Complications:

  • In some cases, OMPI doesn't know which networks will be used for
    communications with each MPI process peer; we only know which ones
    we'll try to use when connections are actually established (per
    OMPI's lazy connection model for the OB1 PML). But I think that
    even outputting this information would be useful.
  • Connectivity between MPI processes is likely to be non-uniform.
    E.g., MCW rank 0 may use the sm BTL to communicate with some MPI
    processes, but a different BTL to communicate with others. This is
    almost certainly a different view than other processes have. The
    connectivity information needs to be conveyed on a process-pair
    basis (e.g., a 2D chart).
  • Since we have to span multiple PMLs, this may require an addition
    to the PML API (see the sketch after this list).
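
For the last complication, one possible shape for a PML-level query is
sketched below. This is purely hypothetical: the type name, field, and
signature are invented for illustration and are not part of the real
mca_pml_base_module_t.

#include <stddef.h>

/* Hypothetical sketch only: a query entry point that each PML (OB1, CM, ...)
 * could implement so a generic layer can ask "how would you reach this
 * peer?" without knowing about BTLs vs. MTLs.  Name and signature invented. */
struct ompi_proc_t;    /* existing OMPI process handle, opaque here */

typedef int (*mca_pml_base_module_get_transport_fn_t)(
    struct ompi_proc_t *peer,    /* which MCW peer we are asking about    */
    char *name,                  /* out: e.g. "sm", "tcp:eth0", "MX MTL"  */
    size_t name_len);            /* size of the caller-supplied buffer    */

/* The function pointer would be added to the PML module struct.  OB1 would
 * answer with the BTL(s) selected for that peer's endpoint, while CM would
 * answer with the MTL name (which is the same for every peer). */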

A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like the following (a 6-process OB1 job, 2 processes on each of 3 hosts):

MCW rank 0     1     2     3     4     5
0        self  sm    tcp   tcp   tcp   tcp
1        sm    self  tcp   tcp   tcp   tcp
2        tcp   tcp   self  sm    tcp   tcp
3        tcp   tcp   sm    self  tcp   tcp
4        tcp   tcp   tcp   tcp   self  sm
5        tcp   tcp   tcp   tcp   sm    self
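
For illustration, here is a rough user-level sketch of how such a chart
could be assembled and printed from MCW rank 0. It is only a sketch: a real
implementation would run inside MPI_INIT and use the modex rather than an
MPI collective, and query_transport_to() below is an invented placeholder
for asking the PML which transport it expects to use for a peer.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 16    /* fixed width of one transport-name entry */

/* Invented placeholder: ask which transport we expect to use to reach
 * 'peer'.  In a real implementation this would come from the PML (see the
 * hypothetical pml_get_transport sketch above); here it fakes answers so
 * the example is runnable. */
static void query_transport_to(int myrank, int peer, char *name)
{
    strncpy(name, (peer == myrank) ? "self" : "tcp", NAME_LEN);
    name[NAME_LEN - 1] = '\0';
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process fills in one row: the transport it would use to reach
     * every peer in MPI_COMM_WORLD. */
    char *row = malloc((size_t)size * NAME_LEN);
    for (int peer = 0; peer < size; peer++) {
        query_transport_to(rank, peer, row + peer * NAME_LEN);
    }

    /* Gather all rows at MCW rank 0 and print the 2D chart. */
    char *map = (0 == rank) ? malloc((size_t)size * size * NAME_LEN) : NULL;
    MPI_Gather(row, size * NAME_LEN, MPI_CHAR,
               map, size * NAME_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (0 == rank) {
        printf("MCW rank ");
        for (int c = 0; c < size; c++) printf("%-10d", c);
        printf("\n");
        for (int r = 0; r < size; r++) {
            printf("%-9d", r);
            for (int c = 0; c < size; c++) {
                printf("%-10s", map + (r * size + c) * NAME_LEN);
            }
            printf("\n");
        }
        free(map);
    }

    free(row);
    MPI_Finalize();
    return 0;
}

The fixed-width strings keep the gather trivial; a real implementation would
presumably exchange something more structured via the modex.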

Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
However, multiple built-in output formats could be useful, such as:

  • Human readable, full map (see above)
  • Human readable, abbreviated (see below for some ideas on this)
  • Machine parsable, full map (a small sketch follows this list)
  • Machine parsable, abbreviated
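
For the machine-parsable full map, one record per process pair is probably
the easiest thing for MTT or other scripts to diff between runs. A small
sketch, reusing the gathered fixed-width map buffer from the example above;
the "conn,src,dst,transport" record layout is just an illustration, not a
proposed format.

#include <stdio.h>

#define NAME_LEN 16    /* must match the width used when the map was gathered */

/* Machine parsable, full map: one comma-separated record per process pair,
 * printed by MCW rank 0 from the gathered size*size buffer of fixed-width
 * transport names. */
static void print_machine_parsable(const char *map, int size)
{
    for (int src = 0; src < size; src++) {
        for (int dst = 0; dst < size; dst++) {
            printf("conn,%d,%d,%s\n",
                   src, dst, map + (src * size + dst) * NAME_LEN);
        }
    }
}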

It may also be worthwhile to investigate a few heuristics to compress
the map where possible. Some random ideas in this direction:

  • The above example could be represented as:
    MPI connectivity map, listed by process:
    X->X: self
    X<->X+1, X in {0,2,4}: sm
    other: tcp
  • Another example:
    MPI connectivity map, listed by process:
    X->X: self
    other: tcp
  • Another example:
    MPI connectivity map, listed by process:
    all: CM PML, MX MTL
  • Perhaps something could be done with "exceptions" -- e.g., where
    the openib BTL is being used for inter-node connectivity except
    for one node (where IB is malfunctioning, and OMPI fell back to
    TCP) -- this is a common case that users/sysadmins want to detect.
    A rough sketch of this heuristic follows the list.
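
The "exceptions" idea could be prototyped with a very naive heuristic: pick
the most common off-diagonal entry in the gathered map, report it once, and
list only the pairs that differ. A rough sketch under those assumptions
(note that in the sm/tcp example above the on-node sm pairs would be listed
as exceptions too, so a smarter version would probably group by node first):

#include <stdio.h>
#include <string.h>

#define NAME_LEN 16    /* must match the width used when the map was gathered */

/* Abbreviated output with "exceptions": report the dominant off-diagonal
 * transport once, then list only the process pairs that deviate from it
 * (e.g. one node whose IB ports are down and fell back to TCP).  Naive
 * O(n^4) frequency scan; fine for a sketch, not for production. */
static void print_with_exceptions(const char *map, int size)
{
    const char *common = NULL;
    int best = 0;

    for (int i = 0; i < size * size; i++) {
        if (i / size == i % size) continue;            /* skip the diagonal */
        const char *cand = map + i * NAME_LEN;
        int count = 0;
        for (int j = 0; j < size * size; j++) {
            if (j / size == j % size) continue;
            if (0 == strncmp(cand, map + j * NAME_LEN, NAME_LEN)) count++;
        }
        if (count > best) { best = count; common = cand; }
    }

    printf("MPI connectivity map, listed by process:\n");
    printf("X->X: self\n");
    printf("other: %s\n", common ? common : "(none)");

    /* List only the pairs that differ from the dominant transport. */
    for (int src = 0; src < size; src++) {
        for (int dst = 0; dst < size; dst++) {
            if (src == dst) continue;
            const char *entry = map + (src * size + dst) * NAME_LEN;
            if (common && 0 != strncmp(entry, common, NAME_LEN)) {
                printf("exception: %d -> %d: %s\n", src, dst, entry);
            }
        }
    }
}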

Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:

MCW rank 0         1         2         3         4         5
0        self      sm        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
1        sm        self      tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0
2        tcp:eth0  tcp:eth0  self      sm        tcp:eth0  tcp:eth0
3        tcp:eth0  tcp:eth0  sm        self      tcp:eth0  tcp:eth0
4        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  self      sm
5        tcp:eth0  tcp:eth0  tcp:eth0  tcp:eth0  sm        self
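
One way to carry this extra detail is to gather a small per-pair endpoint
descriptor instead of a bare transport-name string. A hypothetical layout
(every field name and size here is invented for illustration):

#include <sys/socket.h>

/* Hypothetical per-pair entry for the connectivity map, gathered to MCW
 * rank 0 in place of the fixed-width name strings used in the sketches
 * above. */
struct conn_map_entry {
    char component[8];            /* "self", "sm", "tcp", "openib", ...     */
    char ifname[16];              /* "eth0", "ib0", ... (empty if N/A)      */
    struct sockaddr_storage addr; /* endpoint address, when one exists      */
    int  num_rails;               /* >1 when multi-rail striping is active  */
};

Rank 0 could then render either the tcp:eth0-style chart above or the
compressed forms from the gathered descriptors.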

With more information such as interface names, compression of the
output becomes much more important; for example:

MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1

Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.
