Description
This has long been discussed, and I swear there was a ticket about this
at some point but I can't find it now. So I'm filing a new one --
close this as a dupe if someone can find an older one.
OMPI currently uses a negative ACK system to indicate if high-speed
networks are not used for MPI communications. For example, if you
have the openib BTL available but it can't find any active ports in a
given MPI process, it'll display a warning message.
But some users want a ''positive'' acknowledgement of what networks
are being used for MPI communications (this can also help with
regression testing, per a thread on the MTT mailing list). HP MPI
offers this feature, for example. It would be nice to have a simple
MCA parameter that will cause MCW rank 0 to output a connectivity map
during MPI_INIT.
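One possible shape for that knob (just a sketch; "mpi_show_connectivity" is a made-up name, and a real implementation would register it through the normal MCA parameter system rather than using a raw getenv()). Since any MCA parameter can also be set through an OMPI_MCA_<name> environment variable, the check is conceptually just:
{{{
/* Hedged sketch: "mpi_show_connectivity" is a hypothetical parameter
 * name.  A real implementation would register it with the MCA parameter
 * system; reading the OMPI_MCA_<name> environment variable here is only
 * a stand-in to show the user-visible knob. */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static bool show_connectivity_requested(void)
{
    const char *val = getenv("OMPI_MCA_mpi_show_connectivity");
    return (NULL != val && 0 != strcmp(val, "0"));
}
}}}
Users would normally set it on the mpirun command line with --mca (or in an MCA parameter file), like any other parameter.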
Complications:
- In some cases, OMPI doesn't know which networks will be used for
communications with each MPI process peer; we only know which ones
we'll try to use, since connections aren't actually established until
they're needed (per OMPI's lazy connection model for the OB1 PML). But
I think that even outputting this information will be useful.
- Connectivity between MPI processes is likely to be non-uniform.
E.g., MCW rank 0 may use the sm btl to communicate with some MPI
processes, but a different btl to communicate with others. This is
almost certainly a different view than other processes have. The
connectivity information needs to be conveyed on a process-pair
basis (e.g., a 2D chart).
- Since we have to span multiple PMLs, this may require an addition
to the PML API (a rough sketch of one possibility is below).
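Purely as illustration, such an addition could look like the following; the typedef below does not exist in OMPI today and is only meant to show the idea of a per-peer "describe your connectivity" query:
{{{
/* Hypothetical sketch of a PML API addition: a per-peer query that both
 * OB1 (BTL-based) and CM (MTL-based) PMLs could implement.  Nothing
 * here exists in OMPI today. */
#include <stddef.h>

struct ompi_proc_t;   /* opaque here; the real type is in ompi/proc/proc.h */

/* Fill "desc" with a short human-readable label for the transport(s)
 * this PML expects to use to reach "peer", e.g. "sm", "tcp:eth0",
 * "MX MTL". */
typedef int (*mca_pml_base_module_connectivity_fn_t)(
    struct ompi_proc_t *peer,
    char *desc,
    size_t desc_len);

/* The function pointer would be added to mca_pml_base_module_t so the
 * map-printing code doesn't need to know which PML is in use. */
}}}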
A first cut could display a simple 2D chart of how OMPI thinks it may
send MPI traffic from each process to each process. Perhaps something
like (OB1 6 process job, 2 processes on each of 3 hosts):
MCW rank  0         1         2         3         4         5
0         self      sm        tcp       tcp       tcp       tcp
1         sm        self      tcp       tcp       tcp       tcp
2         tcp       tcp       self      sm        tcp       tcp
3         tcp       tcp       sm        self      tcp       tcp
4         tcp       tcp       tcp       tcp       self      sm
5         tcp       tcp       tcp       tcp       sm        self
Note that the upper and lower triangular portions of the map are the
same, but it's probably more human-readable if both are output.
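As a rough, stand-alone illustration (not OMPI internals) of how such a chart could be produced: each process fills in one row of fixed-width labels, MCW rank 0 gathers the rows, and prints the table. peer_transport() below is a stand-in for whatever per-peer query the PML eventually provides; its "two ranks per node" guess just mimics the example above.
{{{
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LABEL_LEN 16   /* fixed-width cell, e.g. "self", "sm", "tcp:eth0" */

/* Stand-in for a real PML/BTL query: pretend ranks 2i and 2i+1 share a
 * node (sm) and everything else goes over tcp, as in the example above. */
static void peer_transport(int me, int peer, char *label)
{
    if (peer == me)              snprintf(label, LABEL_LEN, "self");
    else if (peer / 2 == me / 2) snprintf(label, LABEL_LEN, "sm");
    else                         snprintf(label, LABEL_LEN, "tcp");
}

int main(int argc, char **argv)
{
    int me, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Each process builds its own row of the map... */
    char *row = malloc((size_t)np * LABEL_LEN);
    for (int peer = 0; peer < np; ++peer)
        peer_transport(me, peer, row + peer * LABEL_LEN);

    /* ...and MCW rank 0 gathers all rows and prints the chart. */
    char *map = (0 == me) ? malloc((size_t)np * np * LABEL_LEN) : NULL;
    MPI_Gather(row, np * LABEL_LEN, MPI_CHAR,
               map, np * LABEL_LEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (0 == me) {
        printf("MCW rank");
        for (int peer = 0; peer < np; ++peer) printf(" %8d", peer);
        printf("\n");
        for (int r = 0; r < np; ++r) {
            printf("%8d", r);
            for (int peer = 0; peer < np; ++peer)
                printf(" %8s", map + ((size_t)r * np + peer) * LABEL_LEN);
            printf("\n");
        }
        free(map);
    }
    free(row);
    MPI_Finalize();
    return 0;
}
}}}
In a real implementation this would hang off MPI_INIT (gated by the MCA parameter) instead of being a separate program.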
However, multiple built-in output formats could be useful, such as:
- Human readable, full map (see above)
- Human readable, abbreviated (see below for some ideas on this)
- Machine parsable, full map
- Machine parsable, abbreviated
It may also be worthwhile to investigate a few heuristics to compress
the graph where possible. Some random ideas in this direction:
- The above example could be represented as:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp
- Another example:
MPI connectivity map, listed by process:
X->X: self
other: tcp
- Another example:
MPI connectivity map, listed by process:
all: CM PML, MX MTL
- Perhaps something could be done with "exceptions" -- e.g., where
the openib BTL is being used for inter-node connectivity ''except''
for one node (where IB is malfunctioning, and OMPI fell back to
TCP) -- this is a common case that users/sysadmins want to detect.
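As a strawman for the compression/exception ideas above, one row of the map could be reduced to its most common label plus a list of exceptions, roughly like this (output format made up for illustration):
{{{
#include <stdio.h>
#include <string.h>

/* Print one process's row of per-peer labels as "default + exceptions".
 * "row" holds np fixed-width, nul-terminated labels of label_len bytes
 * each (same layout as the gather sketch above). */
static void print_compressed_row(int me, int np,
                                 const char *row, int label_len)
{
    /* Find the most common label; O(np^2) is fine for this one-shot,
     * human-oriented output. */
    const char *common = row;
    int best = 0;
    for (int i = 0; i < np; ++i) {
        const char *li = row + (size_t)i * label_len;
        int count = 0;
        for (int j = 0; j < np; ++j)
            if (0 == strcmp(li, row + (size_t)j * label_len)) ++count;
        if (count > best) { best = count; common = li; }
    }

    printf("rank %d: %s, except:", me, common);
    int exceptions = 0;
    for (int i = 0; i < np; ++i) {
        const char *li = row + (size_t)i * label_len;
        if (0 != strcmp(li, common)) {
            printf(" %d->%s", i, li);
            ++exceptions;
        }
    }
    if (0 == exceptions) printf(" none");
    printf("\n");
}
}}}
A further pass could merge identical compressed rows across processes to get down to the hand-written summaries shown above.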
Another useful concept might be to show some information about each
endpoint in the connectivity map. E.g., show a list of TCP endpoints
on each process, by interface name and/or IP address. Similar for
other transports. This kind of information can show when/if
multi-rail scenarios are active, etc. For example:
MCW rank  0          1          2          3          4          5
0         self       sm         tcp:eth0   tcp:eth0   tcp:eth0   tcp:eth0
1         sm         self       tcp:eth0   tcp:eth0   tcp:eth0   tcp:eth0
2         tcp:eth0   tcp:eth0   self       sm         tcp:eth0   tcp:eth0
3         tcp:eth0   tcp:eth0   sm         self       tcp:eth0   tcp:eth0
4         tcp:eth0   tcp:eth0   tcp:eth0   tcp:eth0   self       sm
5         tcp:eth0   tcp:eth0   tcp:eth0   tcp:eth0   sm         self
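For the TCP case specifically, a "tcp:eth0,eth1" style label could be assembled from the local interfaces that are up and non-loopback; here's a minimal sketch using the standard getifaddrs(3) call (the real TCP BTL has its own interface selection and include/exclude parameters, so this is illustration only):
{{{
#include <ifaddrs.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Build a label like "tcp:eth0,eth1" from the local IPv4 interfaces
 * that are up and not loopback.  Illustration only; the TCP BTL's real
 * interface selection is more involved. */
static void tcp_endpoint_label(char *label, size_t len)
{
    struct ifaddrs *ifaces, *ifa;

    snprintf(label, len, "tcp:");
    if (0 != getifaddrs(&ifaces)) return;

    for (ifa = ifaces; NULL != ifa; ifa = ifa->ifa_next) {
        if (NULL == ifa->ifa_addr || AF_INET != ifa->ifa_addr->sa_family)
            continue;                         /* IPv4 interfaces only   */
        if (!(ifa->ifa_flags & IFF_UP) || (ifa->ifa_flags & IFF_LOOPBACK))
            continue;                         /* skip down/loopback     */
        size_t used = strlen(label);
        if (used + strlen(ifa->ifa_name) + 2 > len)
            break;                            /* label buffer is full   */
        if (':' != label[used - 1])
            strcat(label, ",");
        strcat(label, ifa->ifa_name);
    }
    freeifaddrs(ifaces);
}
}}}
A node with two active NICs would then show up naturally as something like "tcp:eth0,eth1" in the map, making multi-rail (or accidentally single-rail) setups easy to spot.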
With more information such as interface names, compressing the output
becomes much more important, e.g.:
MPI connectivity map, listed by process:
X->X: self
X<->X+1, X in {0,2,4}: sm
other: tcp:eth0,eth1
Note that these ideas can certainly be implemented in stages; there's
no need to do everything at once.