Accessing the HWLOC topology tree
There are several mechanisms by which OMPI may obtain an HWLOC topology, depending on the environment within which the application is executing and the method by which the application was started. As PMIx continues to roll out across the environment, the variations in how OMPI deals with the topology will hopefully simplify. In the interim, however, OMPI must deal with a variety of use-cases. This document attempts to capture those situations and explain how OMPI interacts with the topology.
Note: this document pertains to version 5.0 and above - while elements of the following discussion can be found in earlier OMPI versions, there may exist nuances that modify their application to that situation. In v5.0 and above, PRRTE is used as the OMPI RTE, and PRRTE (PMIx Reference RunTime Environment) is built with PMIx as its core foundation. Key to the discussion, therefore, is that OMPI v5.0 and above requires PRRTE 2.0 or above, which in turn requires PMIx v4.0.1 or above.
It is important to note that it is PMIx (and not PRRTE itself) that is often providing the HWLOC topology to the application. This is definitely the case for mpirun launch, and other environments have (so far) followed that model. If PMIx provides the topology, it will come in several forms (a retrieval sketch follows this list):
- if HWLOC 2.x or above is used, then the primary form will be via HWLOC's shmem feature. The shmem rendezvous information is provided in a set of three PMIx keys (PMIX_HWLOC_SHMEM_FILE, PMIX_HWLOC_SHMEM_ADDR, and PMIX_HWLOC_SHMEM_SIZE)
- if HWLOC 2.x or above is used, then PMIx will also provide the topology as a HWLOC v2 XML string via the PMIX_HWLOC_XML_V2 key. Although one could argue it is a duplication of information, it is provided by default to support environments where shmem may not be available or authorized between the server and client processes (more on that below)
- regardless of HWLOC version, PMIx also provides the topology as a HWLOC v1 XML string (via the PMIX_HWLOC_XML_V1 key) to support client applications that are linked against an older HWLOC version
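For illustration, the following is a minimal sketch (not OMPI's actual implementation) of how a PMIx v4+ client might retrieve the topology using those keys with HWLOC 2.x: it attempts the shmem rendezvous first and falls back to the v2 XML string. Querying against the wildcard rank and the minimal error handling are simplifying assumptions.

```c
/* Sketch only: fetch the topology rendezvous info that PMIx publishes and
 * adopt the shared-memory topology with HWLOC 2.x. Assumes a PMIx v4+ client
 * and HWLOC 2.x headers; error handling and lifecycle are deliberately minimal. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <pmix.h>
#include <hwloc.h>
#include <hwloc/shmem.h>

static hwloc_topology_t get_pmix_topology(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    hwloc_topology_t topo = NULL;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return NULL;
    }
    /* node-level data is commonly stored against the wildcard rank */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* preferred path: adopt the read-only shmem topology (three rendezvous keys) */
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_FILE, NULL, 0, &val)) {
        char *file = strdup(val->data.string);
        size_t addr = 0, size = 0;
        int fd;

        PMIX_VALUE_RELEASE(val);
        if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_ADDR, NULL, 0, &val)) {
            addr = val->data.size;
            PMIX_VALUE_RELEASE(val);
        }
        if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_SIZE, NULL, 0, &val)) {
            size = val->data.size;
            PMIX_VALUE_RELEASE(val);
        }
        fd = open(file, O_RDONLY);
        free(file);
        if (fd >= 0) {
            if (0 == hwloc_shmem_topology_adopt(&topo, fd, 0,
                                                (void *)(uintptr_t)addr, size, 0)) {
                return topo;   /* shared, read-only view of the server's topology */
            }
            close(fd);
        }
    }

    /* fallback: build a private copy from the v2 XML string */
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_HWLOC_XML_V2, NULL, 0, &val)) {
        hwloc_topology_init(&topo);
        hwloc_topology_set_xmlbuffer(topo, val->data.string,
                                     (int)strlen(val->data.string) + 1);
        hwloc_topology_load(topo);
        PMIX_VALUE_RELEASE(val);
    }
    return topo;
}
```

The shmem path gives every local process a shared, read-only view of a single topology instance, which is why it is the preferred delivery mechanism; the XML path instead instantiates a private copy within each process.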
Should none of those be available, or if the user has specified a topology file that is to be used in place of whatever the environment provides, then OMPI will either read the topology from the file or perform its own local discovery. The latter is highly discouraged as it leads to significant scaling issues (both in terms of startup time and memory footprint) on complex machines with many cores and multiple layers in their memory hierarchy.
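As a rough sketch of those two last-resort paths (assuming HWLOC 2.x; the topo_file argument is hypothetical and stands in for however the user supplied the file):

```c
/* Sketch of the fallback paths: load a user-supplied topology file if one
 * was given, otherwise fall back to local discovery. */
#include <hwloc.h>

static hwloc_topology_t load_fallback_topology(const char *topo_file)
{
    hwloc_topology_t topo;

    if (0 != hwloc_topology_init(&topo)) {
        return NULL;
    }
    if (NULL != topo_file) {
        /* a user-specified XML topology file overrides the environment */
        if (0 != hwloc_topology_set_xml(topo, topo_file)) {
            hwloc_topology_destroy(topo);
            return NULL;
        }
    }
    /* with no XML source set, hwloc_topology_load() performs local discovery -
     * the expensive path discouraged above for large, complex machines */
    if (0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return NULL;
    }
    return topo;
}
```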
Once the topology has been obtained, the next question one must consider is: what does that topology represent? Is it the topology assigned to the application itself (e.g., via cgroup or soft constraint)? Or is it the overall topology as seen by the base OS? OMPI is designed to utilize the former - i.e., it expects to see the topology assigned to the application, and thus considers any resources present in the topology to be available for its use. It is therefore important to be able to identify the scope of the topology, and to appropriately filter it when necessary.
Unfortunately, the question of when to filter depends upon the method of launch, and (in the case of direct launch) on the architecture of the host environment. Let's consider the various scenarios:
mpirun is always started at the user level. Therefore, both mpirun and its compute node daemons only "see" the topology made available to them by the base OS - i.e., whatever cgroup is being applied to the application has already been reflected in the HWLOC topology discovered by mpirun or the local compute node daemon. Thus, the topology provided by mpirun (regardless of the delivery mechanism) contains a full description of the resources available to that application.
Note that users can launch multiple mpirun applications in parallel within that same allocation, using an appropriate cmd line option (e.g., `--cpu-set`) to assign specific subsets of the overall allocation to each invocation. In this case (a soft resource assignment), the topology will have been filtered by each mpirun to reflect the subdivision of resources between invocations - no further processing is required.
The PRRTE DVM is essentially a persistent version of mpirun - it establishes and maintains a set of compute node daemons, each of which operates at the user level and "sees" the topology made available to them by the base OS. The topology they provide to their respective local clients is, therefore, fully constrained.
However, the DVM supports multiple parallel invocations of `prun`, each launching a separate application and potentially specifying a different soft resource assignment. The daemons cannot provide a different shmem version of the HWLOC topology for each application, leaving us with the following options in cases where soft assignments have been made:
- provide applications in this scenario with the topology via one of the other mechanisms (e.g., as a v2 XML string). The negative here is that each process then winds up with a complete instance of the topology tree, which can be fairly large (i.e., ~1MByte) for a complex system. Multiplied by significant ppn values, this represents a non-trivial chunk of system memory and is undesirable.
- provide applications with the "base" shmem topology along with their soft constraints. This requires that each application process "filter" the topology with its constraints. However, the topology in the shmem region is read-only - thus, each process would have to create "shadow" storage of the filtered results for its own use. In addition to the added code complexity, this again increases the footprint of the topology support within each process.
- have the daemon compute the OMPI-utilized values from the constrained topology using the soft allocation and provide those values to each process using PMIx. This is the method currently utilized by PRRTE/OMPI. The negative is that it requires pre-identifying the information OMPI might desire, which may change over time and according to the needs of specific applications. Extension of PRRTE/PMIx to cover ever broader ranges of use-cases, combined with fallback code paths in OMPI for when the information is not available, has been the chosen approach as it (a) minimizes the resource impact on the application, and (b) further encourages the OMPI/PMIx communities to extend their integration by clearly identifying the decision factors in making algorithm selections.
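From the application side, that third option looks roughly like the following sketch, assuming a PMIx v4+ client. PMIX_LOCAL_PEERS and PMIX_LOCALITY_STRING are standard PMIx keys used here as examples of daemon-computed values; the choice of peer rank is purely illustrative.

```c
/* Sketch of the "precomputed values" approach: rather than walking the
 * topology itself, the process asks PMIx for values the daemon already
 * derived from the (possibly soft-constrained) topology. */
#include <stdio.h>
#include <pmix.h>

static void show_precomputed_values(const pmix_proc_t *myproc)
{
    pmix_proc_t peer, wildcard;
    pmix_value_t *val = NULL;

    /* list of ranks sharing this node - stored against the wildcard rank */
    PMIX_LOAD_PROCID(&wildcard, myproc->nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_LOCAL_PEERS, NULL, 0, &val)) {
        printf("local peers: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    /* relative locality of a specific peer (rank 0 used for illustration) */
    PMIX_LOAD_PROCID(&peer, myproc->nspace, 0);
    if (PMIX_SUCCESS == PMIx_Get(&peer, PMIX_LOCALITY_STRING, NULL, 0, &val)) {
        printf("rank 0 locality: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }
}
```

Because these values were derived by the daemon from the constrained topology, the process can use them directly without holding its own filtered copy of the tree.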
In the case of direct launch (i.e., launch by a host-provided launcher such as srun for Slurm), the state of the provided topology depends upon the architecture of the launch environment. This typically falls into one of two cases:
- Per-job (step) daemon hosting PMIx server. This is the most common setup as it minimizes security issues and provides isolation between invocations. In this case, the daemon itself is subject to the per-application resource constraint (e.g., cgroup), and therefore the PMIx server will detect and provide the constrained topology to the local application processes. No further filtering is required.
- System daemon hosting PMIx server. This is less common in practice due to the security mismatch - the system daemon must operate at a privileged level, while the application is operating at a user level. However, there are scenarios where this may be permissible or even required. In such cases, the system daemon will expose an unfiltered view of the local topology to all applications executing on that node. This is essentially equivalent to the DVM launch mode described above, except that there is no guarantee that the host environment will provide all the information required by OMPI. Thus, it may be necessary to filter the topology in such cases.
Note that application processes are unlikely to receive topology information under direct launch in non-PMIx environments. In this case, each process must discover the topology for itself - the topology discovered in this manner will be fully constrained. No further filtering is required.
By definition, singletons execute without the support of any RTE. While technically they could connect to a system-level PMIx server, OMPI initializes application processes as PMIx "clients" and not "tools". Thus, the PMIx client library does not support discovery and connection to an arbitrary PMIx server - it requires that either the server identify itself via envars or that the application provide the necessary rendezvous information. Singletons, therefore, must discover the topology for themselves. If operating under external constraints (e.g., cgroups), the discovery will yield an appropriately constrained set of resources. No filtering is therefore required.
While there are flags to indicate if OMPI has been launched by its own RTE (whether mpirun or the DVM) versus direct launched, this in itself is not sufficient information to determine if the topology reflects the resources assigned to the application. The best method, therefore, is to:
- attempt to access the desired information directly from PMIx. In many cases (even in the direct launch scenario), all OMPI-required information will have been provided. This includes relative process locality and device (NIC and GPU) distances between each process and their local devices. If found, this information accurately reflects the actual resource utilization/availability for the application, thereby removing the need to directly access the topology itself.
- if the desired information is not available from PMIx, then one must turn to the topology for the answers. Differentiation should be made between the following cases:
  - launched via mpirun. This case is identified in the environment by the `OMPI_LAUNCHED=1` variable, thereby indicating (among other things) that the topology is fully constrained and no further filtering is required.
  - launched via DVM. This case is identified in the environment by the `PRTE_LAUNCHED=1` variable. The topology in this case may or may not be filtered. The unfiltered case will be accompanied by the `PMIX_CPU_LIST` value - if provided by PMIx, then the topology should be filtered using the provided list (this may require the use of "shadow" storage if the topology is accessed via a shmem region - which is the default when PRRTE is combined with HWLOC v2.x and above); a sketch of this filtering step appears below. If this key is not found, then you can assume that the provided topology is complete and no further filtering is required.
  - direct launched. The behavior of the host environment obviously lies outside of our control - thus, there are no environmental indicators we can rely upon to signal the state of the topology. However, the common case (per-job step daemons) does not require filtering - therefore, the only real issue is how to identify/handle the unusual (system-level daemon) scenario.
Unfortunately, systems generally do not identify their architecture (step vs. system daemon) - the user must independently obtain that knowledge and communicate it via an OMPI MCA parameter so that OMPI can adjust to the situation. In the absence of a `PMIX_CPU_LIST` value, there are no good methods for knowing how to filter the topology provided by a system-level daemon - if the system does not provide it, then the user may utilize an OMPI MCA parameter for this purpose. If the system daemon provides the topology in a shmem region, this will again create the need for "shadow" storage - it may instead be preferable to instantiate the topology (either via XML string or direct discovery) and apply the specified filtering in such scenarios.
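Pulling the above cases together, a sketch of the filtering decision might look like the following. It assumes the environment variables and the PMIX_CPU_LIST key behave as described above, and that the topology handle is a private (non-shmem) instance - a shmem-adopted topology is read-only and would require the "shadow" storage approach instead.

```c
/* Sketch of the filtering decision, under the assumptions stated above. */
#include <stdlib.h>
#include <pmix.h>
#include <hwloc.h>

static int maybe_filter_topology(hwloc_topology_t topo, const pmix_proc_t *myproc)
{
    pmix_value_t *val = NULL;
    hwloc_bitmap_t cpuset;
    int rc = 0;

    /* mpirun launch: topology is already fully constrained */
    if (NULL != getenv("OMPI_LAUNCHED")) {
        return 0;
    }

    /* DVM (PRTE_LAUNCHED=1) or direct launch: filter only if a cpu list exists */
    if (PMIX_SUCCESS != PMIx_Get(myproc, PMIX_CPU_LIST, NULL, 0, &val)) {
        return 0;   /* no cpu list: treat the provided topology as complete */
    }

    cpuset = hwloc_bitmap_alloc();
    if (0 == hwloc_bitmap_list_sscanf(cpuset, val->data.string)) {
        /* hwloc drops objects that no longer intersect the restricted cpuset */
        rc = hwloc_topology_restrict(topo, cpuset, 0);
    }
    hwloc_bitmap_free(cpuset);
    PMIX_VALUE_RELEASE(val);
    return rc;
}
```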
It is highly desirable that PMIx provide OMPI with all information that might be utilized for operational decisions. This not only ensures best support when operating under mpirun or the PRRTE DVM, but also enables that support to be available more broadly in host environments that support direct launch and include PMIx integration. Encouraging environment vendors to fully provide this support should, therefore, be pursued by both the community and the OMPI customer base.