Accessing the HWLOC topology tree

There are several mechanisms by which OMPI may obtain an HWLOC topology, depending on the environment within which the application is executing and the method by which the application was started. As PMIx adoption continues to spread across host environments, the variations in how OMPI obtains and handles the topology should gradually shrink. In the interim, however, OMPI must support a variety of use-cases. This document attempts to capture those situations and explain how OMPI interacts with the topology.

Note: this document pertains to OMPI version 5.0 and above - while elements of the following discussion can be found in earlier OMPI versions, there may be nuances that change how they apply in those releases. In v5.0 and above, PRRTE is used as the OMPI RTE, and PRRTE (the PMIx Reference RunTime Environment) is built with PMIx as its core foundation. Key to the discussion, therefore, is that OMPI v5.0 and above requires PRRTE 2.0 or above, which in turn requires PMIx v4.0.1 or above.

It is important to note that it is PMIx (and not PRRTE itself) that is often providing the HWLOC topology to the application. This is definitely the case for mpirun launch, and other environments have (so far) followed that model. If PMIx provides the topology, it will come in several forms:

  • if HWLOC 2.x or above is used, then the primary form will be via HWLOC's shmem feature. The shmem rendezvous information is provided in a set of three PMIx keys (PMIX_HWLOC_SHMEM_FILE, PMIX_HWLOC_SHMEM_ADDR, and PMIX_HWLOC_SHMEM_SIZE) - see the sketch after this list

  • if HWLOC 2.x or above is used, then PMIx will also provide the topology as an HWLOC v2 XML string. Although one could argue it is a duplication of information, it is provided by default to support environments where shmem may not be available or authorized between the server and client processes (more on that below)

  • regardless of HWLOC version, PMIx also provides the topology as an HWLOC v1 XML string to support client applications that are linked against an older HWLOC version
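
As a concrete illustration, here is a minimal sketch (not OMPI's actual internal code) of how a PMIx client might adopt the shared-memory topology using the three keys above. It assumes PMIx v4.x and HWLOC 2.x with shmem support, and that the keys carry the value types defined for them by PMIx (a string for the file path, size_t for the address and size); error handling is abbreviated.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <hwloc.h>
#include <hwloc/shmem.h>
#include <pmix.h>

/* Sketch: adopt the HWLOC topology exported by the local PMIx server
 * via its shmem segment. Error handling is abbreviated. */
static hwloc_topology_t adopt_shmem_topology(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val;
    hwloc_topology_t topo = NULL;
    char *file;
    uint64_t addr;
    size_t size;
    int fd;

    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
        return NULL;
    }
    /* the rendezvous info is published as job-level data */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);

    /* path of the file backing the shmem segment */
    if (PMIX_SUCCESS != PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_FILE, NULL, 0, &val)) {
        return NULL;
    }
    file = strdup(val->data.string);
    PMIX_VALUE_RELEASE(val);

    /* address at which the segment must be mapped */
    if (PMIX_SUCCESS != PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_ADDR, NULL, 0, &val)) {
        return NULL;
    }
    addr = (uint64_t) val->data.size;
    PMIX_VALUE_RELEASE(val);

    /* size of the segment */
    if (PMIX_SUCCESS != PMIx_Get(&wildcard, PMIX_HWLOC_SHMEM_SIZE, NULL, 0, &val)) {
        return NULL;
    }
    size = val->data.size;
    PMIX_VALUE_RELEASE(val);

    /* map the read-only topology into our address space */
    fd = open(file, O_RDONLY);
    if (fd >= 0) {
        if (0 != hwloc_shmem_topology_adopt(&topo, fd, 0,
                                            (void *) (uintptr_t) addr, size, 0)) {
            topo = NULL;   /* e.g., the required address range is unavailable */
        }
        close(fd);
    }
    free(file);
    return topo;
}
```

If the adoption fails (for example, because the required address range is already in use within the client), the process must fall back to one of the other forms listed above.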

Should none of those be available, or if the user has specified a topology file that is to be used in place of whatever the environment provides, then OMPI will either read the topology from the file or perform its own local discovery. The latter is highly discouraged as it leads to significant scaling issues (both in terms of startup time and memory footprint) on complex machines with many cores and multiple layers in their memory hierarchy.
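
When the shmem form cannot be used, a process can build its topology from the XML string instead or, as a last resort, perform local discovery. A hedged sketch, assuming HWLOC 2.x and the PMIx-defined PMIX_HWLOC_XML_V2 key carrying the v2 XML string:

```c
#include <string.h>

#include <hwloc.h>
#include <pmix.h>

/* Sketch: fall back to the XML form of the topology published by PMIx, or
 * (last resort, discouraged at scale) perform local discovery. */
static hwloc_topology_t get_topology_fallback(const pmix_proc_t *wildcard)
{
    pmix_value_t *val;
    hwloc_topology_t topo;

    if (0 != hwloc_topology_init(&topo)) {
        return NULL;
    }
    if (PMIX_SUCCESS == PMIx_Get(wildcard, PMIX_HWLOC_XML_V2, NULL, 0, &val)) {
        /* parse the exported XML instead of probing the hardware */
        hwloc_topology_set_xmlbuffer(topo, val->data.string,
                                     (int) strlen(val->data.string) + 1);
        PMIX_VALUE_RELEASE(val);
    }
    /* if no XML was provided, hwloc_topology_load() performs local discovery */
    if (0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return NULL;
    }
    return topo;
}
```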

Once the topology has been obtained, the next question one must consider is: what does that topology represent? Is it the topology assigned to the application itself (e.g., via cgroup)? Or is it the overall topology as seen by the base OS? OMPI is designed to utilize the former - i.e., it expects to see the topology assigned to the application, and thus considers any resources present in the topology to be available for its use. It is therefore important to be able to identify the scope of the topology, and to appropriately filter it when necessary.

Unfortunately, the question of when to filter depends upon the method of launch, and (in the case of direct launch) on the architecture of the host environment. Let's consider the various scenarios:

mpirun launch

mpirun is always started at the user level. Therefore, both mpirun and its compute node daemons only "see" the topology made available to them by the base OS - i.e., whatever cgroup is being applied to the application has already been reflected in the HWLOC topology discovered by mpirun or the local compute node daemon. Thus, the topology provided by mpirun (regardless of the delivery mechanism) contains a full description of the resources available to that application.

Note that users can launch multiple mpirun applications in parallel within that same allocation, using an appropriate command-line option (e.g., --cpu-set) to assign specific subsets of the overall allocation to each invocation. In this case (a soft resource assignment), the topology will have been filtered by each mpirun to reflect the subdivision of resources between invocations - no further processing is required.

DVM launch

The PRRTE DVM is essentially a persistent version of mpirun - it establishes and maintains a set of compute node daemons, each of which "sees" the topology made available to them by the base OS. The topology they provide to their respective local clients is, therefore, fully constrained.

However, the DVM supports multiple parallel invocations of prun, each launching a separate application and potentially specifying a different soft resource assignment. The daemons cannot provide a different shmem version of the HWLOC topology for each application, leaving us with the following options in cases where soft assignments have been made:

  • provide applications in this scenario with the topology via one of the other mechanisms (e.g., as a v2 XML string). The negative here is that each process then winds up with a complete instance of the topology tree, which can be fairly large (i.e., ~1MByte) for a complex system. Multiplied by significant ppn values, this represents a non-trivial chunk of system memory and is undesirable.

  • provide applications with the "base" shmem topology along with their soft constraints. This requires that each application process "filter" the topology with its constraints. However, the topology in the shmem region is read-only - thus, each process would have to create "shadow" storage of the filtered results for its own use. In addition to the added code complexity, this again increases the footprint of the topology support within each process.

  • have the daemon compute the OMPI-utilized values from the constrained topology using the soft allocation and provide those values to each process using PMIx (see the sketch below). This is the method currently utilized by PRRTE/OMPI. The negative is that it requires pre-identifying the information OMPI might desire, which may change over time and according to the needs of specific applications. Extension of PRRTE/PMIx to cover ever broader ranges of use-cases, combined with fallback code paths in OMPI for when the information is not available, has so far kept this approach workable.
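
The following sketch illustrates the third option from the consuming side. It assumes the standard PMIx keys PMIX_CPUSET (the binding applied to the process) and PMIX_LOCAL_PEERS (the ranks sharing the node) purely as examples of daemon-computed values; the actual set of keys OMPI consumes is broader.

```c
#include <stdio.h>

#include <pmix.h>

/* Sketch: consume values the daemon pre-computed from its topology and the
 * soft allocation, rather than parsing a topology locally. */
static void show_precomputed_values(const pmix_proc_t *myproc)
{
    pmix_proc_t wildcard;
    pmix_value_t *val;

    /* the PU binding applied to this process, as a bitmap string */
    if (PMIX_SUCCESS == PMIx_Get(myproc, PMIX_CPUSET, NULL, 0, &val)) {
        printf("rank %u bound to: %s\n", myproc->rank, val->data.string);
        PMIX_VALUE_RELEASE(val);
    }

    /* the ranks sharing this node, computed by the daemon */
    PMIX_LOAD_PROCID(&wildcard, myproc->nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_LOCAL_PEERS, NULL, 0, &val)) {
        printf("local peers: %s\n", val->data.string);
        PMIX_VALUE_RELEASE(val);
    }
}
```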

Direct launch

In the case of direct launch (i.e., launch by a host-provided launcher such as srun for Slurm), the answer depends on the architecture of the host environment's PMIx support:

  • Per-job (step) daemon hosting the PMIx server. This is the most common architecture. The per-step daemon executes at the user level within the job step's allocation, so (as with mpirun) the topology it provides generally already reflects the resources assigned to the application.

  • System daemon hosting the PMIx server. This is less common in practice due to the security mismatch - the system daemon must operate at a privileged level, while the application is operating at a user level. However, there are scenarios where this may be permissible or even required. In such cases, the system daemon will expose an unfiltered view of the local topology to all applications executing on that node. This is essentially equivalent to the DVM launch mode described above, except that there is no guarantee that the host environment will provide all the information required by OMPI. Thus, it may be necessary to filter the topology in such cases.
  • Singleton. By definition, singletons execute without the support of any RTE. While technically they could connect to a system-level PMIx server, OMPI initializes application processes as PMIx "clients" and not "tools". Thus, the PMIx client library does not support discovery of and connection to an arbitrary PMIx server - it requires that either the server identify itself via envars or that the application provide the necessary rendezvous information. Singletons, therefore, must discover the topology for themselves (see the sketch below). If operating under external constraints (e.g., cgroups), the discovery will yield an appropriately constrained set of resources. Binding of the singleton (i.e., "self-binding") within those resources can be accomplished by setting an appropriate MCA parameter in the environment prior to execution.
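
A minimal sketch of what that looks like at the HWLOC level (assuming HWLOC 2.x): the MCA parameter mentioned above is what OMPI itself consults, while the code below simply shows discovery and self-binding done by hand.

```c
#include <stdio.h>

#include <hwloc.h>

/* Sketch: a singleton discovers its own (possibly cgroup-constrained)
 * topology and self-binds to the first available core. */
int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t core;

    hwloc_topology_init(&topo);
    /* local discovery: by default only resources allowed to us are reported */
    hwloc_topology_load(topo);

    printf("cores available to this process: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));

    core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, 0);
    if (NULL != core &&
        0 != hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS)) {
        perror("hwloc_set_cpubind");
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```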

Recommended practice

While there are flags to indicate if OMPI has been launched by its own RTE (whether mpirun or DVM), this in itself is not sufficient information to determine if the topology reflects the resources assigned to the application. The best method, therefore, is to:

a. attempt to access the desired information directly from PMIx. In most cases, all OMPI-required information will have been provided. This includes relative process locality and device (NIC and GPU) distances between each process and their local devices. If found, this information accurately reflects the actual resource utilization/availability for the application, thereby removing the need to directly access the topology itself. This is the recommended practice (see the sketch after this list).

b. if the desired information is not available from PMIx, then one must turn to the topology for the answers.
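
A sketch of that pattern, using the standard PMIX_LOCALITY_STRING key (which describes where a given process is bound) purely as an illustrative example:

```c
#include <stdio.h>
#include <string.h>

#include <pmix.h>

/* Sketch of the recommended pattern: ask PMIx first, and only consult the
 * HWLOC topology when the host did not provide the value. */
static char *get_peer_locality(const pmix_proc_t *peer)
{
    pmix_value_t *val;
    pmix_status_t rc;
    char *loc = NULL;

    rc = PMIx_Get(peer, PMIX_LOCALITY_STRING, NULL, 0, &val);
    if (PMIX_SUCCESS == rc) {
        loc = strdup(val->data.string);
        PMIX_VALUE_RELEASE(val);
    } else {
        /* step b: fall back to the topology itself (filtering it if it is
         * an unconstrained, system-level view) */
        fprintf(stderr, "locality not provided (%s) - falling back to topology\n",
                PMIx_Error_string(rc));
    }
    return loc;
}
```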
