--report-bindings on whole machine, plus numa markers #6760
Conversation
How do you know what the "whole machine" looks like on all the compute nodes? It appears that you are assuming all nodes are identical and so you can just mark any "disallowed" PUs on the remote machines based on this machine's local topology. That seems like an often inaccurate assumption as it depends on the cgroup settings, what processors are active on each node, etc. It feels like you would have to get each daemon to report the "whole machine" to ensure you have a complete picture in order to get this right.

We currently rely on our "topology signature" to detect whether or not a node is different from where mpirun is executing, but that might not be good enough to accurately detect a difference in this case (e.g., we only report the number of available CPUs, not which ones are available vs disallowed). Might take some playing around to determine what is required to meet this objective.

I'm also a bit disturbed about marking NUMA regions as there are some significant assumptions being made in the PR. The entire concept of "NUMA" is undergoing considerable change as memory layouts rapidly evolve. At the least, we should make it an option (not the default) as it may not even mean anything on upcoming systems.

I have no real objection to the idea of showing a broader picture of the machine - I'm just not sure you are accurately doing so with this PR.
Shoot, I didn't trace the lifecycle of the incoming topology for those calls. If it's not the local machine, then I agree this PR doesn't get us very far. Without having looked at the specifics yet, the design I'm leaning toward is:

It doesn't sound conceptually too hard, but it may be a bit of code.
You don't need to retain the WHOLE_SYSTEM topology on the backend compute nodes since nothing back there can use it. What you need to do is:

When generating the binding output, we always use the topology for that specific node, and so you should then get the correct output.
bot:ompi:retest |
@hppritcha Looks like Cray aborted due to a build timeout:
Force-pushed from c1bed7b to ca59d47.
I took a different approach, just adding a new rank-level option. I agree with the mpirun-level stuff using the regular non-whole-system topologies for its signatures and its info about the workers; expanding those to include WHOLE_SYSTEM seemed like a lot of extra data and data collection for a small-ish feature. So I added an alternate method of printing bindings that happens at the rank level, where the needed topologies are easily available. The only practical downside I see is that the usual --report-bindings was capable of printing bindings even if your app was "hostname", while my -mca report_bindings_full 1 only reaches its print statements if the app calls MPI_Init.
You are kidding, yes? You are really going to have 1000s of ranks each dumping their own binding string across stdout?? You do realize that the output will all be interleaved and hard to read, yes? I realize the alternative is a bit of work, but surely that is better than something effectively unusable.
It's sent to rank 0, who does the printing. It's one big blob of stdout, the same as --report-bindings already produces. It's more communication than the regular method because there's no compression to notice that many ranks are sending the same data, but I don't think it's too excessive of a gatherv.
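For readers unfamiliar with the pattern, here's a minimal sketch of that gather-to-rank-0 approach (illustrative only, not the PR's code; the binding string here is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* placeholder for the real binding-map string built on each rank */
    char binding[64];
    snprintf(binding, sizeof(binding), "MCW rank %d bound to ...", rank);
    int len = (int)strlen(binding) + 1;

    /* rank 0 first learns every string length, then gathers the strings */
    int *lens = NULL, *displs = NULL;
    char *all = NULL;
    if (rank == 0) {
        lens = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
    }
    MPI_Gather(&len, 1, MPI_INT, lens, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < size; i++) {
            displs[i] = total;
            total += lens[i];
        }
        all = malloc(total);
    }
    MPI_Gatherv(binding, len, MPI_CHAR,
                all, lens, displs, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* rank 0 prints in rank order, so nothing interleaves */
    if (rank == 0) {
        for (int i = 0; i < size; i++)
            puts(all + displs[i]);
        free(lens); free(displs); free(all);
    }
    MPI_Finalize();
    return 0;
}
```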
I pushed a topology signature update (second commit to this PR) that adds a compressed list of os_indexes for the PUs. This way two different cgroups that look the same based on their socket/core/PU/etc. counts come in with different signatures, and the compressed lists tend to be pretty small. I'm still mulling over whether to change back to modifying the original --report-bindings from the mpirun level. Currently this PR drops in a new rank-level report-bindings option that's independent of the old one, so it doesn't have to worry about what topologies are collected by mpirun.
The compressed list of PU indexes should be identical to hwloc_bitmap_list_asprintf(&string, hwloc_topology_get_topology_cpuset(topology)). There's also a snprintf() variant if you prefer static buffers.
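For reference, a minimal example of that call (a sketch; error checking omitted):

```c
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topology;
    char *string = NULL;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* emits the PU set as a compressed list, e.g. "0-175" or "24-25,28-29" */
    hwloc_bitmap_list_asprintf(&string,
                               hwloc_topology_get_topology_cpuset(topology));
    printf("PUs: %s\n", string);

    free(string);
    hwloc_topology_destroy(topology);
    return 0;
}
```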
This PR contains both a change to the existing --report-bindings feature, as well as a rank-level pretty printer that can be used separately. The rank-level option may be an unnecessary addition, but it doesn't hurt anything. The behavior now is:

1. --report-bindings : uses a cached WHOLE_SYSTEM topology so the strings it prints represent the binding relative to the whole machine (examples below)
2. --report-bindings with --mca rtc_hwloc_report_bindings_hwview 1 : adds numa markers and omits the output that looks like "socket 0[core 1[hwt 0-3]]"
3. -mca report_bindings_full 1 : an alternate rank-level pretty-printer

I initially thought item 3 above was needed because mpirun was producing the --report-bindings output and it doesn't have WHOLE_SYSTEM topologies from the remote hosts (and I'm keeping that code partially because there may be multiple code paths where that's true for some of them), but the path I've traced involves the strings being produced locally at the host, which does know its own topology, then sent via opal_output() to be printed back at mpirun.

I combined a few related features here:

1. --report-bindings on the whole machine
2. mark unallowed (e.g. cgroup-excluded) parts of the whole machine with ~
3. numa markers
4. allow the hwloc tree to not have sockets/cores

Part 3 isn't turned on for the original --report-bindings feature. If you add --mca rtc_hwloc_report_bindings_hwview 1 then the output for --report-bindings is changed to include the numa markers.

1. whole machine:
The topology sent to the pretty-print functions usually doesn't have the WHOLE_SYSTEM flag, and in general we don't want WHOLE_SYSTEM. But for pretty-printing I think it makes more sense, so the new code uses a WHOLE_SYSTEM topology in the pretty-print function. Examples of what pretty-printing looks like with/without whole system: suppose the machine is

    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28

and we run with

    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2

to leave only these hardware threads active:

    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be (-np 2):

    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative:

    mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]

    mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]

2. mark unallowed (~):
When using the whole-machine option there's a bitmask available to identify the allowed PUs, e.g. omitting PUs not in our cgroup. To distinguish those PUs I'm using "~".

3. numa markers (<>):
I like having numa markers as well as the existing separators between sockets and cores. They're a little harder since the numas are more fluid: sockets always contain cores, not vice versa, so you can hard-code a loop over sockets followed by a loop over cores, but numas might be above or below sockets in the tree. This code identifies which level should be considered the child of the numas, and has each of the hard-coded loops capable of adding numa markers. Currently I don't have any tunable to turn off the numa markers. A lot of machines have fairly simple numa output where each socket contains one numa, and that ends up looking like this:

    [<..../..../..../....>][<..../..../..../....>]

If others feel that's too cluttered I'm okay with having some tunable so people have to ask for numa markers.

4. allow hwloc tree to not have sockets/cores:
I may be behind the times on hwloc development, but as far as I know hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE at the top and a PU at the bottom. So I added a little code to the loops so it would still print the PUs on a hypothetical machine that lacked any structuring of the PUs into cores/sockets.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
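A minimal hwloc sketch of the whole-machine idea in items 1 and 2 (an illustration, not the PR's code; WHOLE_SYSTEM is the flag name from the hwloc versions discussed here, renamed INCLUDE_DISALLOWED in newer hwloc): load the full machine, then mark PUs outside the allowed set the way the pretty-printer marks them with '~'.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* include disallowed PUs so the map covers the whole machine */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, (unsigned) i);
        /* '~' for PUs present on the machine but not available to us */
        putchar(hwloc_bitmap_isset(allowed, pu->os_index) ? '.' : '~');
    }
    putchar('\n');

    hwloc_topology_destroy(topo);
    return 0;
}
```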
When mpirun collects topologies it first asks for signatures, and then only asks a host to send its topology if the signature is something new. But with heterogeneous cgroups the signature doesn't have enough data to notice differences that matter. An example of the previous signature on one of our machines is

    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le

which is just a count of how many numa nodes, sockets, cores, PUs, etc. the host has. If I create a small cgroup with just 4 PUs from one of the cores, an example of the new signature is

    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le

But if I did that on two different machines and used different cores on each, the signature would look the same.

One way to distinguish such signatures is to also list the available PUs by os_index (I might be okay with logical indexes if they were logical under a WHOLE_SYSTEM topology, but in the non-WHOLE_SYSTEM topologies we're using I don't think the logical indexes would be good enough). This list could be large, so I've included a couple of different compressions of it and use whichever comes out shorter.

The first is a comma-separated list that compresses simple ranges like 10-15 and also detects patterns analogous to MPI_Type_vector(), e.g.

    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, further shrunk by noticing if the same char is repeated 5+ times, so

    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358

So final signature strings end up as things like:

    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
        (example with PUs 0-175; here "f*44" won the compression contest
         against "0-175")
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
        (example with PUs 12-15; here "000f" won against "12-15")
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
        (example with every other block of 4 PUs; here "0-3+8*22" won
         against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f")

Signed-off-by: Mark Allen <markalle@us.ibm.com>
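A sketch of the first compression (an illustration, not the commit's actual code; compress_indexes() is a hypothetical name, and the real code also tries the hex encoding and keeps whichever string is shorter): collapse consecutive indexes into "lo-hi", and when equal-length blocks repeat at a fixed stride, emit "lo-hi+stride*count".

```c
#include <stdio.h>

static void compress_indexes(const int *idx, int n)
{
    int i = 0;
    while (i < n) {
        /* length of the consecutive run starting at i */
        int len = 1;
        while (i + len < n && idx[i + len] == idx[i] + len)
            len++;

        /* count repetitions of that run at a fixed stride */
        int reps = 1, stride = 0;
        if (len >= 2 && i + 2 * len <= n) {
            stride = idx[i + len] - idx[i];
            while (i + (reps + 1) * len <= n) {
                int ok = 1;
                for (int k = 0; k < len && ok; k++)
                    ok = (idx[i + reps * len + k] == idx[i + k] + reps * stride);
                if (!ok)
                    break;
                reps++;
            }
        }

        if (i > 0)
            printf(",");
        if (reps > 1)
            printf("%d-%d+%d*%d", idx[i], idx[i] + len - 1, stride, reps);
        else if (len > 1)
            printf("%d-%d", idx[i], idx[i] + len - 1);
        else
            printf("%d", idx[i]);
        i += reps * len;
    }
    printf("\n");
}

int main(void)
{
    int pus[] = { 1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25 };
    compress_indexes(pus, 15);   /* prints "1-5+10*3" */
    return 0;
}
```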
That's a good-looking function, but I don't think it does as much compression as mine, e.g. noticing patterns analogous to MPI_Type_vector: yours gives "1-5,11-15,21-25" where mine identifies the repetition as "1-5+10*3". My code has an easier job, though, because I don't have to convert both ways, so I didn't bother writing a conversion function for the other direction.
I'm about to push again, just to change the code for the --report-bindings path to use a cached WHOLE_SYSTEM topology. The earlier comments had me believing those strings were produced at mpirun, but they're not, at least not in the code path I've been tracing. So a simple caching of one WHOLE_SYSTEM topology to be used inside the pretty-print functions is all it takes to make --report-bindings pretty again. I'm still leaving in my alternate rank-level pretty printer as a backup in case there are other codepaths I'm not aware of, e.g. if it turns out my --report-bindings WHOLE_SYSTEM fix is only a partial fix.
Force-pushed from 25c3984 to 42fed5d.
@markalle I'm a little confused -- isn't the premise that the nodes in the system can be heterogeneous (i.e., more than one topology)?
Yeah, but those --report-bindings output strings are being produced remotely, at the heterogeneous nodes. So each node only needs its own topology loaded WHOLE_SYSTEM to produce its pretty string, as opposed to the mpirun-level node needing everybody's topology in WHOLE_SYSTEM form. Looking through all the "cset2mapstr" calls, I think the ones that matter for --report-bindings output are all using opal_hwloc_topology. There are some cset2mapstr calls that operate on node->topology->topo, for example, and those would produce the less-pretty string, but I think those are things like affinity running at some verbosity setting, not the --report-bindings output, which is where I think the pretty-print has the most value.
@jsquyres @rhc54 @bwbarrett, is there anything else for this PR? If not, let's get this in.
I'd rather do this right by taking into account the hetero nature of the system that is inherent in this use-case. I'll try to put something together.
I think it's handling at least one significant aspect of the heterogeneous nature in the second commit. That's the one that expands the signatures a bit so heterogeneous cgroups that might otherwise look the same will successfully produce different signatures and thus trigger mpirun to ask for topologies from those systems. Or were you talking about some other aspect of the hetero cases that needs to be covered?
Provide --report-bindings output as per @markalle's description. This carries the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users. Note that the PRRTE daemons now do intake the WHOLE_SYSTEM version of the topology tree to ensure we are accurately reporting which cpus are present on the node vs available to us. Daemons always use the summary we create for each object when mapping, and the summary only contains available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
Provide --report-bindings output as per @markalle's description. This carries the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users, and extended the original work to support HWLOC 1.x as well as 2.x. Note that the PRRTE daemons now do intake the WHOLE_SYSTEM version of the topology tree to ensure we are accurately reporting which cpus are present on the node vs available to us. Daemons always use the summary we create for each object when mapping, and the summary only contains available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
Provide --report-bindings output as per @markalle's description. This ports the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users, and extended the original work to support HWLOC 1.x as well as 2.x.

Note that the topology tree always includes the "complete" cpuset for each object along with the allowed cpuset (in the "cpuset" field). So printing the binding map with all processors shown simply requires using the "complete" cpuset field - you don't need to collect the "whole_system" topology.

For detection purposes, pass the "signature" containing the list form of both the complete and cpuset fields. Only include the complete field if the two differ - otherwise, just leave that field blank. We don't actually use the signature for anything other than detecting that there is a difference between two machines, so this will suffice - if two machines have different cgroups, then we will treat the topologies as different.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
This has been dealt with in PRRTE and is no longer applicable to OMPI master |
I combined a few related features here:
1. --report-bindings on the whole machine
2. mark unallowed (e.g. cgroup-excluded) parts of the whole machine with ~
3. numa markers
4. allow the hwloc tree to not have sockets/cores

1. whole machine:
The incoming topology to the pretty-print functions probably didn't use the WHOLE_SYSTEM flag. In general we don't want WHOLE_SYSTEM, but for pretty printing I think it makes more sense. So I'm caching a WHOLE_SYSTEM load of the current machine and using it if the incoming topology also identifies as the current machine.
Examples of what pretty-printing looks like with/without whole system:
Suppose the machine is

    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28

and we run with

    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2

to leave only these hardware threads active:

    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be (-np 2):

    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative:

    mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]

    mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]
2. mark unallowed (~):
When using the whole-machine option there's a bitmask available to identify the allowed PUs, e.g. omitting PUs not in our cgroup. To distinguish those PUs I'm using "~".
3. numa markers (<>):
I like having numa markers as well as the existing separators between sockets and cores. They're a little harder since the numas are more fluid: sockets always contain cores, not vice versa, so you can hard-code a loop over sockets followed by a loop over cores, but numas might be above or below sockets in the tree. This code identifies which level should be considered the child of the numas, and has each of the hard-coded loops capable of adding numa markers.
Currently I don't have any tunable to turn off the numa markers. A lot of machines have fairly simple numa output where each socket contains one numa, and that ends up looking like this:
    [<..../..../..../....>][<..../..../..../....>]
If others feel that's too cluttered I'm okay with having some tunable so people would have to ask for numa markers.
4. allow the hwloc tree to not have sockets/cores:
I may be behind the times on hwloc development, but as far as I know hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE at the top and a PU at the bottom. So I added a little code to the loops so it would still print the PUs on a hypothetical machine that lacked any structuring of the PUs into cores/sockets (a minimal illustration follows below).
Signed-off-by: Mark Allen <markalle@us.ibm.com>
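As a footnote to item 4, here's a minimal hwloc sketch (an illustration, not the commit's code) of walking PUs without assuming any package/core levels exist in between:

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* hwloc_get_next_obj_by_type() visits every PU even on a topology
     * with no socket/core levels between the machine root and the PUs */
    hwloc_obj_t pu = NULL;
    while ((pu = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, pu)) != NULL)
        printf("PU logical %u (os_index %u)\n", pu->logical_index, pu->os_index);

    hwloc_topology_destroy(topo);
    return 0;
}
```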