
--report-bindings on whole machine, plus numa markers #6760

Closed
wants to merge 2 commits

Conversation

markalle
Contributor

@markalle markalle commented Jun 13, 2019

I combined a few related features here:

  1. --report-bindings on whole machine
  2. mark unallowed (eg cgroup) parts of the whole machine with ~
  3. numa markers
  4. allow hwloc tree to not have sockets/cores

  1. whole machine:

The incoming topology to the pretty-print functions probably didn't use the WHOLE_SYSTEM flag. In general we don't want WHOLE_SYSTEM, but for pretty printing I think it makes more sense. So I'm caching a WHOLE_SYSTEM load of the current machine and using it if the incoming topology also identifies as the current machine.

Examples of what pretty-printing looks like with/without whole system:

Suppose the machine is

    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28

and we run with

    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2
to leave only these hardware threads active:
    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be
(-np 2)

    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative

mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]
mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]
  2. mark unallowed (~)

When using the whole-machine option there's a bitmask available to identify
the allowed PUs, eg omitting PUs not in our cgroup. To distinguish those
PUs I'm using "~"

  3. numa markers (<>)

I like having numa markers as well as the existing separators between sockets and cores. They're a little harder to place since the numas are more fluid, eg sockets always contain cores and not vice versa, so you can hard-code a loop over sockets followed by a loop over cores. But numas might be above or below sockets in the tree.

This code identifies which level should be considered the child of the numas, and has each of the hard coded loops capable of adding numa markers.

Currently I don't have any tunable to turn off the numa markers. A lot of machines have fairly simple numa output where each socket contains one numa, and that ends up looking like this:
[<..../..../..../....>][<..../..../..../....>]
If others feel that's too cluttered I'm okay with having some tunable so people would have to ask for numa markers.

  4. allow hwloc tree to not have sockets/cores

I may be behind the times on hwloc development, but as far as I know hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE at the top and PU at the bottom. So I added a little code to the loops so it would still print the PUs on a hypothetical machine that lacked any structuring of the PUs into cores/sockets.

Signed-off-by: Mark Allen markalle@us.ibm.com

@rhc54
Contributor

rhc54 commented Jun 14, 2019

How do you know what the "whole machine" looks like on all the compute nodes? It appears that you are assuming all nodes are identical and so you can just mark any "disallowed" PUs on the remote machines based on this machine's local topology. That seems like an often inaccurate assumption as it depends on the cgroup settings, what processors are active on each node, etc.

It feels like you would have to get each daemon to report the "whole machine" to ensure you have a complete picture in order to get this right. We currently rely on our "topology signature" to detect whether or not a node is different from where mpirun is executing, but that might not be good enough to accurately detect a difference in this case (e.g., we only report the number of available CPUs, not which ones are available vs disallowed). Might take some playing around to determine what is required to meet this objective.

I'm also a bit disturbed about marking NUMA regions as there are some significant assumptions being made in the PR. The entire concept of "NUMA" is undergoing considerable change as memory layouts rapidly evolve. At the least, we should make it an option (not the default) as it may not even mean anything on upcoming systems.

I have no real objection to the idea of showing a broader picture of the machine - I'm just not sure you are accurately doing so with this PR.

@markalle
Contributor Author

Shoot, I didn't trace the lifecycle of the incoming topology for those calls. If it's not the local machine then I agree this PR doesn't get us very far.

Without having looked at the specifics yet the design I'm leaning toward is

  • at topology_load time, load without WHOLE_SYS as currently done, then load again with WHOLE_SYS and attach the WHOLE_SYS version to the userdata of the main version; that way the WHOLE_SYS topology is available but doesn't interfere, because it's not used for most things
  • make the signature take into account both the main topology + any WHOLE_SYS attached to it
  • wherever topologies are sent/recved double that code to handle the attached WHOLE_SYS (I haven't looked at it, does this use the hwloc XML import/export functions?)

It doesn't sound conceptually too hard, but it may be a bit of code.

@rhc54
Contributor

rhc54 commented Jun 14, 2019

You don't need to retain the WHOLE_SYS on the backend compute nodes since nothing back there can use it. What you need to do is:

  • generate a signature that somehow exposes what the pattern of available/disallowed PUs looks like. This signature gets passed back to mpirun in the "phone home" message towards the end of "daemon init". Remember, which PUs are allowed/disallowed isn't constant across the nodes - thus, it isn't enough to know the number, you have to know precisely which PUs fall into each category.

  • mpirun already compares its signature with the ones coming in from the remote nodes. When they don't match, it sends a message to that one node and asks it for a dump of its topology. This is where you could ask that it send the WHOLE_SYS version. The returned topology gets stored in mpirun and the node object's topology field is pointed to it.

When generating the binding output, we always use the topology for that specific node, and so you should then get the correct output.

@hppritcha
Member

bot:ompi:retest

@gpaulsen
Member

gpaulsen commented Jun 17, 2019

@hppritcha Looks like the Cray build aborted due to a build timeout:
https://jenkins.open-mpi.org/jenkins/job/open-mpi.build.platforms/Platform=cray_cle/4989/console

7. Running autotools on top-level tree

==> Remove stale files
==> Writing m4 file with autogen.pl results
==> Generating opal_get_version.sh
Running: autom4te --language=m4sh opal_get_version.m4sh -o opal_get_version.sh
==> Running autoreconf
Running: autoreconf -ivf --warnings=all,no-obsolete,no-override -I config
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal -I config --force --warnings=all,no-obsolete,no-override -I config
Build timed out (after 10 minutes). Marking the build as aborted.
Build was aborted
Finished: ABORTED

@markalle
Contributor Author

I took a different approach just adding a new rank-level option
-mca report_bindings_full 1

I agree with the mpirun-level stuff using the regular non-whole-system topologies and collecting that for its signatures and its info about the workers. Expanding those to include WHOLE_SYSTEM seemed like a lot of extra data and data collection for a small-ish feature.

So I added an alternate method of printing bindings that happens at the rank-level, where the needed topologies are easily available.

The only practical downside I see is that the usual --report-bindings was capable of printing bindings even if your app was "hostname", while my -mca report_bindings_full 1 only reaches its print statements if the app calls MPI_Init.

@rhc54
Contributor

rhc54 commented Jun 17, 2019

You are kidding, yes? You are really going to have 1000s of ranks each dumping their own binding string across stdout?? You do realize that the output will all be interleaved and hard to read, yes?

I realize the alternative is a bit of work, but surely that is better than something effectively unusable.

@markalle
Contributor Author

It's sent to rank 0, who does the printing. It's a big blob of stdout, same as --report-bindings already is. It's more communication than the regular method because it has no compression to realize that many ranks are sending the same data, but I don't think it's too excessive of a gatherv.

@markalle
Contributor Author

I pushed a topology signature update (second commit to this PR) that adds a compressed list of os_indexes for the PUs. This way two different cgroups that look the same based on their socket/core/PU/etc counts come in with different signatures, and the compressed lists tend to be pretty small.

I'm still mulling over whether to change back to modifying the original --report-bindings from the mpirun level. Currently this PR drops in a new rank-level report-bindings option that's independent of the old, so it doesn't have to worry about what topologies are collected by mpirun.

@bgoglin
Contributor

bgoglin commented Jul 12, 2019

The compressed list of PU indexes should be identical to hwloc_bitmap_list_asprintf(&string, hwloc_topology_get_topology_cpuset(topology)). There's also a snprintf() if you prefer static buffers.

This PR contains both a change to the existing --report-bindings
feature, as well as a rank-level pretty printer that can be used
separately.  The rank-level may be an unnecessary addition, but
it doesn't hurt anything.

The behavior now is
1. --report-bindings  :  uses a cached WHOLE_SYSTEM to make the
                         strings it prints represent the binding
                         relative to the whole machine (examples
                         below)
2. --report-bindings with
   --mca rtc_hwloc_report_bindings_hwview 1
                      :  adds numa markers and omits the output
                         that looks like "socket 0[core 1[hwt 0-3]]"
3.  -mca report_bindings_full 1
                      :  an alternate rank-level pretty-printer

I initially thought item 3 above was needed because mpirun
was producing the --report-bindings output and it doesn't have
WHOLE_SYSTEM topologies from the remote hosts (and I'm keeping
that code partially because there may be other code paths where
that's still true), but in the path I've traced the strings are
produced locally at the host, which does know its own topology,
then passed through opal_output() to be printed back at mpirun.

I combined a few related features here:
1. --report-bindings on whole machine
2. mark unallowed (eg cgroup) parts of the whole machine with ~
3. numa markers
4. allow hwloc tree to not have sockets/cores

Part 3 isn't turned on for the original --report-bindings feature.
If you add --mca rtc_hwloc_report_bindings_hwview 1 then the output
for --report-bindings is changed to include the numa markers.

1. whole machine:

The topology sent to the pretty-print functions usually doesn't have the
WHOLE_SYSTEM flag, and in general we don't want WHOLE_SYSTEM.  But for
pretty-printing I think it makes more sense, so the new code uses a
WHOLE_SYSTEM topology in the pretty-print function.

Examples of what pretty-printing looks like with/without whole system:

Suppose the machine is
    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28
and we run with
    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2
to leave only these hardware threads active:
    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be
(-np 2)
    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative
mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]
mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]

2. mark unallowed (~)

When using the whole-machine option there's a bitmask available to identify
the allowed PUs, eg omitting PUs not in our cgroup.  To distinguish those
PUs I'm using "~"

3. numa markers (<>)

I like having numa markers as well as the existing separators between
sockets and cores.  They're a little harder to place since the numas are
more fluid, eg sockets always contain cores and not vice versa, so you
can hard-code a loop over sockets followed by a loop over cores. But
numas might be above or below sockets in the tree.

This code identifies which level should be considered the child of the
numas, and has each of the hard coded loops capable of adding numa markers.

Currently I don't have any tunable to turn off the numa markers. A lot of
machines have fairly simple numa output where each socket contains one numa,
and that ends up looking like this:
    [<..../..../..../....>][<..../..../..../....>]
If others feel that's too cluttered I'm okay with having some tunable so
people have to ask for numa markers.

4. allow hwloc tree to not have sockets/cores

I may be behind the times on hwloc development, but as far as I know
hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE
at the top and PU at the bottom. So I added a little code to the loops
so it would still print the PUs on a hypothetical machine that lacked any
structuring of the PUs into cores/sockets.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
When mpirun collects topologies it first asks for signatures and then
only asks a host to send its topology if the signature is something new.
But with heterogeneous cgroups the signature doesn't have enough data
to notice differences that matter.

An example of the previous signature on one of our machines is
    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le
which is just a count of how many numa nodes, sockets, cores, PUs etc
the host has.  If I create a small cgroup with just 4 PUs from one of
the cores, an example of the new signature is
    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le

But if I did that on two different machines and used different cores
on each, the signature would look the same. One way to distinguish
such signatures is to also list the available PUs by os_index (I might
be okay with logical indexes if it was logical under a WHOLE-SYSTEM
topology, but in the non-WHOLE-SYSTEM topologies we're using, I don't
think the logical indexes would be good enough).

This list could be large so I've included a couple different
compressions of it and use whichever comes out shorter.

The first is a comma separated list that compresses simple ranges like
10-15 and also detects patterns analogous to MPI_Type_vector(), eg
    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5,  11,12,13,14,15,  21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27,
     102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208 } = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, and
further shrunk by noticing if the same char is repeated 5+ times, so
    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5,  11,12,13,14,15,  21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27,
     102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208 } = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358

So the final signature strings end up looking like

example with PUs 0-175:
    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
here "f*44" won the compression contest against "0-175"

example with PUs 12-15:
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
here "000f" won against "12-15"

example with every other block of 4 PUs:
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
here "0-3+8*22" won against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f"

Signed-off-by: Mark Allen <markalle@us.ibm.com>
@markalle
Contributor Author

That's a good looking function, but I don't think it does as much compression as mine, e.g. noticing patterns analogous to MPI_Type_vector:

int a[] = {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25};

Yours gives "1-5,11-15,21-25" where mine identifies the repetition as "1-5+10*3"

My code has an easier job though because I don't have to convert both ways, so I didn't bother writing a conversion function for the other direction

@markalle
Contributor Author

I'm about to push again just to change the code for the --report-bindings path to use a cached WHOLE_SYSTEM topology. The earlier comments had me believing those strings were produced at mpirun, but they're not, at least not in the code path I've been tracing. So a simple caching of one WHOLE_SYSTEM topology to be used inside the pretty-print functions is all it takes to make --report-bindings pretty again.

I'm still leaving in my alternate rank-level pretty printer as a backup in case there are other codepaths I'm not aware of, eg if it turns out my --report-bindings WHOLE_SYSTEM fix is only a partial fix.

@jsquyres
Member

jsquyres commented Jul 29, 2019

So a simple caching of one WHOLE_SYSTEM topology to be used inside the pretty-print functions is all it takes to make --report-bindings pretty again.

@markalle I'm a little confused -- isn't the premise that the nodes in the system can be heterogeneous (i.e., more than one topology)?

@markalle
Contributor Author

Yeah, but those --report-bindings output strings are being produced remotely, at the heterogeneous nodes. So each node only needs its own topology loaded WHOLE_SYSTEM to produce its pretty string, as opposed to the mpirun-level node needing everybody's topology in WHOLE_SYSTEM form.

Looking through all the "cset2mapstr" calls I think the ones that matter for --report-bindings output are all using opal_hwloc_topology. There are some cset2mapstr calls that operate on node->topology->topo for example, and those calls would produce the less-pretty string, but I think those are things like affinity running at some verbosity setting, not the --report-bindings output, which is where I think the pretty-print has the most value.

@gpaulsen
Member

@jsquyres @rhc54 @bwbarrett, is there anything else for this PR? If not, let's get this in.

@rhc54
Contributor

rhc54 commented Sep 10, 2019

I'd rather do this right by taking into account the hetero nature of the system that is inherent in this use-case. I'll try to put something together.

@markalle
Copy link
Contributor Author

I think it's handling at least one significant aspect of the heterogeneous nature in the second commit. That's the one that expands the signatures a bit so heterogeneous cgroups that might otherwise look the same will successfully produce different signatures and thus trigger mpirun to ask for topologies from those systems.

Or were you talking about some other aspect of the hetero cases that needs to be covered?

rhc54 added a commit to rhc54/prrte that referenced this pull request Dec 31, 2019
Provide --report-bindings output as per @markalle's description. This
carries the functionality in the cited PR to PRRTE, modified to conform
to OMPI/PRRTE coding standards. I also eliminated having two different
ways of reporting the bindings so we don't confuse users.

Note that the PRRTE daemons now take in the WHOLE_SYSTEM version of
the topology tree to ensure we are accurately reporting what cpus are
present on the node vs available to us. Daemons always use the summary
we create for each object when mapping, and the summary only contains
available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
rhc54 added a commit to rhc54/prrte that referenced this pull request Dec 31, 2019
Provide --report-bindings output as per @markalle's description. This
carries the functionality in the cited PR to PRRTE, modified to conform
to OMPI/PRRTE coding standards. I also eliminated having two different
ways of reporting the bindings so we don't confuse users, and extended
the original work to support HWLOC 1.x as well as 2.x.

Note that the PRRTE daemons now take in the WHOLE_SYSTEM version of
the topology tree to ensure we are accurately reporting what cpus are
present on the node vs available to us. Daemons always use the summary
we create for each object when mapping, and the summary only contains
available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
rhc54 added a commit to rhc54/prrte that referenced this pull request Jan 1, 2020
Provide --report-bindings output as per @markalle's description. This
ports the functionality in the cited PR to PRRTE, modified to conform
to OMPI/PRRTE coding standards. I also eliminated having two different
ways of reporting the bindings so we don't confuse users, and extended
the original work to support HWLOC 1.x as well as 2.x.

Note that the topology tree always includes the "complete" cpuset for
each object along with the allowed cpuset (in the "cpuset" field). So
printing the binding map with all processors shown simply requires using
the "complete" cpuset field - you don't need to collect the
"whole_system" topology.

For detection purposes, pass the "signature" containing the list form of
both the complete and cpuset fields. Only include the complete field if
the two differ - otherwise, just leave that field blank. We don't
actually use the signature for anything other than detecting that there
is a difference between two machines, so this will suffice - if two
machines have different cgroups, then we will treat the topologies as
different.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>

@rhc54
Contributor

rhc54 commented Apr 7, 2020

This has been dealt with in PRRTE and is no longer applicable to OMPI master

@rhc54 rhc54 closed this Apr 7, 2020