--report-bindings on whole machine, plus numa markers #6760
Conversation
How do you know what the "whole machine" looks like on all the compute nodes? It appears that you are assuming all nodes are identical and so you can just mark any "disallowed" PUs on the remote machines based on this machine's local topology. That seems like an often inaccurate assumption as it depends on the cgroup settings, what processors are active on each node, etc. It feels like you would have to get each daemon to report the "whole machine" to ensure you have a complete picture in order to get this right.

We currently rely on our "topology signature" to detect whether or not a node is different from where mpirun is executing, but that might not be good enough to accurately detect a difference in this case (e.g., we only report the number of available CPUs, not which ones are available vs disallowed). Might take some playing around to determine what is required to meet this objective.

I'm also a bit disturbed about marking NUMA regions as there are some significant assumptions being made in the PR. The entire concept of "NUMA" is undergoing considerable change as memory layouts rapidly evolve. At the least, we should make it an option (not the default) as it may not even mean anything on upcoming systems.

I have no real objection to the idea of showing a broader picture of the machine - I'm just not sure you are accurately doing so with this PR.
Shoot, I didn't trace the lifecycle of the incoming topology for those calls. If it's not the local machine, then I agree this PR doesn't get us very far. Without having looked at the specifics yet, the design I'm leaning toward is:

It doesn't sound conceptually too hard, but it may be a bit of code.
You don't need to retain the WHOLE_SYSTEM topology on the backend compute nodes since nothing back there can use it. What you need to do is:

When generating the binding output, we always use the topology for that specific node, and so you should then get the correct output.
bot:ompi:retest |
@hppritcha Looks like Cray aborted due to a build timeout:
Force-pushed from c1bed7b to ca59d47.
I took a different approach, just adding a new rank-level option. I agree with the mpirun-level stuff using the regular non-whole-system topologies for its signatures and its info about the workers; expanding those to include WHOLE_SYSTEM seemed like a lot of extra data and data collection for a small-ish feature. So I added an alternate method of printing bindings that happens at the rank level, where the needed topologies are easily available. The only practical downside I see is that the usual --report-bindings was capable of printing bindings even if your app was "hostname", while my -mca report_bindings_full 1 only reaches its print statements if the app calls MPI_Init.
You are kidding, yes? You are really going to have 1000s of ranks each dumping their own binding string across stdout?? You do realize that the output will all be interleaved and hard to read, yes? I realize the alternative is a bit of work, but surely that is better than something effectively unusable.
It's sent to rank 0, who does the printing. It's one big blob of stdout, the same as --report-bindings already produces. It's more communication than the regular method because there's no compression to notice that many ranks are sending the same data, but I don't think it's too excessive of a gatherv.
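For readers unfamiliar with the pattern, here's a minimal sketch of that gather-to-rank-0 approach (illustrative only, not the PR's code; the binding string here is a placeholder):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* placeholder for the real binding-map string built on each rank */
    char binding[64];
    snprintf(binding, sizeof(binding), "MCW rank %d bound to ...", rank);
    int len = (int)strlen(binding) + 1;

    /* rank 0 first learns every string length, then gathers the strings */
    int *lens = NULL, *displs = NULL;
    char *all = NULL;
    if (rank == 0) {
        lens = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
    }
    MPI_Gather(&len, 1, MPI_INT, lens, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < size; i++) {
            displs[i] = total;
            total += lens[i];
        }
        all = malloc(total);
    }
    MPI_Gatherv(binding, len, MPI_CHAR,
                all, lens, displs, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* rank 0 prints in rank order, so nothing interleaves */
    if (rank == 0) {
        for (int i = 0; i < size; i++)
            puts(all + displs[i]);
        free(lens); free(displs); free(all);
    }
    MPI_Finalize();
    return 0;
}
```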
I pushed a topology signature update (second commit to this PR) that adds a compressed list of os_indexes for the PUs. This way two different cgroups that look the same based on their socket/core/PU/etc. counts come in with different signatures, and the compressed lists tend to be pretty small. I'm still mulling over whether to change back to modifying the original --report-bindings from the mpirun level. Currently this PR drops in a new rank-level report-bindings option that's independent of the old one, so it doesn't have to worry about what topologies are collected by mpirun.
The compressed list of PU indexes should be identical to hwloc_bitmap_list_asprintf(&string, hwloc_topology_get_topology_cpuset(topology)). There's also a snprintf() variant if you prefer static buffers.
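For reference, a minimal example of that call (a sketch; error checking omitted):

```c
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topology;
    char *string = NULL;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* emits the PU set as a compressed list, e.g. "0-175" or "24-25,28-29" */
    hwloc_bitmap_list_asprintf(&string,
                               hwloc_topology_get_topology_cpuset(topology));
    printf("PUs: %s\n", string);

    free(string);
    hwloc_topology_destroy(topology);
    return 0;
}
```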
This PR contains both a change to the existing --report-bindings feature, as well as a rank-level pretty printer that can be used separately. The rank-level option may be an unnecessary addition, but it doesn't hurt anything. The behavior now is:

1. --report-bindings : uses a cached WHOLE_SYSTEM topology so the strings it prints represent the binding relative to the whole machine (examples below)
2. --report-bindings with --mca rtc_hwloc_report_bindings_hwview 1 : adds numa markers and omits the output that looks like "socket 0[core 1[hwt 0-3]]"
3. -mca report_bindings_full 1 : an alternate rank-level pretty-printer

I initially thought item 3 above was needed because mpirun was producing the --report-bindings output and it doesn't have WHOLE_SYSTEM topologies from the remote hosts (and I'm keeping that code partially because there may be multiple code paths where that's true for some of them), but the path I've traced involves the strings being produced locally at the host, which does know its own topology, then sent via opal_output() to be printed back at mpirun.

I combined a few related features here:

1. --report-bindings on the whole machine
2. mark unallowed (e.g. cgroup-excluded) parts of the whole machine with ~
3. numa markers
4. allow the hwloc tree to not have sockets/cores

Part 3 isn't turned on for the original --report-bindings feature. If you add --mca rtc_hwloc_report_bindings_hwview 1 then the output for --report-bindings is changed to include the numa markers.

1. whole machine:
The topology sent to the pretty-print functions usually doesn't have the WHOLE_SYSTEM flag, and in general we don't want WHOLE_SYSTEM. But for pretty-printing I think it makes more sense, so the new code uses a WHOLE_SYSTEM topology in the pretty-print function. Examples of what pretty-printing looks like with/without whole system: suppose the machine is

    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28

and we run with

    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2

to leave only these hardware threads active:

    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be (-np 2):

    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative:

    mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]

    mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]

2. mark unallowed (~):
When using the whole-machine option there's a bitmask available to identify the allowed PUs, e.g. omitting PUs not in our cgroup. To distinguish those PUs I'm using "~".

3. numa markers (<>):
I like having numa markers as well as the existing separators between sockets and cores. They're a little harder since the numas are more fluid: sockets always contain cores, not vice versa, so you can hard-code a loop over sockets followed by a loop over cores, but numas might be above or below sockets in the tree. This code identifies which level should be considered the child of the numas, and has each of the hard-coded loops capable of adding numa markers. Currently I don't have any tunable to turn off the numa markers. A lot of machines have fairly simple numa output where each socket contains one numa, and that ends up looking like this:

    [<..../..../..../....>][<..../..../..../....>]

If others feel that's too cluttered I'm okay with having some tunable so people have to ask for numa markers.

4. allow hwloc tree to not have sockets/cores:
I may be behind the times on hwloc development, but as far as I know hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE at the top and a PU at the bottom. So I added a little code to the loops so it would still print the PUs on a hypothetical machine that lacked any structuring of the PUs into cores/sockets.

Signed-off-by: Mark Allen <markalle@us.ibm.com>
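A minimal hwloc sketch of the whole-machine idea in items 1 and 2 (an illustration, not the PR's code; WHOLE_SYSTEM is the flag name from the hwloc versions discussed here, renamed INCLUDE_DISALLOWED in newer hwloc): load the full machine, then mark PUs outside the allowed set the way the pretty-printer marks them with '~'.

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    /* include disallowed PUs so the map covers the whole machine */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM);
    hwloc_topology_load(topo);

    hwloc_const_cpuset_t allowed = hwloc_topology_get_allowed_cpuset(topo);
    int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    for (int i = 0; i < n; i++) {
        hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, (unsigned) i);
        /* '~' for PUs present on the machine but not available to us */
        putchar(hwloc_bitmap_isset(allowed, pu->os_index) ? '.' : '~');
    }
    putchar('\n');

    hwloc_topology_destroy(topo);
    return 0;
}
```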
When mpirun collects topologies it first asks for signatures, and then only asks a host to send its topology if the signature is something new. But with heterogeneous cgroups the signature doesn't have enough data to notice differences that matter. An example of the previous signature on one of our machines is

    2N:2S:22L3:22L2:44L1:44C:176H:ppc64le:le

which is just a count of how many numa nodes, sockets, cores, PUs, etc. the host has. If I create a small cgroup with just 4 PUs from one of the cores, an example of the new signature is

    1N:1S:1L3:1L2:1L1:1C:4H:ppc64le:le

But if I did that on two different machines and used different cores on each, the signature would look the same.

One way to distinguish such signatures is to also list the available PUs by os_index (I might be okay with logical indexes if they were logical under a WHOLE_SYSTEM topology, but in the non-WHOLE_SYSTEM topologies we're using I don't think the logical indexes would be good enough). This list could be large, so I've included a couple of different compressions of it and use whichever comes out shorter.

The first is a comma-separated list that compresses simple ranges like 10-15 and also detects patterns analogous to MPI_Type_vector(), e.g.

    {16,17,18,19,20,21,22,23} = 16-23
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 1-5+10*3
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 2-3+8*4,102-103+8*3,200-208
    {1,3,6,7,9,11,12} = 1,3,6,7,9,11,12

The second compression is a hex string containing the indexes, further shrunk by noticing if the same char is repeated 5+ times, so

    {16,17,18,19,20,21,22,23} = 0000ff
    {1,2,3,4,5, 11,12,13,14,15, 21,22,23,24,25} = 7c1f07c
    {2,3, 10,11, 18,19, 26,27, 102,103, 110,111, 118,119,
     200,201,202,203,204,205,206,207,208} = 30303030*18,303030*20,ff8
    {1,3,6,7,9,11,12} = 5358

So final signature strings end up as things like:

    2N:2S:22L3:22L2:44L1:44C:176H:HXf*44:ppc64le:le
        (example with PUs 0-175; here "f*44" won the compression contest
         against "0-175")
    1N:1S:1L3:1L2:1L1:1C:4H:HX000f:ppc64le:le
        (example with PUs 12-15; here "000f" won against "12-15")
    2N:2S:22L3:22L2:22L1:22C:88H:HL0-3+8*22:ppc64le:le
        (example with every other block of 4 PUs; here "0-3+8*22" won
         against "f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f")

Signed-off-by: Mark Allen <markalle@us.ibm.com>
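A sketch of the first compression (an illustration, not the commit's actual code; compress_indexes() is a hypothetical name, and the real code also tries the hex encoding and keeps whichever string is shorter): collapse consecutive indexes into "lo-hi", and when equal-length blocks repeat at a fixed stride, emit "lo-hi+stride*count".

```c
#include <stdio.h>

static void compress_indexes(const int *idx, int n)
{
    int i = 0;
    while (i < n) {
        /* length of the consecutive run starting at i */
        int len = 1;
        while (i + len < n && idx[i + len] == idx[i] + len)
            len++;

        /* count repetitions of that run at a fixed stride */
        int reps = 1, stride = 0;
        if (len >= 2 && i + 2 * len <= n) {
            stride = idx[i + len] - idx[i];
            while (i + (reps + 1) * len <= n) {
                int ok = 1;
                for (int k = 0; k < len && ok; k++)
                    ok = (idx[i + reps * len + k] == idx[i + k] + reps * stride);
                if (!ok)
                    break;
                reps++;
            }
        }

        if (i > 0)
            printf(",");
        if (reps > 1)
            printf("%d-%d+%d*%d", idx[i], idx[i] + len - 1, stride, reps);
        else if (len > 1)
            printf("%d-%d", idx[i], idx[i] + len - 1);
        else
            printf("%d", idx[i]);
        i += reps * len;
    }
    printf("\n");
}

int main(void)
{
    int pus[] = { 1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25 };
    compress_indexes(pus, 15);   /* prints "1-5+10*3" */
    return 0;
}
```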
That's a good-looking function, but I don't think it does as much compression as mine, e.g. noticing patterns analogous to MPI_Type_vector: yours gives "1-5,11-15,21-25" where mine identifies the repetition as "1-5+10*3". My code has an easier job, though, because I don't have to convert both ways, so I didn't bother writing a conversion function for the other direction.
I'm about to push again, just to change the code for the --report-bindings path to use a cached WHOLE_SYSTEM topology. The earlier comments had me believing those strings were produced at mpirun, but they're not, at least not in the code path I've been tracing. So a simple caching of one WHOLE_SYSTEM topology to be used inside the pretty-print functions is all it takes to make --report-bindings pretty again. I'm still leaving in my alternate rank-level pretty printer as a backup in case there are other codepaths I'm not aware of, e.g. if it turns out my --report-bindings WHOLE_SYSTEM fix is only a partial fix.
Force-pushed from 25c3984 to 42fed5d.
@markalle I'm a little confused -- isn't the premise that the nodes in the system can be heterogeneous (i.e., more than one topology)?
Yeah, but those --report-bindings output strings are being produced remotely, at the heterogeneous nodes. So each node only needs its own topology loaded WHOLE_SYSTEM to produce its pretty string, as opposed to the mpirun-level node needing everybody's topology in WHOLE_SYSTEM form. Looking through all the "cset2mapstr" calls, I think the ones that matter for --report-bindings output are all using opal_hwloc_topology. There are some cset2mapstr calls that operate on node->topology->topo, for example, and those would produce the less-pretty string, but I think those are things like affinity running at some verbosity setting, not the --report-bindings output, which is where I think the pretty-print has the most value.
@jsquyres @rhc54 @bwbarrett, is there anything else for this PR? If not, let's get this in.
I'd rather do this right by taking into account the hetero nature of the system that is inherent in this use-case. I'll try to put something together.
I think it's handling at least one significant aspect of the heterogeneous nature in the second commit. That's the one that expands the signatures a bit so heterogeneous cgroups that might otherwise look the same will successfully produce different signatures and thus trigger mpirun to ask for topologies from those systems. Or were you talking about some other aspect of the hetero cases that needs to be covered?
Provide --report-bindings output as per @markalle's description. This carries the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users. Note that the PRRTE daemons now do intake the WHOLE_SYSTEM version of the topology tree to ensure we are accurately reporting which cpus are present on the node vs available to us. Daemons always use the summary we create for each object when mapping, and the summary only contains available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
Provide --report-bindings output as per @markalle's description. This carries the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users, and extended the original work to support HWLOC 1.x as well as 2.x. Note that the PRRTE daemons now do intake the WHOLE_SYSTEM version of the topology tree to ensure we are accurately reporting which cpus are present on the node vs available to us. Daemons always use the summary we create for each object when mapping, and the summary only contains available cpus, so no harm done there.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
Provide --report-bindings output as per @markalle's description. This ports the functionality in the cited PR to PRRTE, modified to conform to OMPI/PRRTE coding standards. I also eliminated having two different ways of reporting the bindings so we don't confuse users, and extended the original work to support HWLOC 1.x as well as 2.x.

Note that the topology tree always includes the "complete" cpuset for each object along with the allowed cpuset (in the "cpuset" field). So printing the binding map with all processors shown simply requires using the "complete" cpuset field - you don't need to collect the "whole_system" topology.

For detection purposes, pass the "signature" containing the list form of both the complete and cpuset fields. Only include the complete field if the two differ - otherwise, just leave that field blank. We don't actually use the signature for anything other than detecting that there is a difference between two machines, so this will suffice - if two machines have different cgroups, then we will treat the topologies as different.

Refs open-mpi/ompi#6760

Signed-off-by: Ralph Castain <rhc@pmix.org>
This has been dealt with in PRRTE and is no longer applicable to OMPI master |
I combined a few related features here:
1. --report-bindings on the whole machine
2. mark unallowed (e.g. cgroup-excluded) parts of the whole machine with ~
3. numa markers
4. allow the hwloc tree to not have sockets/cores

1. whole machine:
The incoming topology to the pretty-print functions probably didn't use the WHOLE_SYSTEM flag. In general we don't want WHOLE_SYSTEM, but for pretty printing I think it makes more sense. So I'm caching a WHOLE_SYSTEM load of the current machine and using it if the incoming topology also identifies as the current machine.
Examples of what pretty-printing looks like with/without whole system:
Suppose the machine is

    [..../..../..../....][..../..../..../....]
     0    4    8    12    16   20   24   28

and we run with

    cgset -r cpuset.cpus=24,25,28,29 mycgroup1
    cgset -r cpuset.cpus=26,27,30,31 mycgroup2

to leave only these hardware threads active:

    mycgroup1: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/..~~]
    mycgroup2: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~..]

Without whole-system the printout (for both of the above) would be (-np 2):

    MCW rank 0 bound to socket 1[core 0[hwt 0-1]]: [][BB/..]
    MCW rank 1 bound to socket 1[core 1[hwt 0-1]]: [][../BB]

With whole-system the output is this, which I think is more informative:

    mycgroup1 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/BB~~/..~~]
    MCW rank 1 bound to socket 1[core 7[hwt 0-1]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/..~~/BB~~]

    mycgroup2 (-np 2):
    MCW rank 0 bound to socket 1[core 6[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~BB/~~..]
    MCW rank 1 bound to socket 1[core 7[hwt 2-3]]: [~~~~/~~~~/~~~~/~~~~][~~~~/~~~~/~~../~~BB]
2. mark unallowed (~):
When using the whole-machine option there's a bitmask available to identify the allowed PUs, e.g. omitting PUs not in our cgroup. To distinguish those PUs I'm using "~".
3. numa markers (<>):
I like having numa markers as well as the existing separators between sockets and cores. They're a little harder since the numas are more fluid: sockets always contain cores, not vice versa, so you can hard-code a loop over sockets followed by a loop over cores, but numas might be above or below sockets in the tree. This code identifies which level should be considered the child of the numas, and has each of the hard-coded loops capable of adding numa markers.
Currently I don't have any tunable to turn off the numa markers. A lot of machines have fairly simple numa output where each socket contains one numa, and that ends up looking like this:
    [<..../..../..../....>][<..../..../..../....>]
If others feel that's too cluttered I'm okay with having some tunable so people would have to ask for numa markers.
4. allow the hwloc tree to not have sockets/cores:
I may be behind the times on hwloc development, but as far as I know hwloc trees aren't guaranteed to have sockets and cores, just a MACHINE at the top and a PU at the bottom. So I added a little code to the loops so it would still print the PUs on a hypothetical machine that lacked any structuring of the PUs into cores/sockets (a minimal illustration follows below).
Signed-off-by: Mark Allen <markalle@us.ibm.com>
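As a footnote to item 4, here's a minimal hwloc sketch (an illustration, not the commit's code) of walking PUs without assuming any package/core levels exist in between:

```c
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* hwloc_get_next_obj_by_type() visits every PU even on a topology
     * with no socket/core levels between the machine root and the PUs */
    hwloc_obj_t pu = NULL;
    while ((pu = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_PU, pu)) != NULL)
        printf("PU logical %u (os_index %u)\n", pu->logical_index, pu->os_index);

    hwloc_topology_destroy(topo);
    return 0;
}
```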