
Split out the map/rank/bind into a separate man #557

Closed
wants to merge 4 commits into from

Conversation

jjhursey
Member

No description provided.

@jjhursey
Member Author

I'm working on a separate man page dedicated to the map/rank/bind functionality. This PR is definitely not ready, but I wanted to post it for the community to mark progress towards this goal for the v2 release. Anyone wanting to help feel free to reach out.

@jjhursey jjhursey linked an issue May 15, 2020 that may be closed by this pull request
@jjhursey jjhursey force-pushed the map-bind-md branch 2 times, most recently from 00182a9 to 17f9c0f Compare May 21, 2020 21:09
@jjhursey
Member Author

prun.1.md is done - there are a few ??? items in there that I need to return to.
I'll work on prte.1.md and the new map man page next.

@jjhursey
Member Author

prte man page is done.

@jjhursey
Member Author

Items that I removed from the prun manpage that are no longer options. Some (all?) of these were inherited from orterun so may not be valid any longer. Putting the list here in case we need to add one back. Note that I did not run with a personality specified when creating this list - we will need to think about how to document per-personality options.

`-dvm-uri <arg0>` (Note: this is now `--report-uri` in prte - do we need an alias?)

:   Specify the URI of the `prte` process, or the name of the file
    (specified as file:filename) that contains that info.

`--cpu-set <list>`

:   Restrict launched processes to the specified logical cpus on each
    node (comma-separated list). Note that the binding options will
    still apply within the specified envelope - e.g., you can elect to
    bind each process to only one cpu within the specified cpu set.

`-cpu-list, --cpu-list <cpus>`

:   List of processor IDs to bind processes to [default=NULL].

`-rf, --rankfile <rankfile>`

:   Provide a rankfile file.

`-xml-file, --xml-file <filename>`

:   Provide all output in XML format to the specified file.

`-tune, --tune <tune_file>`

:   Specify a tune file to set arguments for various MCA modules and
    environment variables. See the "Setting MCA parameters and
    environment variables from file" section, below.

`-debug, --debug` (Note: this changed to mean something different in prte)

:   Invoke the user-level debugger indicated by the
    `prte_base_user_debugger` MCA parameter.

`-debugger, --debugger <args>`

:   Sequence of debuggers to search for when `--debug` is used (i.e. a
    synonym for `prte_base_user_debugger` MCA parameter).

`-tv, --tv`

:   Launch processes under the TotalView debugger. Deprecated backwards
    compatibility flag. Synonym for `--debug`.

`-cf, --cartofile <cartofile>`

:   Provide a cartography file.

`--ppr <list>`

:   Comma-separated list of number of processes on a given resource type
    [default: none]. (deprecated in favor of --map-by ppr:<list>)

`-report-events, --report-events <URI>`

:   Report events to a tool listening at the specified URI.

`-use-hwthread-cpus, --use-hwthread-cpus`

:   Use hardware threads as independent cpus.

`-use-regexp, --use-regexp`

:   Use regular expressions for launch.

`-novm, --novm`

:   Execute without creating an allocation-spanning virtual machine
    (only start daemons on nodes hosting application procs).

The discussion about when to use --mca vs --pmixmca vs --prtemca needs some clarification for users, as does the role of the "gmca" versions of each. Are we missing a --gprtemca option?

I noticed that --map-by ppr:1:package does not seem to automatically set the number of processes correctly. In the example below it's using the core count (20 per node x 3 nodes) instead of the package count (2 per node x 3 nodes):

[jjhursey@c712f6n01 tests] prte --hostfile hostfile.txt --map-by ppr:1:package hostname
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for
this topology can support:

  App: hostname
  Number of procs:  60
  Procs mapped:  6
  Total number of procs:  60
  PPR: 1:package

Please revise the conflict and try again.
--------------------------------------------------------------------------
[jjhursey@c712f6n01 tests] prte --hostfile hostfile.txt --np 6 --map-by ppr:1:package hostname
c712f6n01
c712f6n01
c712f6n03
c712f6n03
c712f6n02
c712f6n02
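
To make the mismatch concrete, here is a small arithmetic sketch of the two counts involved. It is not PRRTE code; the helper name `ppr_proc_count` is hypothetical, and the node, package, and core counts are the ones quoted in this comment.

```python
def ppr_proc_count(procs_per_resource, resources_per_node, num_nodes):
    """Process count implied by a ppr:<n>:<resource> request (illustrative only)."""
    return procs_per_resource * resources_per_node * num_nodes


# Values from the report above: 3 nodes, each with 2 packages and 20 cores.
implied_by_ppr = ppr_proc_count(1, resources_per_node=2, num_nodes=3)
defaulted_from_cores = 20 * 3

print("implied by ppr:1:package:", implied_by_ppr)
print("defaulted from core count:", defaulted_from_cores)
```

The first value matches "Procs mapped: 6" and the second matches "Number of procs: 60" in the error output.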

prun does not export PMIX_ and PRRTE_ envars as described in the prun section on "Exported Environment Variables". If you set them before starting prte they are exported, but if you set them before running prun they are not.

shell $ export PMIX_foo=1
shell $ export PRTE_foo=1
shell $ export PRRTE_foo=1
shell $ prte --hostfile hostfile.txt --daemonize
shell $ prun -np 1 env | grep foo
PRTE_foo=1
PMIX_foo=1
PRRTE_foo=1
shell $ pterm

But this does not work:

shell $ prte --hostfile hostfile.txt --daemonize
shell $ export PMIX_foo=1
shell $ export PRTE_foo=1
shell $ export PRRTE_foo=1
shell $ prun -np 1 env | grep foo
shell $ pterm
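
For reference, the forwarding behavior I expected from prun could be modelled roughly as a prefix filter over the caller's environment. This is only a sketch of that expectation, not how prun is implemented; `forwardable_env` and `FORWARDED_PREFIXES` are hypothetical names, and treating these three prefixes as the rule is an assumption based on the test above.

```python
# Prefixes exercised in the test above; using them as the forwarding rule is an assumption.
FORWARDED_PREFIXES = ("PMIX_", "PRTE_", "PRRTE_")


def forwardable_env(environ):
    """Return the entries of environ whose names start with a forwarded prefix."""
    return {k: v for k, v in environ.items() if k.startswith(FORWARDED_PREFIXES)}


# Stand-in for the environment prun is started in.
sample = {"PMIX_foo": "1", "PRTE_foo": "1", "PRRTE_foo": "1", "HOME": "/home/user"}
for name in sorted(forwardable_env(sample)):
    print(name)
```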

This prun option is listed in the prun --help but I could not find the backing functionality (and I thought we removed the "am" options in favor of "tune" - or was that the other way around?)

`--pmixam <arg0>`

:   Aggregate MCA parameter set file list for OpenPMIx.

@jjhursey
Member Author

I noticed in this example (testing --np 6):

$ cat myhostfile
aa slots=4
bb slots=4
cc slots=4

Before (from mpirun) the --map-by node line was as follows (round-robin):

                              node aa      node bb      node cc
prun                          0 1 2 3      4 5
prun --map-by node            0 3          1 4          2 5
prun --map-by node:NOLOCAL                 0 1 2 3      4 5

But now it is (sequential):

                              node aa      node bb      node cc
prun                          0 1 2 3      4 5
prun --map-by node            0 1          2 3          4 5
prun --map-by node:NOLOCAL                 0 1 2 3      4 5

I was running with:

prte --hostfile hostfile.txt --np 6 --map-by node --bind-to :REPORT /bin/true

Is this a change in behavior or a typo in the original?
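
The two layouts can be reproduced with a short model. This is a sketch, not the PRRTE mapper: it ignores slot limits, and the function names are made up for illustration. It may also be that the difference comes from ranking rather than mapping, which this model does not distinguish.

```python
def round_robin_by_node(nodes, np):
    """Deal ranks out one node at a time, wrapping around (the "before" table)."""
    layout = {name: [] for name in nodes}
    for rank in range(np):
        layout[nodes[rank % len(nodes)]].append(rank)
    return layout


def consecutive_by_node(nodes, np):
    """Same per-node counts, but ranks are consecutive on each node (the "now" table)."""
    base, extra = divmod(np, len(nodes))
    layout, rank = {}, 0
    for i, name in enumerate(nodes):
        count = base + (1 if i < extra else 0)
        layout[name] = list(range(rank, rank + count))
        rank += count
    return layout


nodes = ["aa", "bb", "cc"]
print(round_robin_by_node(nodes, 6))  # {'aa': [0, 3], 'bb': [1, 4], 'cc': [2, 5]}
print(consecutive_by_node(nodes, 6))  # {'aa': [0, 1], 'bb': [2, 3], 'cc': [4, 5]}
```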

@jjhursey
Member Author

If the nodes are oversubscribed the binding report is empty. Should we print out something like "MCW rank 0 is unbound" or "MCW rank 0 bound to nothing"?

@jjhursey
Member Author

The max_slots option to the hostfile is not working as expected from the manpage.

It says the following, but I'm seeing that "max_slots" is being ignored and the extra processes are on aa and bb instead of being forced to cc:

Limits to oversubscription can also be specified in the hostfile itself:

% cat myhostfile
aa slots=4 max_slots=4
bb         max_slots=4
cc slots=4

The max_slots field specifies such a limit. When it does, the slots
value defaults to the limit. Now:

prun --hostfile myhostfile --np 14 ./a.out

: causes the first 12 processes to be launched as before, but the
remaining two processes will be forced onto node cc. The other two
nodes are protected by the hostfile against oversubscription by this
job since they specified the max_slots option.
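
As a reference point, the behavior the manpage text describes can be modelled as follows. This is a sketch of the documented behavior only, not of what PRRTE currently does; the `place` function and the tuple layout are hypothetical.

```python
def place(hosts, np):
    """Model of the manpage text: fill slots first, then oversubscribe only
    nodes that have no max_slots limit (or are still below it).

    hosts is a list of (name, slots, max_slots); None means "not specified".
    """
    counts = {}
    remaining = np
    # First pass: fill each node up to its slot count.
    for name, slots, max_slots in hosts:
        if slots is None:
            # Per the manpage, slots defaults to max_slots when only the limit is given.
            slots = max_slots if max_slots is not None else 1
        take = min(slots, remaining)
        counts[name] = take
        remaining -= take
    # Second pass: place leftover processes on nodes that may be oversubscribed.
    while remaining > 0:
        progressed = False
        for name, _slots, max_slots in hosts:
            if remaining == 0:
                break
            if max_slots is None or counts[name] < max_slots:
                counts[name] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise ValueError("every node is at its max_slots limit")
    return counts


myhostfile = [("aa", 4, 4), ("bb", None, 4), ("cc", 4, None)]
print(place(myhostfile, 14))  # {'aa': 4, 'bb': 4, 'cc': 6}
```

Under this model the two extra processes land on cc, which is what the manpage says and not what I am seeing.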

@jjhursey
Member Author

The sequential mapper threw an unexpected error without the -np arg, and then segv'ed with it, using the example from the manpage:

shell $ cat hostfile.txt 
c712f6n01 slots=4
c712f6n02 slots=4
c712f6n03 slots=4
shell $ prte --hostfile hostfile.txt --prtemca rmaps seq  /bin/true
--------------------------------------------------------------------------
A sequential map was requested, but not enough node entries
were given to support the requested number of processes:

  Num procs:  12
  Num nodes:  3

We cannot continue - please adjust either the number of processes
or provide more node locations in the file.
--------------------------------------------------------------------------
shell $ prte --hostfile hostfile.txt --np 3 --prtemca rmaps seq  /bin/true
[c712f6n02:04818] *** Process received signal ***
[c712f6n03:145520] *** Process received signal ***
[c712f6n03:145520] Signal: Segmentation fault (11)
[c712f6n03:145520] Signal code: Address not mapped (1)
[c712f6n03:145520] Failing at address: 0x5c
[c712f6n03:145520] [ 0] [0x3fff92b90478]
[c712f6n03:145520] [ 1] [0x0]
[c712f6n03:145520] [ 2] /smpi_dev/jjhursey/dev/pmix/install/prrte-master-x-master-dbg/lib/libprrte.so.2(prte_odls_base_spawn_proc+0x134)[0x3fff92ad12c4]
[c712f6n03:145520] [ 3] /smpi_dev/jjhursey/local/libevent/lib/libevent_core-2.1.so.6(+0x26cb8)[0x3fff926b6cb8]
[c712f6n03:145520] [ 4] /smpi_dev/jjhursey/local/libevent/lib/libevent_core-2.1.so.6(event_base_loop+0x504)[0x3fff926b77a4]
[c712f6n03:145520] [ 5] prted[0x10008928]
[c712f6n03:145520] [ 6] /lib64/libc.so.6(+0x25280)[0x3fff91da5280]
[c712f6n03:145520] [ 7] [c712f6n02:04818] Signal: Segmentation fault (11)
[c712f6n02:04818] Signal code: Address not mapped (1)
[c712f6n02:04818] Failing at address: 0x5c
[c712f6n02:04818] [ 0] [0x3fffa0680478]
[c712f6n02:04818] [ 1] [0x0]
[c712f6n02:04818] [ 2] /smpi_dev/jjhursey/dev/pmix/install/prrte-master-x-master-dbg/lib/libprrte.so.2(prte_odls_base_spawn_proc+0x134)[0x3fffa05c12c4]
[c712f6n02:04818] [ 3] /smpi_dev/jjhursey/local/libevent/lib/libevent_core-2.1.so.6(+0x26cb8)[0x3fffa01a6cb8]
[c712f6n02:04818] [ 4] /smpi_dev/jjhursey/local/libevent/lib/libevent_core-2.1.so.6(event_base_loop+0x504)[0x3fffa01a77a4]
[c712f6n02:04818] [ 5] prted[0x10008928]
[c712f6n02:04818] [ 6] /lib64/libc.so.6(+0x25280)[0x3fff9f895280]
[c712f6n02:04818] [ 7] /lib64/libc.so.6(__libc_start_main+0xc4)[0x3fff91da5474]
[c712f6n03:145520] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xc4)[0x3fff9f895474]
[c712f6n02:04818] *** End of error message ***
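
The first error is consistent with a mapper that places one process per hostfile entry, in file order, and with the process count defaulting to the total slot count (12) when -np is omitted. A minimal model of that reading, with a hypothetical `seq_map` helper and no claim about the real implementation:

```python
def seq_map(node_entries, np):
    """Place rank i on the i-th hostfile entry; fail if there are too few entries."""
    if np > len(node_entries):
        raise ValueError(
            f"not enough node entries: Num procs: {np}, Num nodes: {len(node_entries)}"
        )
    return list(zip(range(np), node_entries))


entries = ["c712f6n01", "c712f6n02", "c712f6n03"]

# With --np 3 there is one entry per process.
print(seq_map(entries, 3))

# Without --np the count appears to default to the slot total (3 nodes x 4 slots).
try:
    seq_map(entries, 12)
except ValueError as err:
    print("error:", err)
```

This only explains the first message; the segv with --np 3 is a separate problem.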

@jjhursey
Member Author

Left to do:

  • There are some ??? places that need descriptions.
  • The prte-map manpage has some empty sections at the end that still need content.
  • The prte-map manpage needs some more "frequently requested" examples - there is a special section for these.
  • The examples in prte-map need to be validated. I dropped some notes on unexpected behavior when doing some limited testing into the comments in this PR. We'll need to go back and fix those before release.

I'm off next week (back June 1) - the community can feel free to push updates to my branch if they want to help with the pages. Otherwise I'll keep working on this when I get back.

@rhc54
Contributor

rhc54 commented May 22, 2020

I think we are going to hit a lot of confusion if we aren't careful here. First, you cannot execute the cmds as you are showing them here:

$ prte --hostfile hostfile.txt --prtemca rmaps seq  /bin/true

The prte binary is hard coded to never directly run an application - it can only be used to directly start an application if you have symlink'd it to another command such as mpirun. Showing it this way is going to confuse people as it won't work if they copy/paste this into a shell.

So I have to assume that the errors you are reporting are from you actually running those cmds using something like mpirun. This is important as PRRTE currently automatically sets the personality to OMPI - but that will definitely not be the case in the future. So we need to be very clear on just what cmd is actually being executed in these examples and in the errors being posted on these issues (so I don't spend wasted cycles trying to reproduce).

@rhc54
Contributor

rhc54 commented May 25, 2020

I fixed the sequential mapper, and I also added a new prterun command that basically is just a symlink to prte. The fact that the command isn't prte means that it will execute as a one-shot just like mpirun, but it will not automatically pick up a personality - i.e., it will retain only the base PRRTE options. This might help serve for the examples in the documentation.

@jjhursey
Member Author

jjhursey commented Jun 3, 2020

Actually I was running prte directly in those examples, not mpirun and it was launching fine - just without the OMPI personality. I'll switch over to use prterun as I agree that is clearer.

For the PRRTE man pages I want them to reflect the PRRTE behavior without any personality. Then we can have separate sections for various personalities or something.

@drwootton
Contributor

I tested a number of variations of the --map-by, --bind-to and --rank-by options with prterun and found the following problems where it appears the documentation (prte-map.1.md) should be updated.

The cluster I tested this with had 4 Power 8 nodes with 2 packages (sockets) per node, each with 10 cores and each core with 8 hwthreads (160 hwthreads per node). The launch/local node was a Power 9 node.

The naming convention for hostfiles is hostfile<N>, where <N> is the number of slots specified with the slots= keyword, and where the hostfile lists the 4 nodes in the cluster.

  1. I ran `prterun -n 4 --hostfile1 --map-by pp1:1:core taskinfo`, where the hostfile lists 4 nodes with slots=1 specified for each. An error message was issued, including the following text:

In all the above cases, if you want PRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

I added the --use-hwthread-cpus option and got an error message stating that it was an unknown option.
I talked with @jjhursey about this and he suggested I note the problem here even though it seems to be a problem with the error message text.

  2. The example output for the --report-binding option in the Mapping, Binding and Ranking: Advanced section does not match the current output when I run prterun.

  3. I ran `prterun -n 8 --hostfile hostfile4 --map-by ppr:1:slot taskinfo`, which resulted in an error message telling me the slot keyword was invalid.

The help text told me that numa was one of the allowed choices. When I replaced slot with numa I got a different error message telling me that numa was invalid.

  4. The Quick Summary section explains the default binding patterns. This section should clarify whether the 2 task limit is tasks per node or tasks across the entire cluster.

  5. There's an example in the Mapping Processes to Nodes: Using Policies section which I ran as `prterun -n 14 --hostfile hostfile4a taskinfo`, where hostfile4a lists 3 nodes with 4 slots each (12 total). The example says this should run, but I get an error message telling me there are not enough slots available (expected, since I did not specify an oversubscribe option).

The same error occurs for `prun --host c712f6n01,c712f6n02 --np 8 ./a.out`, where the text says the additional 6 tasks should be allocated to these two nodes as well. As expected, this does not work with prterun unless an oversubscribe option is specified.

  6. The default --map-by option behavior is not documented.
    The explanation for the --map-by option in the Process Mapping / Ranking / Binding Options section refers to the Quick Summary section for an explanation of the default behavior. Maybe the sentence stating that by default nodes are scheduled in round-robin fashion is what is being referred to, but it was not immediately clear that was the reference.
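
The slot-count observation in the list above (14 tasks against 12 slots) amounts to the following check. This is a sketch of the behavior I observed, not of the PRRTE source; `fits` is a hypothetical helper and the node names in the dictionary are placeholders, since only the slot counts of hostfile4a are given here.

```python
def fits(slots_per_node, np, oversubscribe=False):
    """True if np processes can be placed given the slot counts."""
    total_slots = sum(slots_per_node.values())
    return oversubscribe or np <= total_slots


# hostfile4a: 3 nodes with 4 slots each (node names are placeholders).
hostfile4a = {"node1": 4, "node2": 4, "node3": 4}

print(fits(hostfile4a, 14))                      # False: 14 > 12 slots
print(fits(hostfile4a, 14, oversubscribe=True))  # True
print(fits(hostfile4a, 12))                      # True
```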

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
 * Remove
   - `pmixam` since it wasn't being processed
 * Add
   - `gmca`
   - `gprtemca`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
 * Still lots to cleanup and verify here.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@drwootton
Contributor

I'm not sure if this is a documentation problem or a code problem, but if I run the command prterun -n 24 --hostfile8 --bind-to slot taskinfo I get a message that the binding policy slot is not recognized.

@rhc54
Contributor

rhc54 commented Nov 3, 2020

> the binding policy slot is not recognized.

Correct - "slot" has no physical meaning, so we cannot bind you to it.

@jjhursey
Member Author

jjhursey commented Dec 3, 2020

Per #696 clarify that PE-LIST is an unordered listing that is processed in numerical order. Meaning 0,1,2,3 / 3,2,1,0 / 1,3,2,0 are the same and processed as 0,1,2,3

@rhc54
Contributor

rhc54 commented Dec 3, 2020

> unordered listing that is processed in numerical order.

I don't think that sentence makes sense, nor do I think that is what is happening. The PE-LIST is an unordered list of CPUs, and we treat the result as exactly that - an unordered "pool" of CPUs. Depending upon what you are doing, that could result in things being processed in numerical order, or it might not.
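
One way to picture the "unordered pool" reading is as a set: the three spellings from the earlier comment describe the same pool, and any particular processing order is a property of the consumer, not of the list. A small sketch, with a hypothetical `as_cpu_pool` helper that only handles plain comma-separated IDs (no ranges):

```python
def as_cpu_pool(pe_list):
    """Parse a comma-separated CPU list into an unordered pool."""
    return frozenset(int(cpu) for cpu in pe_list.split(","))


pools = [as_cpu_pool(s) for s in ("0,1,2,3", "3,2,1,0", "1,3,2,0")]

# All three spellings describe the same pool of CPUs.
print(pools[0] == pools[1] == pools[2])

# A consumer that happens to walk the pool in numerical order would see this,
# but nothing about the pool itself guarantees that order.
print(sorted(pools[0]))
```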

@jjhursey
Member Author

Replaced by PR #773

@jjhursey jjhursey closed this Feb 23, 2021
@jjhursey jjhursey deleted the map-bind-md branch February 23, 2021 02:36
Development

Successfully merging this pull request may close these issues.

Need to update prun.1 manpage with CLI changes
3 participants