Unable to make prterun use round-robin mapping behavior #674

Closed
drwootton opened this issue Nov 3, 2020 · 10 comments
@drwootton (Contributor)

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Master branch

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Master branch

Please describe the system on which you are running

  • Operating system/version:
    RHEL 7.7

  • Computer hardware:

4 Power 8 nodes, 2 sockets each with 10 cores (20 total) and 8 hwthreads per core (160 total)

  • Network type:

Hostfile specifies 4 nodes, each with slots=8 keyword

Details of the problem

I tried to make prterun use round-robin mapping and have been unable to make that work.

I started with the command prterun -n 24 --hostfile hostfile8 --map-by package --bind-to package:REPORT where hostfile8 lists 4 nodes with 8 slots each.

The binding report showed that the 24 tasks were allocated as 8 tasks on each of the first 3 nodes, with the 4th node left empty:

[c712f6n01:08305] MCW rank 0 bound to package[0][core:0-9]
[c712f6n01:08030] MCW rank 1 bound to package[1][core:10-19]
[c712f6n01:08030] MCW rank 2 bound to package[0][core:0-9]
[c712f6n01:08030] MCW rank 3 bound to package[1][core:10-19]
[c712f6n01:08030] MCW rank 4 bound to package[0][core:0-9]
[c712f6n01:08030] MCW rank 5 bound to package[1][core:10-19]
[c712f6n01:08030] MCW rank 6 bound to package[0][core:0-9]
[c712f6n01:08030] MCW rank 7 bound to package[1][core:10-19]
[c712f6n02:12311] MCW rank 8 bound to package[0][core:0-9]
[c712f6n02:12311] MCW rank 9 bound to package[1][core:10-19]
[c712f6n02:12311] MCW rank 10 bound to package[0][core:0-9]
[c712f6n02:12311] MCW rank 11 bound to package[1][core:10-19]
[c712f6n02:12311] MCW rank 12 bound to package[0][core:0-9]
[c712f6n02:12311] MCW rank 13 bound to package[1][core:10-19]
[c712f6n02:12311] MCW rank 14 bound to package[0][core:0-9]
[c712f6n02:12311] MCW rank 15 bound to package[1][core:10-19]
[c712f6n03:86579] MCW rank 16 bound to package[0][core:0-9]
[c712f6n03:86579] MCW rank 17 bound to package[1][core:10-19]
[c712f6n03:86579] MCW rank 18 bound to package[0][core:0-9]
[c712f6n03:86579] MCW rank 19 bound to package[1][core:10-19]
[c712f6n03:86579] MCW rank 20 bound to package[0][core:0-9]
[c712f6n03:86579] MCW rank 21 bound to package[1][core:10-19]
[c712f6n03:86579] MCW rank 22 bound to package[0][core:0-9]
[c712f6n03:86579] MCW rank 23 bound to package[1][core:10-19]

Then I changed the --map-by option to --map-by package:SPAN and the allocation changed to 6 tasks on each of the 4 nodes, so SPAN did balance the allocation, but tasks are still not allocated round-robin:

[c712f6n01:08401] MCW rank 0 bound to package[0][core:0-9]
[c712f6n01:08401] MCW rank 1 bound to package[1][core:10-19]
[c712f6n01:08401] MCW rank 2 bound to package[0][core:0-9]
[c712f6n01:08401] MCW rank 3 bound to package[1][core:10-19]
[c712f6n01:08401] MCW rank 4 bound to package[0][core:0-9]
[c712f6n01:08401] MCW rank 5 bound to package[1][core:10-19]
[c712f6n02:12678] MCW rank 6 bound to package[0][core:0-9]
[c712f6n02:12678] MCW rank 7 bound to package[1][core:10-19]
[c712f6n02:12678] MCW rank 8 bound to package[0][core:0-9]
[c712f6n02:12678] MCW rank 9 bound to package[1][core:10-19]
[c712f6n02:12678] MCW rank 10 bound to package[0][core:0-9]
[c712f6n02:12678] MCW rank 11 bound to package[1][core:10-19]
[c712f6n03:86942] MCW rank 12 bound to package[0][core:0-9]
[c712f6n03:86942] MCW rank 13 bound to package[1][core:10-19]
[c712f6n03:86942] MCW rank 14 bound to package[0][core:0-9]
[c712f6n03:86942] MCW rank 15 bound to package[1][core:10-19]
[c712f6n03:86942] MCW rank 16 bound to package[0][core:0-9]
[c712f6n03:86942] MCW rank 17 bound to package[1][core:10-19]
[c712f6n04:74518] MCW rank 18 bound to package[0][core:0-9]
[c712f6n04:74518] MCW rank 19 bound to package[1][core:10-19]
[c712f6n04:74518] MCW rank 20 bound to package[0][core:0-9]
[c712f6n04:74518] MCW rank 21 bound to package[1][core:10-19]
[c712f6n04:74518] MCW rank 22 bound to package[0][core:0-9]
[c712f6n04:74518] MCW rank 23 bound to package[1][core:10-19]

In the current documentation (#557, prte-mp.1.md) there is a statement that tasks are scheduled round-robin by default. I took that to describe default mapping behavior, so I tried prterun -n 32 --hostfile hostfile8 --bind-to package:REPORT taskinfo and got the same bind report as the first:

[c712f6n01:08699] MCW rank 0 bound to package[0][core:0-9]
[c712f6n01:08699] MCW rank 1 bound to package[1][core:10-19]
[c712f6n01:08699] MCW rank 2 bound to package[0][core:0-9]
[c712f6n01:08699] MCW rank 3 bound to package[1][core:10-19]
[c712f6n01:08699] MCW rank 4 bound to package[0][core:0-9]
[c712f6n01:08699] MCW rank 5 bound to package[1][core:10-19]
[c712f6n01:08699] MCW rank 6 bound to package[0][core:0-9]
[c712f6n01:08699] MCW rank 7 bound to package[1][core:10-19]
[c712f6n02:12968] MCW rank 8 bound to package[0][core:0-9]
[c712f6n02:12968] MCW rank 9 bound to package[1][core:10-19]
[c712f6n02:12968] MCW rank 10 bound to package[0][core:0-9]
[c712f6n02:12968] MCW rank 11 bound to package[1][core:10-19]
[c712f6n02:12968] MCW rank 12 bound to package[0][core:0-9]
[c712f6n02:12968] MCW rank 13 bound to package[1][core:10-19]
[c712f6n02:12968] MCW rank 14 bound to package[0][core:0-9]
[c712f6n02:12968] MCW rank 15 bound to package[1][core:10-19]
[c712f6n03:87227] MCW rank 16 bound to package[0][core:0-9]
[c712f6n03:87227] MCW rank 17 bound to package[1][core:10-19]
[c712f6n03:87227] MCW rank 18 bound to package[0][core:0-9]
[c712f6n03:87227] MCW rank 19 bound to package[1][core:10-19]
[c712f6n03:87227] MCW rank 20 bound to package[0][core:0-9]
[c712f6n03:87227] MCW rank 21 bound to package[1][core:10-19]
[c712f6n03:87227] MCW rank 22 bound to package[0][core:0-9]
[c712f6n03:87227] MCW rank 23 bound to package[1][core:10-19]
@drwootton (Contributor, Author)

There are at least two use cases I can think of where the layout of tasks matters:

  1. Single-threaded MPI tasks, where I might want to pack as many tasks onto a node as I can so that MPI can take advantage of shared-memory message passing.
  2. Mixed-mode MPI/OpenMP tasks, where tasks have threads and I want to distribute them across all nodes in my allocation, round-robin or spanned, such that M*N does not exceed the number of available resources requested, to avoid different tasks' threads competing for CPUs or hwthreads.

It seems to me that anything more complicated than that is a matter for :pe_list or rankfiles.

@rhc54 (Contributor) commented Nov 3, 2020

I'm not sure what you mean by "round-robin" - could you explain what result you were trying to achieve? It looks to me like it is indeed performing a "round-robin" mapping, so I suspect our definitions of that term are different.

@drwootton (Contributor, Author)

For round-robin, if I have 8 tasks to run on 4 nodes, then I expect the following node/task mapping:
node 0/task 0,task 4
node 1/task 1,task 5
node 2/task 2,task 6
node 3/task 3,task 7

What I am seeing is tasks mapped sequentially on a node until the node is fully allocated, and only then is the next node used.
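As an illustration (not PRRTE code, just a sketch of the placement rule being described), "round-robin by node" would cycle consecutive ranks across nodes before any node receives a second task:

```python
# Sketch of the "round-robin by node" placement expected above:
# rank r lands on node (r mod number-of-nodes).
def rr_by_node(ntasks, nodes):
    placement = {n: [] for n in nodes}
    for rank in range(ntasks):
        placement[nodes[rank % len(nodes)]].append(rank)
    return placement

# 8 tasks on 4 nodes gives node0 -> ranks [0, 4], node1 -> [1, 5], etc.
print(rr_by_node(8, ["node0", "node1", "node2", "node3"]))
```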

@drwootton (Contributor, Author)

Also, if I am mapping or binding by something smaller than a node, like core, I still expect tasks to be mapped similarly, where there might be only one core used on each node.

@rhc54 (Contributor) commented Nov 3, 2020

That would be round-robin by node, not package. Round-robin by package would result in mapping tasks evenly across the two packages on node 0 until that node was full, and then moving on to the next node - which looks like exactly what PRRTE is doing. Round-robin by package:span would map tasks evenly across all packages across all nodes - which again looks like what it did.
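The two behaviors being contrasted can be sketched as follows (a simplified illustration, not PRRTE's actual implementation; it only models tasks per node, assuming 2 packages per node with ranks alternating between packages within a node):

```python
# Sketch of "round-robin by package" vs "round-robin by package:span",
# as described above. Returns the number of tasks placed on each node.
def map_by_package(ntasks, nnodes, slots, span=False):
    per_node = [0] * nnodes
    if span:
        # :SPAN - spread tasks evenly across all nodes' packages.
        base = ntasks // nnodes
        per_node = [base] * nnodes
        leftover = ntasks - base * nnodes
        for i in range(leftover):
            per_node[i % nnodes] += 1
    else:
        # Default - fill each node to its slot limit, alternating
        # packages within the node, before moving to the next node.
        remaining = ntasks
        for i in range(nnodes):
            per_node[i] = min(slots, remaining)
            remaining -= per_node[i]
    return per_node

print(map_by_package(24, 4, 8))             # [8, 8, 8, 0]
print(map_by_package(24, 4, 8, span=True))  # [6, 6, 6, 6]
```

This reproduces the two bind reports shown earlier: 24 tasks fill the first 3 nodes without :SPAN, and balance to 6 per node with it.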

@drwootton (Contributor, Author)

If I specify --map-by node and --bind-to core, then I might be getting the round-robin behavior I was expecting, because with that I see all 4 nodes allocated with 6 tasks per node, where tasks 0-5 use cores 0-5 on node 1, tasks 6-11 use cores 0-5 on node 2, etc.

If mapping and binding are working the way that was intended, that's fine. I just wanted to be sure nothing was broken. @jjhursey asked me to do some testing of mapping, binding, and ranking. From the documentation I understood that the default behavior was to map round-robin and that specifying --map-by options would change the allocation, so it wasn't clear to me whether I was getting correct round-robin behavior.

Maybe I'm also too worried about where specific task ranks are assigned, and maybe I'm being further confused by the ranking step.

@rhc54 (Contributor) commented Nov 3, 2020

default behavior was to map round-robin

Round-robin by package is indeed the default - however, it is not package:span.

From what you show, it is working as expected.

@jjhursey (Member) commented Dec 3, 2020

Let me double-check my understanding here.

If we have the hostfile below:

hostfile:
----------
node1 slots=8
node2 slots=8
node3 slots=8
node4 slots=8

Let X(P0) mean place a process on package 0, and X(P1) mean place a process on package 1.

And the rule:

Round-robin by package would map tasks evenly across the two packages on a node until it is full (slot limit), then move to the next node.

Or more generally:

Round-robin by OBJ would map tasks evenly across the N OBJs on a node until the node is full (slot limit), then move to the next node.

Example 1:

prterun -n 24 --hostfile hostfile --map-by package --bind-to package:REPORT a.out

This will result in 8 processes on each of the first 3 nodes (since slots=8):

      | 
node1 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node2 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node3 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node4 |                                                 |

Example 2:

prterun -n 24 --hostfile hostfile --map-by package --bind-to package:REPORT a.out

If you changed the hostfile to have slots=6, then you would see 6 processes on each of the 4 nodes (since slots=6):

      | 
node1 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node2 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node3 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node4 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |

Example 3:

For :SPAN it will try to evenly distribute the tasks over the nodes in the allocation. So first it calculates np / nodes, which is 24 / 4 = 6. Note that the --map-by resource is not used as the divisor; it is always the number of nodes (right?).

prterun -n 24 --hostfile hostfile --map-by package:SPAN --bind-to package:REPORT

This would result in 6 processes per node across all 4 nodes. Slots are only used to determine whether we oversubscribe.

      | 
node1 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node2 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node3 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |
node4 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) |

Example 4:

prterun -n 24 --hostfile hostfile --bind-to package:REPORT (implies --map-by package since np>2)

This should result in 8 processes on each of the first 3 nodes:

      | 
node1 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node2 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node3 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node4 |                                                 |

Example 5:

prterun -n 32 --hostfile hostfile --bind-to package:REPORT (implies --map-by package since np>2)

This should result in 8 processes on each of the 4 nodes:

      | 
node1 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node2 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node3 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 
node4 | X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) X(P0) X(P1) | 

@drwootton In your original report, the -n 32 option only displayed 24 processes instead of all 32. Was that correct or did you mean -n 24?

Commentary

Note 1

We need to document that the default is --map-by package when np > 2 and --map-by core when np <= 2. We document the --bind-to version of that, but it seems that --map-by has the same logic. See this comment for an example. Is that expected behavior?

Note 2

For :SPAN, if we have an uneven number of slots per host, will it use the smallest number of slots available?
For example with the hostfile:

node1 slots=4
node2 slots=2
node3 slots=4

A -np 6 would put 2 per node. Would/Should -np 8 result in an error or would it put 3 on node1, 2 on node2, 3 on node3?

Note 3

My misunderstanding was that, given the definition of round-robin below (which is fine), the round-robin doesn't hop to the next node when it hits the hardware limit for the OBJ, but rather when it hits the slot limit.

Round-robin by OBJ would map tasks evenly across the N OBJs on a node until the node is full (slot limit), then move to the next node.

So if I called prterun -n 28 --hostfile hostfile --map-by package --bind-to package:REPORT a.out

If we hopped to the next node when we hit the number of packages (2 in this case), then we would iterate the mapping four times, with the last iteration placing 2 processes on each of node1 and node2:

      | iter 1      | iter 2      | iter 3      | iter 4      |
node1 | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) | 
node2 | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) | 
node3 | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) |             |
node4 | X(P0) X(P1) | X(P0) X(P1) | X(P0) X(P1) |             | 

Rather, what happens is that we ignore the package hardware limit and use the slots listing, thus filling the first three nodes and placing the remaining 4 processes on the last node:

      | 
node1 | X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1) | 
node2 | X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1) | 
node3 | X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1)  X(P0) X(P1) | 
node4 | X(P0) X(P1)  X(P0) X(P1)                           | 

Is my understanding here correct? If so then I'll write up this example (in a better form) for the documentation.

@rhc54 (Contributor) commented Dec 3, 2020

For :SPAN it will try to evenly distribute the tasks over the nodes in the allocation. So first it calculates np / nodes, which is 24 / 4 = 6. Note that the --map-by resource is not used as the divisor; it is always the number of nodes (right?).

I'm afraid that is not correct. The divisor is the total number of OBJs of that type across all allocated nodes. In this case, you have two packages on each of 4 nodes, so that means the divisor is 8, and the mapper will assign 3 procs to each package of every node.

Note 1
We need to document that the default is --map-by package when np > 2 and --map-by core when np <= 2. We document the --bind-to version of that, but it seems that --map-by has the same logic.

As stated in the other issue, it does

Note 2
For :SPAN if we have an uneven number of slots per host will it use the smallest number of slots available?

No - as stated above, SPAN applies to the object type, not the nodes. What actually happens here is a little more complicated. First, we check the number of requested procs against the total number of available slots (summed across all allocated nodes) to see if we can even run the job. If there aren't enough slots, then we check if oversubscribe is allowed. If not, we immediately error out. If it is, then we continue.

Next, we compute the average number of procs/object by dividing the number of procs by the total number of objects across all allocated nodes. We then begin assigning that number to each object, constrained by the number of slots on the node.

At the end of that pass, we see that we have leftover procs. Given that oversubscription was allowed (or else we would have errored out right away), we go back and add one proc at a time in a round-robin fashion across the objects until we have placed all the procs.
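The :SPAN logic described here can be sketched roughly as follows (a simplified illustration under stated assumptions, not PRRTE source: identical package counts per node, all names hypothetical, and the leftover pass shown here does not re-check slot limits the way the real mapper does):

```python
# Rough sketch of the :SPAN mapping steps described above.
# Returns the number of procs placed on each node.
def span_map(nprocs, slots_per_node, pkgs_per_node, oversubscribe=False):
    nnodes = len(slots_per_node)
    # Step 1: can the job run at all?
    if nprocs > sum(slots_per_node) and not oversubscribe:
        raise RuntimeError("not enough slots and oversubscription not allowed")
    # Step 2: average procs per object, dividing by the total number
    # of objects across ALL allocated nodes (not by the node count).
    nobjs = nnodes * pkgs_per_node
    base = nprocs // nobjs
    # Assign 'base' procs to each package, constrained by each node's slots.
    per_node = [min(base * pkgs_per_node, s) for s in slots_per_node]
    # Step 3: distribute any leftover procs one at a time, round-robin.
    leftover = nprocs - sum(per_node)
    i = 0
    while leftover > 0:
        per_node[i % nnodes] += 1
        leftover -= 1
        i += 1
    return per_node

print(span_map(24, [8, 8, 8, 8], 2))  # [6, 6, 6, 6]
print(span_map(6, [4, 2, 4], 2))      # [2, 2, 2]
```

With 24 procs, 4 nodes, and 2 packages per node, the divisor is 8 objects, giving 3 procs per package and hence 6 per node, matching the correction above. The uneven-slots case from Note 2 with -np 6 also comes out to 2 per node.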

Note 3
Is my understanding here correct?

Yes - because you didn't say :SPAN, we fill each node before moving on to the next. So you will indeed get the layout in your last figure.

@jjhursey (Member) commented Dec 4, 2020

Ok. That makes sense to me then. I've flagged this as a documentation item. I'll try to summarize this in the documentation around --map-by and :SPAN.

@jjhursey jjhursey added this to the v2.0.0 milestone Mar 25, 2021
@jjhursey jjhursey self-assigned this Mar 25, 2021
@rhc54 rhc54 closed this as completed May 24, 2021