mpirun args for multi-node run with multiple cores per MPI task #10071
Comments
The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected. From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56, so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node? You are looking for output like the "ALLOCATED NODES" report shown in the quoted reply further down.
What version of Slurm are you running?
Copying from mailing list
I wouldn't be surprised if it is indeed an interaction with Slurm. The only evidence I have that Open MPI and Slurm are talking to each other is that "mpirun exec" without "-np" works as expected, one MPI task per core on each node.
Trying to use 2 x 32-core nodes, the command is the mpirun line shown in the quoted reply below.
Note that if I switch to "--bind-to none", it runs, but all 8 MPI tasks are placed on the first node. I guess that's consistent with it complaining about not having enough cores - there aren't 8 x 8 = 64 on the first node. It looks like "--map-by" refuses to consider more than 1 node. If I manually generate a rankfile (with all hosts and slots listed explicitly) it works, by the way. Here that'd be the rankfile shown in the quoted reply below,
so I can work around the problem.
19.05.7, which I know is rather old.
IIRC there was an issue with Slurm 19 where it was aggressively binding (per https://github.com/open-mpi/ompi/pull/6674/files). I'm curious if you set …
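Purely as an illustration (the variable name here is an assumption, not something stated in the thread), the kind of Slurm-side CPU-binding control one might check from inside the job looks like this:

# Assumed example: check whether Slurm exports a CPU-binding policy that the
# Open MPI daemons launched on the remote nodes would inherit.
env | grep -i SLURM_CPU_BIND

# If something is set, one experiment is to relax it before calling mpirun.
export SLURM_CPU_BIND=none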
If you mean doing something like …
Try setting it in your .bashrc. From the allocation report it looks like we are seeing all of the slots.
Are you able to use …? If those ideas do not work then someone with a Slurm environment would need to chime in to see what might be going on.
I tried that (.bashrc and .bash_profile, just in case), and it didn't make a difference.
That gave a different error starting up.
Can you think of anything else the …?
The … You can also enable debugging output for the daemon launch by adding …. The only thing I can think of at this point is that the daemon on the remote side is being restricted by Slurm somehow when we try to do the binding.
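One commonly used knob for this in Open MPI (named here only as an illustration, not necessarily the flag suggested in the thread) is the plm_base_verbose MCA parameter, which traces how the launcher starts the remote daemons:

# Assumed executable name ./a.out; the interesting output is the ssh/srun
# command line the launcher (plm) builds to start the daemons on remote nodes.
mpirun --mca plm_base_verbose 10 \
       -np 8 --map-by slot:PE=8 --bind-to core --report-bindings ./a.out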
I'll try, but if that were the case, why would it be any different with the explicit rankfile? Presumably it'd still be starting the remote daemon the same way.
When I ssh from the job's main execute node to the other allocated nodes, the PATH in that ssh session appears to be just /usr/bin and /usr/local/bin, which is much shorter than the PATH the job script sees when it's running on the job's head execution node. I'm not sure how ssh decides what environment to export and/or what shell to start, or how to control that for the ssh processes that mpirun spawns.
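A quick way to see what a non-interactive ssh (the kind mpirun uses to launch daemons) actually gets, and the standard workaround when the remote PATH is too bare, is sketched below; the node name, install prefix, and executable are placeholders:

# Non-interactive ssh does not start a login shell, so ~/.bash_profile is
# typically not read; check what PATH the remote side really sees and
# whether the Open MPI daemon is findable there.
ssh compute-7-18 'echo $PATH; which orted'

# mpirun's --prefix option prepends the given Open MPI install tree to
# PATH and LD_LIBRARY_PATH on the remote nodes before starting the daemons.
mpirun --prefix /path/to/openmpi -np 8 --map-by slot:PE=8 --bind-to core ./a.out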
On Mar 15, 2022, at 9:52 AM, Josh Hursey wrote:
The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected.
I wouldn't be surprised if it is indeed an interaction with slurm. The only evidence I have that OpenMPI and slurm are talking to each other is that "mpirun exec" without "-np" works as expected, one MPI task per core on each node.
From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56 so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node?
You are looking for output that looks something like:
====================== ALLOCATED NODES ======================
node7: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
node8: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================
Trying to use 2 x 32 core nodes, the command is
mpirun -np 8 --map-by slot:PE=8 --bind-to core --report-bindings --display-allocation
and I get the following output
====================== ALLOCATED NODES ======================
compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: compute-7-17
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
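For reference, the "overload-allowed" override that this message mentions is a qualifier appended to the binding directive (the executable name below is a placeholder); it only permits overloading the detected core rather than fixing the mapping:

mpirun -np 8 --map-by slot:PE=8 --bind-to core:overload-allowed --report-bindings ./a.out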
Note that if I switch to "--bind-to none", it runs, but all 8 MPI tasks are placed on the first node. I guess that's consistent with it complaining about not having enough cores - there aren't 8x8 = 64 on the first node. It looks like the "--map-by" refuses to consider more than 1 node.
If I manually generate a rankfile (with all hosts and slots listed explicitly) it works, by the way. Here that'd be
rank 0=compute-7-17 slot=0-7
rank 1=compute-7-17 slot=8-15
rank 2=compute-7-17 slot=16-23
rank 3=compute-7-17 slot=24-31
rank 4=compute-7-18 slot=0-7
rank 5=compute-7-18 slot=8-15
rank 6=compute-7-18 slot=16-23
rank 7=compute-7-18 slot=24-31
so I _can_ work around the problem.
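For completeness, such a rankfile is handed to mpirun with the --rankfile option; the file name and executable here are just placeholders:

# my_rankfile contains the eight "rank N=host slot=a-b" lines above
mpirun -np 8 --rankfile my_rankfile --report-bindings ./a.out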
What version of Slurm are you running?
19.05.7, which I know is rather old.
Noam
I'm using OpenMPI 4.1.1 with SLURM (CentOS 7), and I can't figure out how to run with a total of n_mpi_tasks = nslots / cores_per_task MPI tasks, binding each MPI task to a contiguous set of cores_per_task cores. The documentation suggests that I need mpirun -np n_mpi_tasks --map-by slot:PE=cores_per_task --bind-to core. When I try this for a single-node job (16 slots, 4 cores per task, 4 tasks), it works fine: the bindings report shows 4 MPI tasks, one bound to the first 4 physical cores, the second to the next 4, etc. However, when I try it on a multi-node job (2 x 16-core nodes, 8 tasks, 4 cores per task), it fails with an error (the "binding more processes than cpus" message quoted in the discussion above).
This is my job script, and the partition requested has 16 physical core (32 w/ hyperthreading) nodes:
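A minimal sketch of a batch script matching this description (2 x 16-core nodes, 8 MPI tasks of 4 cores each), not the author's actual script, might look like the following; the partition name and executable are placeholders:

#!/bin/bash
#SBATCH --nodes=2                # two 16-physical-core nodes
#SBATCH --ntasks-per-node=16     # expose all 16 slots per node to Open MPI
#SBATCH --partition=standard     # placeholder partition name

# 8 MPI tasks, each bound to 4 consecutive physical cores
mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings ./my_prog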
I'm not sure if this is an mpirun mapping/binding bug or just a gap in the documentation, but given that this seems like an obvious layout for a mixed MPI/OpenMP job, I think it's worth making it clearer how to do it. It might even be useful to explicitly mention something like "mixed MPI/OpenMP" in the mpirun man page, to make it easier to find.
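As context for that suggestion, the layout being asked about is the usual hybrid MPI+OpenMP pattern, where the OpenMP thread count matches the PE value; a sketch, with the executable name as a placeholder:

# 8 MPI ranks, 4 cores each; each rank runs 4 OpenMP threads on its own cores
export OMP_NUM_THREADS=4
mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings ./my_prog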