mpirun args for multi-node run with multiple cores per MPI task #10071

Open
bernstei opened this issue Mar 4, 2022 · 10 comments
@bernstei

bernstei commented Mar 4, 2022

I'm using OpenMPI 4.1.1 with SLURM (CentOS 7), and I can't figure out how to run with a total of n_mpi_tasks = nslots / cores_per_task MPI tasks while binding each task to a contiguous set of cores_per_task cores. The documentation suggests that I need mpirun -np n_mpi_tasks --map-by slot:PE=cores_per_task --bind-to core. When I try this for a single-node job (16 slots, 4 cores per task, 4 tasks), it works fine. The bindings report shows 4 MPI tasks, one bound to the first 4 physical cores, the second to the next 4, etc.

[compute-3-58.local:08478] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[compute-3-58.local:08478] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
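
For concreteness, a sketch of the single-node invocation behind this report (using hostname as a stand-in for the real executable, as in the job script further down) would be:

mpirun -np 4 --map-by slot:PE=4 --bind-to core --report-bindings hostname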

However, when I try on a multi-node job (2 x 16 core nodes, 8 tasks, 4 cores per task), it fails with an error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-3-56
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

This is my job script; the partition requested has nodes with 16 physical cores (32 with hyperthreading):

#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --partition=n2016
#SBATCH --time 0:01:00
#SBATCH --output vasp_test_distribution.32.stdout
#SBATCH --error vasp_test_distribution.32.stderr

mpirun --version

mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings hostname

I'm not sure if this is an mpirun mapping/binding bug or just a gap in the documentation, but given that this seems like an obvious layout for a mixed MPI/OpenMP job, I think it's worth making it clearer how to do it. It might even be useful to explicitly mention something like "mixed MPI/OpenMP" in the mpirun man page, to make it easier to find.
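
For context, the hybrid launch I'm ultimately after would look something like the following sketch (./my_hybrid_app is just a placeholder executable, with OMP_NUM_THREADS exported via mpirun's -x option):

mpirun -np 8 --map-by slot:PE=4 --bind-to core -x OMP_NUM_THREADS=4 ./my_hybrid_app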

@jjhursey
Member

The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected.

From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56 so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node?
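
For example, something along the lines of:

mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings --display-allocation hostname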

You are looking for output that looks something like:

======================   ALLOCATED NODES   ======================
	node7: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
	node8: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================

What version of Slurm are you running?

@bernstei
Author

bernstei commented Mar 15, 2022

Copying from the mailing list:

> The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected.

I wouldn't be surprised if it is indeed an interaction with slurm. The only evidence I have that OpenMPI and slurm are talking to each other is that "mpirun exec" without "-np" works as expected, one MPI task per core on each node.

> From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56 so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node?

Trying to use 2 x 32 core nodes, the command is
mpirun -np 8 --map-by slot:PE=8 --bind-to core --report-bindings --display-allocation
and I get the following output

======================   ALLOCATED NODES   ======================
        compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
        compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-7-17
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

Note that if I switch to "--bind-to none", it runs, but all 8 MPI tasks are placed on the first node. I guess that's consistent with it complaining about not having enough cores - there aren't 8x8 = 64 on the first node. It looks like the "--map-by" refuses to consider more than 1 node.
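
For reference, that run was along the lines of the same command with binding disabled:

mpirun -np 8 --map-by slot:PE=8 --bind-to none --report-bindings --display-allocation hostname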

If I manually generate a rankfile (with all hosts and slots listed explicitly) it works, by the way. Here that'd be

rank 0=compute-7-17 slot=0-7
rank 1=compute-7-17 slot=8-15
rank 2=compute-7-17 slot=16-23
rank 3=compute-7-17 slot=24-31
rank 4=compute-7-18 slot=0-7
rank 5=compute-7-18 slot=8-15
rank 6=compute-7-18 slot=16-23
rank 7=compute-7-18 slot=24-31
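
which I then pass to mpirun with something like the following (my_rankfile and ./my_app are just placeholders):

mpirun -np 8 --rankfile my_rankfile --report-bindings ./my_app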

so I can work around the problem.

> What version of Slurm are you running?

19.05.7, which I know is rather old.

@jjhursey
Member

IIRC there was an issue with Slurm 19 where it was aggressively binding (per https://github.com/open-mpi/ompi/pull/6674/files). I'm curious whether setting SLURM_CPU_BIND=none in your default environment resolves the issue.

@bernstei
Author

bernstei commented Mar 15, 2022

If you mean doing something like env SLURM_CPU_BIND=none sbatch job.batch, that doesn't make a difference. Note that I checked man sbatch, and neither --cpu-bind nor SLURM_CPU_BIND is documented there (--cpu-bind is mentioned in the section on --mem-bind, but is not listed as an available option).

@jjhursey
Member

Try setting it in your .bashrc (or similar). We would need the orted to pick it up.
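
For example, a line like:

export SLURM_CPU_BIND=none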

From the allocation report it looks like we are seeing all of the slots.

======================   ALLOCATED NODES   ======================
        compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
        compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
=================================================================

Are you able to use ssh to move between nodes in the allocation? If so you might try running with -mca plm ^slurm which will use the ssh launcher instead of srun.
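
For example, reusing your earlier command, something like:

mpirun -np 8 --map-by slot:PE=8 --bind-to core --report-bindings -mca plm ^slurm hostname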

If those ideas do not work then someone with the Slurm environment would need to chime in to see what might be going on.

@bernstei
Author

> Try setting it in your .bashrc (or similar). We would need the orted to pick it up.

I tried that (.bashrc and .bash_profile, just in case), and it didn't make a difference.

> From the allocation report it looks like we are seeing all of the slots.
>
> ======================   ALLOCATED NODES   ======================
>         compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
>         compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
> =================================================================

> Are you able to use ssh to move between nodes in the allocation? If so you might try running with -mca plm ^slurm which will use the ssh launcher instead of srun.

That gave a different error starting up

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

> If those ideas do not work then someone with the Slurm environment would need to chime in to see what might be going on.

Can you think of anything else the --map-by option could be doing? mpirun can do the right thing if I give it an explicit rankfile. Is there any way to get verbose output from the mapping step?

@jjhursey
Member

The bash: orted: command not found seems to indicate that Open MPI is not in your path on the remote machine. Can you ssh to that node and do a which orted mpirun - you might need to adjust the PATH on that node.
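
For example (assuming compute-7-18 is the remote node in this allocation):

ssh compute-7-18 'which orted mpirun'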

You can also enable debugging output for the daemon launch by adding -mca plm_base_verbose 100 --debug-daemons -mca odls_base_verbose 100; that will show you the command that mpirun is using to launch the remote daemons, along with (hopefully) some tracing output from the daemons as they try to launch the processes.
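
For example, something like:

mpirun -np 8 --map-by slot:PE=8 --bind-to core -mca plm_base_verbose 100 --debug-daemons -mca odls_base_verbose 100 --report-bindings hostname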

The only thing I can think of at this point is that the daemon on the remote side is being restricted by Slurm somehow when we try to do the binding.

@bernstei
Author

> The only thing I can think of at this point is that the daemon on the remote side is being restricted by Slurm somehow when we try to do the binding.

I'll try, but if that were the case, why would it be any different with the explicit rankfile? Presumably it'd still be starting the remote daemon the same way.

@bernstei
Author

When I ssh from the job's main execution node to the other allocated nodes, the PATH in that ssh session appears to be just /usr/bin and /usr/local/bin, which is much shorter than the PATH the job script sees when it's running on the job's head execution node. I'm not sure how ssh decides what environment to export and/or what shell to start, or how to control that for the ssh processes mpirun spawns.
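
One thing I may try, though I'm not sure it applies here, is pointing mpirun at the Open MPI install tree explicitly so the remote orted can be found without relying on PATH, e.g. (with /path/to/openmpi as a placeholder for wherever Open MPI is actually installed):

mpirun --prefix /path/to/openmpi -np 8 --map-by slot:PE=8 --bind-to core -mca plm ^slurm --report-bindings hostname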

@bernstei
Author

bernstei commented Oct 11, 2022 via email
