Description
I'm using OpenMPI 4.1.1 with SLURM (CentOS 7), and I can't figure out how to run a total of n_mpi_tasks = nslots / cores_per_task MPI tasks while binding each task to a contiguous set of cores_per_task cores. The documentation suggests that I need mpirun -np n_mpi_tasks --map-by slot:PE=cores_per_task --bind-to core. When I try this for a single-node job (16 slots, 4 cores per task, 4 tasks), it works fine. The bindings report shows 4 MPI tasks: the first bound to the first 4 physical cores, the second to the next 4, and so on.
[compute-3-58.local:08478] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[compute-3-58.local:08478] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
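For reference, the single-node run came from essentially the same kind of script as the multi-node one below; a minimal sketch (not my exact script, but the same flags and sizes) would look like:
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --ntasks-per-core=1
#SBATCH --partition=n2016
#SBATCH --time 0:01:00
# 4 MPI tasks, each given 4 consecutive slots and bound to 4 physical cores
mpirun -np 4 --map-by slot:PE=4 --bind-to core --report-bindings hostname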
However, when I try it on a multi-node job (2 x 16-core nodes, 8 tasks, 4 cores per task), it fails with this error:
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: CORE
Node: compute-3-56
#processes: 2
#cpus: 1
You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
This is my job script; the requested partition has nodes with 16 physical cores (32 logical CPUs with hyperthreading):
#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --partition=n2016
#SBATCH --time 0:01:00
#SBATCH --output vasp_test_distribution.32.stdout
#SBATCH --error vasp_test_distribution.32.stderr
mpirun --version
mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings hostname
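For context, the layout I'm ultimately after is the usual hybrid MPI/OpenMP one, roughly like the sketch below (not a command I've been able to verify past the failing binding step; the binary name and the OpenMP environment settings are just illustrative):
# 8 MPI ranks x 4 OpenMP threads across 2 x 16-core nodes (sketch only)
export OMP_NUM_THREADS=4    # one OpenMP thread per bound core
export OMP_PLACES=cores     # pin threads to cores
export OMP_PROC_BIND=close  # keep threads on the cores bound to their rank
mpirun -np 8 --map-by slot:PE=4 --bind-to core ./vasp_std   # binary name is an example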
I'm not sure whether this is an mpirun mapping/binding bug or just a gap in the documentation, but given that this seems like an obvious layout for a mixed MPI/OpenMP job, I think it's worth making it clearer how to do it. It might even be useful to explicitly mention something like "mixed MPI/OpenMP" in the mpirun man page, to make it easier to find.