mpirun args for multi-node run with multiple cores per MPI task #10071

Open
bernstei opened this issue Mar 4, 2022 · 10 comments
@bernstei

bernstei commented Mar 4, 2022

I'm using OpenMPI 4.1.1 with SLURM (CentOS 7), and I can't figure out how to run with a total of n_mpi_tasks = nslots / cores_per_task MPI tasks while binding each task to a contiguous set of cores_per_task cores. The documentation suggests that I need mpirun -np n_mpi_tasks --map-by slot:PE=cores_per_task --bind-to core. When I try this for a single-node job (16 slots, 4 cores per task, 4 tasks), it works fine. The bindings report shows 4 MPI tasks, one bound to the first 4 physical cores, the second to the next 4, etc.

[compute-3-58.local:08478] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[compute-3-58.local:08478] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]
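
For concreteness, a sketch of the single-node invocation behind this report (using hostname as a stand-in for the real executable, as in the job script further down) would be:

mpirun -np 4 --map-by slot:PE=4 --bind-to core --report-bindings hostname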

However, when I try on a multi-node job (2 x 16 core nodes, 8 tasks, 4 cores per task), it fails with an error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-3-56
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

This is my job script; the partition requested has nodes with 16 physical cores (32 with hyperthreading):

#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --partition=n2016
#SBATCH --time 0:01:00
#SBATCH --output vasp_test_distribution.32.stdout
#SBATCH --error vasp_test_distribution.32.stderr

mpirun --version

mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings hostname

I'm not sure if this is an mpirun mapping/binding bug or just a gap in the documentation, but given that this seems like an obvious layout for a mixed MPI/OpenMP job, I think it's worth making it clearer how to do it. It might even be useful to explicitly mention something like "mixed MPI/OpenMP" in the mpirun man page, to make it easier to find.
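
For context, the hybrid launch I'm ultimately after would look something like the following sketch (./my_hybrid_app is just a placeholder executable, with OMP_NUM_THREADS exported via mpirun's -x option):

mpirun -np 8 --map-by slot:PE=4 --bind-to core -x OMP_NUM_THREADS=4 ./my_hybrid_app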

@jjhursey
Member

The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected.

From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56 so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node?
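
For example, something along the lines of:

mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings --display-allocation hostname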

You are looking for output that looks something like:

======================   ALLOCATED NODES   ======================
	node7: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
	node8: flags=0x13 slots=4 max_slots=0 slots_inuse=0 state=UP
=================================================================

What version of Slurm are you running?

@bernstei
Author

bernstei commented Mar 15, 2022

Copying from the mailing list:

> The command line looks correct. I don't have a Slurm environment to test in, but testing locally with 2 machines worked as expected.

I wouldn't be surprised if it is indeed an interaction with slurm. The only evidence I have that OpenMPI and slurm are talking to each other is that "mpirun exec" without "-np" works as expected, one MPI task per core on each node.

> From the error message, it looks like there may be only one CPU slot made available from Slurm to Open MPI on compute-3-56 so the runtime is warning about overloading that CPU by performing the mapping. Can you run with --display-allocation to see what the Open MPI runtime thinks it has available on each allocated node?

Trying to use 2 x 32 core nodes, the command is
mpirun -np 8 --map-by slot:PE=8 --bind-to core --report-bindings --display-allocation
and I get the following output

======================   ALLOCATED NODES   ======================
        compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
        compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-7-17
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

Note that if I switch to "--bind-to none", it runs, but all 8 MPI tasks are placed on the first node. I guess that's consistent with it complaining about not having enough cores - there aren't 8x8 = 64 on the first node. It looks like the "--map-by" refuses to consider more than 1 node.
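
For reference, that run was along the lines of the same command with binding disabled:

mpirun -np 8 --map-by slot:PE=8 --bind-to none --report-bindings --display-allocation hostname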

If I manually generate a rankfile (with all hosts and slots listed explicitly) it works, by the way. Here that'd be

rank 0=compute-7-17 slot=0-7
rank 1=compute-7-17 slot=8-15
rank 2=compute-7-17 slot=16-23
rank 3=compute-7-17 slot=24-31
rank 4=compute-7-18 slot=0-7
rank 5=compute-7-18 slot=8-15
rank 6=compute-7-18 slot=16-23
rank 7=compute-7-18 slot=24-31
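
which I then pass to mpirun with something like the following (my_rankfile and ./my_app are just placeholders):

mpirun -np 8 --rankfile my_rankfile --report-bindings ./my_app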

so I can work around the problem.

> What version of Slurm are you running?

19.05.7, which I know is rather old.

@jjhursey
Member

IIRC there was an issue with Slurm 19 where it was aggressively binding (per https://github.com/open-mpi/ompi/pull/6674/files). I'm curious whether setting SLURM_CPU_BIND=none in your default environment resolves the issue.

@bernstei
Author

bernstei commented Mar 15, 2022

If you mean doing something like env SLURM_CPU_BIND=none sbatch job.batch, that doesn't make a difference. Note that I checked man sbatch, and neither --cpu-bind nor SLURM_CPU_BIND is documented there (--cpu-bind is mentioned in the section on --mem-bind, but is not listed as an available option).

@jjhursey
Member

Try setting it in your .bashrc (or similar). We would need the orted to pick it up.
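
For example, a line like:

export SLURM_CPU_BIND=none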

From the allocation report it looks like we are seeing all of the slots.

======================   ALLOCATED NODES   ======================
        compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
        compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
=================================================================

Are you able to use ssh to move between nodes in the allocation? If so you might try running with -mca plm ^slurm which will use the ssh launcher instead of srun.
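
For example, reusing your earlier command, something like:

mpirun -np 8 --map-by slot:PE=8 --bind-to core --report-bindings -mca plm ^slurm hostname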

If those ideas do not work then someone with the Slurm environment would need to chime in to see what might be going on.

@bernstei
Author

> Try setting it in your .bashrc (or similar). We would need the orted to pick it up.

I tried that (.bashrc and .bash_profile, just in case), and it didn't make a difference.

> From the allocation report it looks like we are seeing all of the slots.
>
> ======================   ALLOCATED NODES   ======================
>         compute-7-17: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
>         compute-7-18: flags=0x11 slots=32 max_slots=0 slots_inuse=0 state=UP
> =================================================================

> Are you able to use ssh to move between nodes in the allocation? If so you might try running with -mca plm ^slurm which will use the ssh launcher instead of srun.

That gave a different error starting up

bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

> If those ideas do not work then someone with the Slurm environment would need to chime in to see what might be going on.

Can you think of anything else the --map-by option could be doing? mpirun can do the right thing if I give it an explicit rankfile. Is there any way to get verbose output from the mapping step?

@jjhursey
Member

The bash: orted: command not found seems to indicate that Open MPI is not in your path on the remote machine. Can you ssh to that node and do a which orted mpirun - you might need to adjust the PATH on that node.
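
For example (assuming compute-7-18 is the remote node in this allocation):

ssh compute-7-18 'which orted mpirun'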

You can also enable debugging output for the daemon launch by adding -mca plm_base_verbose 100 --debug-daemons -mca odls_base_verbose 100; that will show you the command that mpirun is using to launch the remote daemons, along with (hopefully) some tracing output from the daemons as they try to launch the processes.
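
For example, something like:

mpirun -np 8 --map-by slot:PE=8 --bind-to core -mca plm_base_verbose 100 --debug-daemons -mca odls_base_verbose 100 --report-bindings hostname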

The only thing I can think of at this point is that the daemon on the remote side is being restricted by Slurm somehow when we try to do the binding.

@bernstei
Author

> The only thing I can think of at this point is that the daemon on the remote side is being restricted by Slurm somehow when we try to do the binding.

I'll try, but if that were the case, why would it be any different with the explicit rankfile? Presumably it'd still be starting the remote daemon the same way.

@bernstei
Author

When I ssh from the job's main execution node to the other allocated nodes, the PATH in that ssh session appears to be just /usr/bin and /usr/local/bin, which is much shorter than the PATH the job script sees when it's running on the job's head execution node. I'm not sure how ssh decides what environment to export and/or what shell to start, or how to control that for the ssh processes mpirun spawns.
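
One thing I may try, though I'm not sure it applies here, is pointing mpirun at the Open MPI install tree explicitly so the remote orted can be found without relying on PATH, e.g. (with /path/to/openmpi as a placeholder for wherever Open MPI is actually installed):

mpirun --prefix /path/to/openmpi -np 8 --map-by slot:PE=8 --bind-to core -mca plm ^slurm --report-bindings hostname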

@bernstei
Author

bernstei commented Oct 11, 2022 via email
