mpirun args for multi-node run with multiple cores per MPI task #10071

Open
@bernstei

Description

I'm using OpenMPI 4.1.1 with SLURM (CentOS 7), and I can't figure out how to run with a total of n_mpi_tasks = nslots / cores_per_task MPI tasks while binding each MPI task to a contiguous set of cores_per_task cores. The documentation suggests that I need mpirun -np n_mpi_tasks --map-by slot:PE=cores_per_task --bind-to core. When I try this for a single-node job (16 slots, 4 cores per task, 4 tasks), it works fine: the bindings report shows 4 MPI tasks, the first bound to the first 4 physical cores, the second to the next 4, and so on.
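Concretely, the single-node invocation was along these lines (the executable isn't important here; hostname stands in for it, matching the multi-node test script below):

# Single-node case: 4 ranks x 4 cores each on a 16-core node
mpirun -np 4 --map-by slot:PE=4 --bind-to core --report-bindings hostname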

[compute-3-58.local:08478] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: [BB/BB/BB/BB/../../../..][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [../../../../BB/BB/BB/BB][../../../../../../../..]
[compute-3-58.local:08478] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/../../../..]
[compute-3-58.local:08478] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][../../../../BB/BB/BB/BB]

However, when I try on a multi-node job (2 x 16 core nodes, 8 tasks, 4 cores per task), it fails with an error:

--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     CORE
   Node:        compute-3-56
   #processes:  2
   #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------
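For reference, the allocation as SLURM reports it from inside the job can be inspected with standard SLURM environment variables (not part of my original script; shown only to spell out what I mean by a 2 x 16-core allocation):

# Check what SLURM actually allocated to the job
echo "nodes:          $SLURM_JOB_NODELIST"
echo "tasks per node: $SLURM_TASKS_PER_NODE"
echo "cpus per node:  $SLURM_JOB_CPUS_PER_NODE"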

This is my job script; the requested partition has nodes with 16 physical cores each (32 with hyperthreading):

#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --ntasks-per-core=1
#SBATCH --partition=n2016
#SBATCH --time 0:01:00
#SBATCH --output vasp_test_distribution.32.stdout
#SBATCH --error vasp_test_distribution.32.stderr

mpirun --version

mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings hostname

I'm not sure whether this is an mpirun mapping/binding bug or just a gap in the documentation, but given that this seems like an obvious layout for a mixed MPI/OpenMP job, I think it's worth making it clearer how to do it. It might even be useful to explicitly mention something like "mixed MPI/OpenMP" in the mpirun man page, to make this easier to find.
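For context, the layout I'm ultimately after is a hybrid launch along these lines (a sketch only: the executable name is a placeholder, and setting OMP_NUM_THREADS to match PE is simply how I intend to use it):

# Hybrid MPI/OpenMP sketch: 8 MPI ranks, 4 OpenMP threads per rank,
# each rank bound to a contiguous block of 4 physical cores.
# ./hybrid_app is a placeholder for the real executable.
export OMP_NUM_THREADS=4
mpirun -np 8 --map-by slot:PE=4 --bind-to core --report-bindings ./hybrid_app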
