Likwid Mpirun

likwid-mpirun: enable simple pinning for MPI and hybrid MPI/threaded applications

Introduction

Pinning to dedicated compute resources is important for pure MPI applications and even more so for hybrid MPI/threaded applications. While all major MPI implementations include their own mechanisms for pinning, likwid-mpirun provides a simple and portable solution based on the powerful capabilities of likwid-pin. It is still experimental at the moment, but it can be adapted to any MPI and OpenMP combination with the help of a tuning application in the test directory of LIKWID. likwid-mpirun works in conjunction with PBS, LoadLeveler and SLURM. The tested compilers are the Intel C/C++ compiler and GCC, and the tested MPI implementations are Intel MPI and OpenMPI. Support for MVAPICH is untested.

Usage

As usual you can get a help message with

$ likwid-mpirun -h

You always have to specify the total number of MPI processes with the -np NUMPROC option. Two cases are distinguished: pure MPI and hybrid applications.

Pure MPI:

$ likwid-mpirun -np 16 ./a.out

This will start 16 processes; the number of processes per compute node is calculated from the PBS/LoadLeveler/SLURM node file. If two hosts are given, eight processes per node are pinned to cores/SMT threads. The pinning is implemented with the likwid-pin node domain.

Pure MPI with explicit pinning:

$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out

A single option, -nperdomain, covers all cases here. Its argument consists of a domain character, as known from the other LIKWID applications, and the number of processes per domain, separated by a colon. The above example will start two processes per socket up to a total of 16 processes and will pin the processes with likwid-pin.

Domains can be:

  • N - for node
  • S - for socket
  • C - for last level shared cache
  • M - for NUMA domain (interesting e.g. for AMD Magny Cours)
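
For example, to start one process per shared last-level cache group, the following sketch could be used (it assumes a machine where the last level cache is shared by a subset of cores and reuses the ./a.out placeholder from the examples above):

$ likwid-mpirun -np 8 -nperdomain C:1 ./a.out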

For pinning on Magny Cours the following can be useful:

$ likwid-mpirun -np 16 -nperdomain M:2 ./a.out

This will start two processes per NUMA domain. On a two-socket AMD Magny Cours system this results in eight processes per node and two nodes in total for this run.

For debugging use the debug option:

$ likwid-mpirun -debug -np 16 -nperdomain M:2 ./a.out

This will output all commands that would be executed.

Pinning of hybrid applications:

$ likwid-mpirun  -np 16 -pin S0:0,1_S1:0,1 ./a.out

Hybrid pinning has only one option covering all possibilities, -pin. The argument string consists of valid likwid-pin expressions separated by underscores. The number of separated expressions denotes the number of processes started per node. The above example will start two processes per node. The first process and its two threads will be pinned to the first socket (S0), cores 0 and 1; the second process and its threads will be pinned to the second socket (S1), cores 0 and 1. Consequently, the above statement requires 8 hosts to run.
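
As a further sketch (assuming nodes with at least six cores per socket; ./a.out is again a placeholder):

$ likwid-mpirun -np 4 -pin S0:0-5_S1:0-5 ./a.out

This starts four MPI processes, two per node, each running six threads pinned to cores 0-5 of its socket, and therefore needs two hosts.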

The main pinning complexity is that the OpenMP as well as the MPI implementation may start their own threads for management purposes. These threads need to be skipped, and their position among the started threads has to be determined in advance. For the tested MPI+compiler combinations, the skip masks are integrated into likwid-mpirun.
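
If the built-in skip mask does not match your combination, it can be overridden with the -s option listed below. The mask in this sketch is only an illustration; the correct value depends on the MPI and OpenMP implementations in use:

$ likwid-mpirun -np 2 -pin S0:0-3_S1:0-3 -s 0x1 ./a.out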

At the moment all pinning uses block distribution; round-robin variants for node-wise and global distribution are planned.

Options

-h, --help		 Help message
-v, --version		 Version information
-d, --debug		 Debugging output
-n/-np <count>		 Set the number of processes
-nperdomain <domain>	 Set the number of processes per node by giving an affinity domain and count
-pin <list>		 Specify pinning of threads. CPU expressions like likwid-pin separated with '_'
-d, --dist <count>(:<order>) Specify distance between MPI processes. Orders can be 'close' or 'spread'. Default is 'close'.
-t, -tpp <count>		 Set the number of threads for each process
-s, --skip <hex>	 Bitmask with threads to skip
-mpi <id>		 Specify which MPI should be used. Possible values: openmpi, intelmpi, slurm and mvapich2
			 If not set, module system is checked
-omp <id>		 Specify which OpenMP should be used. Possible values: gnu and intel
			 Only required for statically linked executables.
-hostfile                Use custom hostfile instead of searching the environment
-g/-group <perf>	 Set a likwid-perfctr conform event set for measuring on nodes
-m/-marker               Activate marker API mode
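
A hypothetical combination of these options (the counts are placeholders; the MPI implementation is picked from the module system because -mpi is not given): start four MPI processes with six threads per process, distributed with the default 'close' order:

$ likwid-mpirun -np 4 -t 6 ./a.out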

MPI not recognized

likwid-mpirun checks for some known MPI implementations (OpenMPI, Intel MPI and MVAPICH2) in the file system and the module system. It searches for executables like mpiexec in the path given by one of the environment variables MPIHOME, MPI_ROOT or MPI_BASE. If it does not find one, try setting the MPI type on the command line with -mpi [openmpi, intelmpi, mvapich2 or slurm].
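
For example, to force the use of Intel MPI (intelmpi is one of the identifiers listed above; the process counts are placeholders):

$ likwid-mpirun -mpi intelmpi -np 8 -nperdomain S:1 ./a.out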

Hostfile format

If you are running in a batch job environment that is supported by likwid-mpirun, the hosts are read from the batch system. If you run interactively or in an unsupported batch job environment, you have to provide a valid hostfile to likwid-mpirun. The syntax is very simple: list a hostname as many times as the host has slots.

localhost
localhost
localhost
host1
host2
host2

There are three slots on localhost, one slot on host1 and two slots on host2.
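
Assuming the listing above is stored in a file named hostfile and that -hostfile takes the file name as its argument, the six slots could be used like this:

$ likwid-mpirun -hostfile ./hostfile -np 6 ./a.out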

Performance measurements of MPI and hybrid applications

Besides the correct pinning of MPI processes and their threads, the application execution can be measured using likwid-perfctr. When a performance group or custom event set is given on the command line, the call of likwid-pin is substituted with likwid-perfctr. Currently, you can perform end-to-end measurements or measure instrumented code regions using the LIKWID Marker API.

Measure the double-precision floating-point operations used by all participating systems running a hybrid application with one MPI process per socket and 10 threads per MPI process:

$ likwid-mpirun -pin S0:0-9_S1:0-9 -g FLOPS_DP ./a.out

Measure the energy used by all participating systems running one process per socket:

$ likwid-mpirun -nperdomain S:1 -g ENERGY ./a.out
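
To restrict the measurement to instrumented code regions, activate the Marker API mode with -m (this sketch assumes ./a.out contains LIKWID Marker API calls; the process count is a placeholder):

$ likwid-mpirun -np 4 -nperdomain S:1 -g FLOPS_DP -m ./a.out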

likwid-mpirun is intelligent enough to measure socket-wide performance counters on only one CPU per socket; the other processes skip reading those hardware registers and read only the core-local performance counters.

When measuring is activated, overloading of the hosts is not allowed: multiple processes would read the same hardware performance counters, so the final results would no longer be valid. There are plans to substitute likwid-perfctr with likwid-pin for the overloaded processes.

Using likwid-mpirun with SLURM job scheduler

likwid-mpirun is able to run applications through SLURM.

$ salloc -N X
$ likwid-mpirun -np 2 ./a.out

likwid-mpirun recognizes the SLURM environment and calls srun instead of mpiexec or mpirun. You can see the srun command when using the -d command line switch. Some MPI implementations require special parameters and there is currently no way to add custom options to srun. One common switch is --mpi=pmi2 (at least on our cluster). You can either change the Lua code (likwid-4.3.3: cp $(which likwid-mpirun) .; vi -n 592 likwid-mpirun; ./likwid-mpirun ...) or set the environment variable SLURM_MPI_TYPE=pmi2 before running likwid-mpirun.
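
A sketch of the environment variable variant (node and process counts are placeholders; whether pmi2 is the right plugin depends on the SLURM and MPI configuration of your site):

$ salloc -N 2
$ export SLURM_MPI_TYPE=pmi2
$ likwid-mpirun -np 4 -nperdomain S:1 ./a.out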

In some rare cases it might be required to use the MPI implementation's specific way of starting applications (mpiexec, mpirun, ...). You can force this by using the -mpi command line switch.
