MPI

This section deals with running MPI jobs on BlueCrystal. It covers the mpirun launcher used to execute parallel jobs and how it interacts with the queueing system. It is not an MPI programming tutorial, and it assumes you are already familiar with MPI terminology, e.g. what a rank is and how it relates to threads, processes, and compute nodes.

The last part of this section lists some useful documentation links.

MPI Implementations

When compiling MPI programs, you will need to choose an MPI implementation. The most common choices are Open MPI, MPICH, and Intel MPI. The first two are open-source libraries which you can install on your own machine, and all three are available on BlueCrystal. While feature-wise they should be largely equivalent, some options may differ in name and performance might vary. You are encouraged to explore all the available options and discover any differences on your own.

BCp3

On Phase 3, three MPI implementations are available as modules:

  • The latest version of Intel MPI is part of the same module as the compiler (languages/intel-compiler-16-u2), so you don't need to load an additional module.
    • However, if you need an older version, these are available as separate modules:
    intel-mpi/64/4.0.3/008
    intel-mpi/64/4.1.0/024
    
  • Open MPI built with the GNU and Intel compilers:
    openmpi/gcc/64/1.6.4
    openmpi/gcc/64/1.6.5
    openmpi/gcc/64/2.1.1
    openmpi/intel/64/1.6.5
    
  • MPICH built with GCC and Open64:
    mpich/ge/gcc/64/1.2.7
    mpich/ge/open64/64/1.2.7
    mpich2/ge/gcc/64/1.4.1p1
    

If you require features that are only available in newer versions of Open MPI or MPICH, you can build from source.

BCp4

On Phase 4, use Intel MPI. It is part of the compiler modules, e.g. languages/intel/2018-u3.

Compiling MPI programs

The choice of MPI library is (mostly) independent of the compiler choice. Therefore, you should be able to use (for example) the GNU compiler with any of the MPI libraries listed above. In practice, there are sometimes issues when you use a proprietary MPI implementation with compilers from other vendors. Although unlikely, this means that you may encounter issues when using, for example, Intel MPI with GCC. However, this shouldn't deter you from exploring your options!

MPI implementations generally provide a compiler wrapper, which is a command that calls the underlying compiler with the parameters necessary for the MPI code. The advantage of using this wrapper is that you don't need to manually pass the compiler and linker flags for the library, and you can change to a different implementation without changing your build command.

For the open-source libraries, commands are usually named as follows:

| Language | Command |
|----------|---------|
| C        | mpicc   |
| C++      | mpicxx  |
| Fortran  | mpif90  |

With the Intel MPI module loaded, the commands above invoke the GNU compilers with Intel MPI. To use Intel MPI together with the Intel compilers, use the wrappers named by prefixing the usual Intel compiler command with mpi:

| Language | Command  |
|----------|----------|
| C        | mpiicc   |
| C++      | mpiicpc  |
| Fortran  | mpiifort |

If you are unsure which command to use, first load the modules for your desired compiler and MPI library, then use your shell's autocomplete to list the available options:

$ mpi<TAB><TAB>
mpicc           mpiicc          mpiicpc         mpigcc
mpigxx          mpicxx          mpiexec         mpivars.sh
mpif77          mpif90          mpiifort

Then run the wrapper with -v to check which compiler and library it will use:

$ mpicc -v
mpigcc for the Intel(R) MPI Library 5.1.3 for Linux*
Copyright(C) 2003-2015, Intel Corporation.  All rights reserved.
...
gcc version 7.1.0 (GCC)

$ mpiicc -v
mpiicc for the Intel(R) MPI Library 5.1.3 for Linux*
Copyright(C) 2003-2015, Intel Corporation.  All rights reserved.
icc version 16.0.2 (gcc version 7.1.0 compatibility)

Once you know which wrapper to use, just replace the regular compiler command:

# Without MPI
$ gcc -o test-nompi test.c

# With MPI
$ mpicc -o test-mpi test.c
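
To see the full command a wrapper would run without compiling anything, most wrappers offer a "show" option. The sketch below is an illustration rather than part of the course examples: it assumes the usual spellings, -show for MPICH and Intel MPI and --showme for Open MPI, and the function name show_wrapper is made up for this example.

```shell
# Print the underlying compiler command line that an MPI wrapper would run.
# show_wrapper is a hypothetical helper name, not part of any MPI library.
show_wrapper() {
    wrapper="${1:-mpicc}"
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper not found: load a compiler and an MPI module first"
        return 0
    fi
    # MPICH and Intel MPI understand -show; Open MPI understands --showme.
    "$wrapper" -show 2>/dev/null || "$wrapper" --showme 2>/dev/null ||
        echo "$wrapper does not support -show or --showme"
}

show_wrapper mpicc
```

The output typically shows the real compiler followed by the include and library flags for the loaded MPI module, which is useful when a build system needs those flags explicitly.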

Running MPI jobs

MPI applications generally run several processes, which need to be orchestrated, e.g. started, synchronised, and terminated. This is done by an MPI launcher, usually called mpirun or mpiexec (synonyms), which creates as many instances of your application as you request and manages the processes over their lifetime.

The simplest MPI launch specifies only the number of ranks (-np) and the command to run:

$ mpirun -np 4 ./test-mpi

However, there are many more options, and you should read about them in mpirun --help and in the online documentation: Open MPI, MPICH, Intel MPI. Make sure that you look at the right documentation for the version of the library you are using, and keep in mind that some options may differ between implementations.

Important: To run more than a single MPI rank, you need to use a launcher. If you run your binary directly instead, only a single instance will start, so any MPI code will be redundant.

SLURM and BCp4

SLURM provides its own parallel launcher, called srun. The degree to which this integrates with the available hardware and software varies, but in general you can replace mpirun with srun and expect everything to work fine. Some SLURM systems, e.g. CS-series Crays, don't provide mpirun and require you to use srun.

The advantage of using srun is that it automatically reads your run configuration from your job script, so you usually don't need to specify any additional parameters. The following example script, which only uses sbatch arguments and passes just the binary to srun, runs 8 MPI processes evenly split between two nodes:

#SBATCH --nodes 2
#SBATCH --ntasks-per-node 4

srun ./test-mpi

On BCp4, you can use both mpirun and srun. We recommend using srun, because your parallel configuration will be automatically read from your job script, so you won't have to repeat it in mpirun arguments.

Important: If you decide to use srun, you will need to add export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so to your job script before your srun line(s). There is a configuration issue which causes Intel MPI to crash if it is used under srun without setting this environment variable first.
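
Putting these pieces together, a BCp4 job script might look like the sketch below. It is an illustration only: the job name and time limit are placeholder values, and the guard around srun is an addition so that the script prints a message instead of failing when it is run by hand outside a SLURM job.

```shell
#!/bin/bash
#SBATCH --job-name mpi-test        # placeholder name
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 4
#SBATCH --time 00:10:00            # placeholder time limit

# Load the same modules you compiled with here, e.g. languages/intel/2018-u3.

# Work around the Intel MPI crash under srun on BCp4.
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so

# SLURM sets SLURM_JOB_ID inside a job; it is unset when run by hand.
if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    # No -np needed: srun reads the layout from the #SBATCH lines above.
    srun ./test-mpi
else
    echo "Not inside a SLURM job: submit this script with sbatch"
fi
```

Submitted with sbatch, this would start 2 x 4 = 8 ranks of ./test-mpi, the same layout as the shorter example above.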

Tagging output

One useful setting for debugging is tagging each line of output with the number of the rank that produced it. Since the order in which print statements will be executed across several ranks is not fixed, this will help you identify which rank printed each line:

# Without tags
$ mpirun -np 4 ./hi
Hello from rank 2, on node31-031.
Hello from rank 0, on node31-031.
Hello from rank 3, on node31-031.
Hello from rank 1, on node31-031.

# With tags
$ mpirun -np 4 --tag-output ./hi
[1,2]<stdout>:Hello from rank 2, on node31-031.
[1,0]<stdout>:Hello from rank 0, on node31-031.
[1,3]<stdout>:Hello from rank 3, on node31-031.
[1,1]<stdout>:Hello from rank 1, on node31-031.

The --tag-output option works with Open MPI; with Intel MPI, use -l. Other libraries likely have similar options, but they might have different names.
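
Tagged output is also easy to post-process. As a small sketch (not part of the original examples), the tagged lines shown above can be grouped by rank with sort, using the comma inside the tag as the field separator. The -s (stable) flag, available in GNU and BSD sort, keeps each rank's lines in the order they were printed. The function name sort_by_rank is made up for this example.

```shell
# Sort Open MPI --tag-output lines of the form "[job,rank]<stdout>:text" by rank.
# sort_by_rank is a hypothetical helper name.
sort_by_rank() {
    # -t,     split fields at commas (the first comma is inside the tag)
    # -k2,2n  compare the second field numerically (it starts with the rank)
    # -s      stable sort: lines from the same rank keep their original order
    sort -s -t, -k2,2n
}

# Sample input copied from the tagged run above.
printf '%s\n' \
    '[1,2]<stdout>:Hello from rank 2, on node31-031.' \
    '[1,0]<stdout>:Hello from rank 0, on node31-031.' \
    '[1,3]<stdout>:Hello from rank 3, on node31-031.' \
    '[1,1]<stdout>:Hello from rank 1, on node31-031.' | sort_by_rank
```

With a real run, the same function could be used as mpirun -np 4 --tag-output ./hi | sort_by_rank. Only stdout passes through the pipe, so anything the ranks write to stderr is not sorted.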

Binding processes

By default, mpirun will allow you to launch one rank per available CPU core. However, sometimes you may want to launch fewer ranks than you have cores, e.g. because each rank runs several threads internally. At other times you may want more than one rank per core, particularly if your processor supports simultaneous multithreading and you want to run a rank per hardware thread. Assigning processes (or threads) to hardware resources is commonly referred to as binding.

Most MPI implementations offer some support for binding ranks as part of the launcher. For example, you can restrict each rank to run on a specific core (as opposed to any core, which is the default):

# Without binding
$ mpirun -np 4 ./hi
Hello from rank 2, on node31-031. (core affinity = 0-15)
Hello from rank 0, on node31-031. (core affinity = 0-15)
Hello from rank 3, on node31-031. (core affinity = 0-15)
Hello from rank 1, on node31-031. (core affinity = 0-15)

# With binding
$ mpirun -np 4 -bind-to-core ./hi
Hello from rank 0, on node31-031. (core affinity = 0)
Hello from rank 1, on node31-031. (core affinity = 1)
Hello from rank 2, on node31-031. (core affinity = 2)
Hello from rank 3, on node31-031. (core affinity = 3)
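
If your program does not print its own affinity, you can check the effect of binding with a small script launched in place of the application. The sketch below is an illustration under stated assumptions: it reads Cpus_allowed_list from /proc/self/status, which is Linux-specific, and it looks for the rank in OMPI_COMM_WORLD_RANK (set by Open MPI), PMI_RANK (set by MPICH and Intel MPI), or SLURM_PROCID (set by srun). The function name show_affinity is made up for this example.

```shell
# Print the rank and the cores the current process is allowed to run on.
# show_affinity is a hypothetical helper name.
show_affinity() {
    # Take the rank from whichever launcher variable is set, if any.
    rank="${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-${SLURM_PROCID:-?}}}"
    if [ -r /proc/self/status ]; then
        # The awk child process inherits this shell's affinity mask.
        cores=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
    else
        cores="unknown"   # not Linux, or /proc is unavailable
    fi
    echo "Hello from rank $rank, on $(uname -n). (core affinity = $cores)"
}

show_affinity
```

Saved as an executable script (say show-affinity.sh, a hypothetical name), it could be launched as mpirun -np 4 -bind-to-core ./show-affinity.sh to compare the reported core lists with and without binding.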

Another example is splitting the total number of processes between several nodes:

# Without mapping (all on first node)
$ mpirun -np 4 ./hi
Hello from rank 0, on compute091.
Hello from rank 1, on compute091.
Hello from rank 2, on compute091.
Hello from rank 3, on compute091.

# With mapping (split across 2 nodes)
$ mpirun -np 4 -npernode 2 ./hi
Hello from rank 0, on compute091.
Hello from rank 1, on compute091.
Hello from rank 2, on compute092.
Hello from rank 3, on compute092.
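
The placement in the second run follows a simple block pattern: with -npernode 2, ranks fill the first node before moving on to the next, so rank r lands on node number r / 2 (integer division). The sketch below reproduces that arithmetic for the example above. It models only the placement seen in this output, not every mapping policy a launcher supports, and the variable names are illustrative.

```shell
# Model block placement: ranks fill one node before moving to the next.
np=4          # total ranks, as in "-np 4"
npernode=2    # ranks per node, as in "-npernode 2"

rank=0
while [ "$rank" -lt "$np" ]; do
    node=$(( rank / npernode ))   # integer division gives the node index
    echo "rank $rank -> node $node"
    rank=$(( rank + 1 ))
done

# Number of nodes needed, rounding up when np is not a multiple of npernode.
nodes=$(( (np + npernode - 1) / npernode ))
echo "nodes needed: $nodes"
```

For the values above this reports ranks 0 and 1 on node 0, ranks 2 and 3 on node 1, and two nodes in total, matching the compute091/compute092 split in the output.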

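There are many more options available, and they are all explained in the manuals. As above, the options may differ slightly depending on the implementation and version in use. For example, -bind-to-core and -npernode are Open MPI options; Open MPI 1.8 and later spell the former --bind-to core, and Intel MPI uses -ppn to set the number of ranks per node.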

MPI examples

We have provided a set of working MPI examples, ranging from a simple "Hello World" MPI program to an implementation of the "halo exchange" message passing pattern you need for the MPI assignment.

Further reference

Here are some handy links to MPI docs:

You can find some MPI programming tutorials on the MPICH guides page. The MPI standard spec is also available online.