Using MPICH on Aurora@ALCF
This page describes how to build and use MPICH on the 'Aurora' machine at Argonne. Aurora uses Intel CPUs and GPUs with the Cray Slingshot interconnect. Support for Slingshot is provided via the system libfabric.
As of 03/01/2024, it is best to build the MPICH git main branch for use on Aurora.
./autogen.sh
./configure --prefix=/home/raffenet/proj/mpich/i --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx
make -j16 install
A correctly configured MPICH build should print the following message in the configure output:
*****************************************************
***
*** device : ch4:ofi
*** shm feature : auto
*** gpu support : ZE
***
*****************************************************
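One way to check for this banner without scrolling through the output is to save the configure output to a file and search it. The sketch below assumes the output was captured with something like `./configure ... 2>&1 | tee configure.out`; the function name `check_mpich_configure` and the file name `configure.out` are hypothetical and not part of MPICH.

```shell
# Sketch: succeed only if a saved configure log reports the ch4:ofi device
# together with ZE GPU support. The function name is a hypothetical helper.
check_mpich_configure() {
    log="$1"
    grep -Eq 'device[[:space:]]*:[[:space:]]*ch4:ofi' "$log" &&
    grep -Eq 'gpu support[[:space:]]*:[[:space:]]*ZE' "$log"
}

# Example use, after capturing the configure output in configure.out:
#   check_mpich_configure configure.out && echo "configuration looks right"
```

The two patterns tolerate any amount of whitespace around the colon, since the banner's column alignment may differ between MPICH versions.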
On Aurora, MPICH is configured without a process manager (--with-pm=no). Instead, we use the mpiexec that is available as part of Cray's PALS package. Setting PALS_PMI=pmix in the execution environment is required for processes to properly query the runtime for job information. When launching processes on a single host, MPIR_CVAR_SINGLE_HOST_ENABLED=0 must also be set in the environment.
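A minimal sketch of a single-host launch with these settings might look like the following. The binary name `./hello_mpi` and the rank count are assumptions for illustration; the guard only prints the command when mpiexec or the binary is not available, so the snippet is safe to try off the machine.

```shell
# Required so processes can query the PALS runtime for job information.
export PALS_PMI=pmix
# Only needed when all processes are launched on a single host.
export MPIR_CVAR_SINGLE_HOST_ENABLED=0

# ./hello_mpi is a hypothetical MPI program built against the MPICH install.
if command -v mpiexec >/dev/null 2>&1 && [ -x ./hello_mpi ]; then
    mpiexec -n 2 ./hello_mpi
else
    echo "would run: mpiexec -n 2 ./hello_mpi"
fi
```

For multi-node runs, MPIR_CVAR_SINGLE_HOST_ENABLED can be left unset; only PALS_PMI=pmix is described above as always required.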
- gdb: on Aurora the debugger is not called gdb; it is gdb-oneapi
- gdb4hpc: a pretty nice parallel wrapper around gdb, though it will probably time out if you have more than a few dozen processes
- It's possible (maybe only with large counts) to trigger an unaligned memory error in the ZE code:
.../modules/yaksa/src/backend/ze/hooks/yaksuri_zei_type_hooks.c:101:30: runtime error: load of misaligned address 0xfbc50007672a3559 for type 'struct _ze_module_handle_t *', which requires 8 byte alignment
0xfbc50007672a3559: note: pointer points here
<memory cannot be printed>
Segmentation fault
If you don't need GPU processing, you can work around this by configuring with --without-ze.
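As a sketch, the workaround amounts to the configure line from the build instructions with --with-ze swapped for --without-ze. The PREFIX default below is a hypothetical install location, not a path from this page; the remaining options are copied from the build instructions and may need adjusting as the system software changes.

```shell
# Hypothetical install prefix; override by setting PREFIX before running.
PREFIX="${PREFIX:-$HOME/mpich-install}"

# Same options as the GPU-enabled build, except --without-ze replaces --with-ze.
set -- --prefix="$PREFIX" --with-device=ch4:ofi \
    --with-libfabric=/opt/cray/libfabric/1.15.2.0 --without-ze \
    --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx

# Run configure only inside an MPICH source tree; otherwise just show the command.
if [ -x ./configure ]; then
    ./configure "$@"
else
    echo "./configure $*"
fi
```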