NWChem Shifter image fails with MPI errors #775

Closed
danielpert opened this issue May 3, 2023 · 18 comments

@danielpert

Describe the bug
When I try to run a geometry optimization followed by DFT frequency calculation, the program fails after the geometry optimization. The last thing in the .out file is "Multipole analysis of the density". The error message I am getting is:

MPICH ERROR [Rank 63] [job id 8408833.0] [Wed May  3 01:50:11 2023] [nid005166] - Abort(874629263) (rank 63 in comm 496): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(177).................: MPI_Recv(buf=0x7f53e3523c58, count=8, MPI_CHAR, src=96, tag=27624, comm=0x84000001, status=0x7fffeec663b0) failed
MPIR_Wait_impl(41).............:
MPID_Progress_wait(184)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Message too long - OK)

Describe settings used
I am using these environment variables:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export MPICH_GNI_MAX_EAGER_MSG_SIZE=131026
export MPICH_GNI_NUM_BUFS=80
export MPICH_GNI_NDREG_MAXSIZE=16777216
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=65536
export COMEX_MAX_NB_OUTSTANDING=6

At first I got this error after 3 minutes:

[191] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3337: _put_handler: Assertion `reg_entry' failed[191] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 191] [job id 8235628.0] [Fri Apr 28 12:44:46 2023] [nid005178] - Abort(-1) (rank 191 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 191

srun: error: nid005178: tasks 160,162,168,178,186: Exited with exit code 255
srun: Terminating StepId=8235628.0
[127] header operation not recognized: -431467067
[127] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[127] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 127] [job id 8235628.0] [Fri Apr 28 12:44:47 2023] [nid005146] - Abort(-1) (rank 127 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 127

With help from @lastephey, I added the environment variables below, which allowed the geometry optimization to run, but I then got the error described above when the vibrational frequency calculation started:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000

Report what operating system and distribution you are using.
SUSE Linux Enterprise Server 15 SP4

Attach log files
files.zip contains my submission script, nwchem input, starting geometry, and stdout/stderr

  • stdout/stderr of the NWChem execution
  • complete makefile log
  • $NWCHEM_TOP/src/tools/build/config.log
  • $NWCHEM_TOP/src/tools/build/comex/config.log
  • I could not find these config.log files, but I am using the Docker image ghcr.io/nwchemgit/nwchem-702.mpipr.nersc:latest
  • debugging stack

To Reproduce

  1. Steps to reproduce the behavior:
    Run NWChem using the attached input and environment variables with the Docker image
  2. Attach all the input files required to run.

Expected behavior
I expected the program to complete and calculate the energy and frequencies.

Additional context
I am running this on the Perlmutter cluster at the National Energy Research Scientific Computing Center (NERSC).

@lastephey

Thanks @danielpert. That new OFI error message is interesting. It looks like others have encountered it on Perlmutter and on Crusher at OLCF. On Perlmutter they suggested two fixes:

export MPICH_COLL_SYNC=MPI_Bcast

or

export FI_CXI_DEFAULT_CQ_SIZE=71680
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_UNIVERSE_SIZE=4096

Based on this comment it sounds like the first method was more reliable. Would you be willing to test?
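
For concreteness, a minimal sketch of where these would go in the submission script, before the srun line. The launch line below is a placeholder modeled on a typical shifter invocation, and input.nw stands in for your actual input file; it is not the exact line from the attached script:

# Option 1: synchronize collectives with an explicit Bcast
export MPICH_COLL_SYNC=MPI_Bcast

# Option 2 (alternative): enlarge the libfabric CXI completion queue and request buffers
# export FI_CXI_DEFAULT_CQ_SIZE=71680
# export FI_CXI_REQ_BUF_SIZE=12582912
# export FI_UNIVERSE_SIZE=4096

srun -n $SLURM_NTASKS shifter --module=mpich nwchem input.nw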

Relevant issues:
Crusher
Perlmutter

@jeffhammond
Collaborator

It would surprise me a lot if Bcast synchronization mattered to NWChem; it isn't used on the critical path anywhere in the code I've read.

With MPI-PR, you'll want to look at settings that impact send-receive flow control.
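
For illustration, the kinds of settings meant here are the eager/rendezvous thresholds and the cap on outstanding non-blocking operations. The values below are only examples, taken from settings that appear elsewhere in this thread, not recommended defaults:

# Flow-control knobs relevant to the ComEx MPI-PR port (illustrative values only)
export COMEX_MAX_NB_OUTSTANDING=6     # limit on outstanding non-blocking operations
export COMEX_EAGER_THRESHOLD=16384    # message size at which ComEx switches from eager to rendezvous
export FI_CXI_RDZV_THRESHOLD=16384    # matching libfabric CXI rendezvous threshold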

@lastephey

Thanks. I don't have enough knowledge to know whether

  1. this kind of behavior suggests that we just need to find the correct settings (and if so, I'd appreciate any pointers), or
  2. it could reflect a problem with our network. If it's the latter, it would be helpful to know as soon as possible so we can engage with our vendor.

@danielpert
Author

Sorry for the delay; the job waited a long time in the queue and then Perlmutter was down for a bit. I tested with export MPICH_COLL_SYNC=MPI_Bcast and it failed with one of the same errors I saw before:

[63] header operation not recognized: -212919191
[63] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[63] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 63] [job id 8441431.0] [Fri May  5 09:40:55 2023] [nid005664] - Abort(-1) (rank 63 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 63

I will also test with the other method.

@edoapra
Collaborator

edoapra commented May 8, 2023

@danielpert
Could you try the following Slurm script that uses an NWChem 7.2.0 Shifter image (nacl16_1co.nw is the input file name in this example)?

#!/bin/bash
#SBATCH -C cpu
#SBATCH -t 0:29:00
#SBATCH -q debug
#SBATCH -N 8
#SBATCH -A XXXX
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-node=64
#SBATCH -J nacl16_1co
#SBATCH -o nacl16_1co.%j.out
#SBATCH -e nacl16_1co.%j.out
#SBATCH --image=ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:20230203_160345
echo image nwchemgit/nwchem-dev 20230203_160345
module purge
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
export COMEX_MAX_NB_OUTSTANDING=6
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=16384
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_OFLOW_BUF_COUNT=6
export MPICH_SMP_SINGLE_COPY_MODE=CMA
srun -N $SLURM_NNODES --cpu-bind=cores shifter --module=mpich nwchem nacl16_1co.nw

@danielpert
Author

I tried that. The job did not fail, but it just stalled and did nothing until it hit the wall time. I got this warning:

PE 191: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...

I set FI_CXI_DEFAULT_CQ_SIZE=71680 but got the same issue.

@danielpert
Author

Should I try increasing it further, to 143360?
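
For reference, that would simply double the previously tried value in the batch script:

export FI_CXI_DEFAULT_CQ_SIZE=143360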

@danielpert
Author

danielpert commented May 9, 2023

I also got this message:

Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.[csh|sh]'.

I can try adding source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh to my script after module purge. This seems to work without any warnings when I run it in the terminal. I'm not sure if this is the issue, though.
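
Concretely, the top of the batch script would then look roughly like this (module list taken from the script posted earlier in this thread; the cpe version path is the one from the warning above):

module purge
source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list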

Update: I am still getting the same issue.

@edoapra edoapra changed the title NWChem docker image fails with MPI errors NWChem Shifter image fails with MPI errors May 9, 2023
@edoapra
Collaborator

edoapra commented May 10, 2023

@danielpert I have a fix for the poorly parallelized code that was causing the error posted in #775 (comment)

This fix is applied to the image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

Could you please try the same Slurm batch script I posted in #775 (comment) with the new shifter image?

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

edoapra added a commit to edoapra/nwchem that referenced this issue May 10, 2023
@danielpert
Author

I cannot seem to use that image; when I submit the submission script I get this error:
sbatch: error: Failed to lookup image. Aborting.

@edoapra
Collaborator

edoapra commented May 10, 2023

Sorry about giving the wrong image name; I left off the final 1 of the tag. Here are the correct lines for the Slurm script:

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111

@danielpert
Author

Yes, when I try that image my job runs successfully!

@edoapra
Collaborator

edoapra commented May 10, 2023

> Yes, when I try that image my job runs successfully!

Thank you very much for this feedback. Let me do more testing on this change just to be sure it does not break any other functionality.

@edoapra
Collaborator

edoapra commented May 20, 2023

This fix is now present in the default NERSC Shifter images
ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest
ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest

@edoapra
Collaborator

edoapra commented May 23, 2023

The NERSC documentation for NWChem has been updated with the current Shifter instructions for Perlmutter:

https://docs.nersc.gov/applications/nwchem/#slurm-script-for-nwchem-shifter-image-on-perlmutter-cpus

@lastephey

Thanks @edoapra!

Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/

It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

@edoapra
Collaborator

edoapra commented May 23, 2023

> Thanks @edoapra!
>
> Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/
>
> It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

Is podman available for any user on Perlmutter at this point in time?

@lastephey

Yes, it's open to all users without any additional configuration required. Anyone can test today.
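
For anyone who wants to experiment, a rough, untested sketch of what an equivalent launch might look like under podman-hpc, based on the overview page linked above; the pull/run commands and the --mpi flag are assumptions that should be checked against the current NERSC docs:

# On a login node: pull the image (assumed to mirror shifterimg/docker pull)
podman-hpc pull ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest

# In the batch script: replace the shifter invocation; --mpi is assumed to
# play the role of shifter's --module=mpich
srun -N $SLURM_NNODES --cpu-bind=cores podman-hpc run --rm --mpi \
    ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest nwchem nacl16_1co.nw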

@edoapra edoapra closed this as completed Jun 13, 2023