NWChem Shifter image fails with MPI errors #775

Closed
danielpert opened this issue May 3, 2023 · 18 comments

@danielpert

Describe the bug
When I try to run a geometry optimization followed by DFT frequency calculation, the program fails after the geometry optimization. The last thing in the .out file is "Multipole analysis of the density". The error message I am getting is:

MPICH ERROR [Rank 63] [job id 8408833.0] [Wed May  3 01:50:11 2023] [nid005166] - Abort(874629263) (rank 63 in comm 496): Fatal error in PMPI_Recv: Other MPI error, error stack:
PMPI_Recv(177).................: MPI_Recv(buf=0x7f53e3523c58, count=8, MPI_CHAR, src=96, tag=27624, comm=0x84000001, status=0x7fffeec663b0) failed
MPIR_Wait_impl(41).............:
MPID_Progress_wait(184)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1062): OFI poll failed (ofi_events.c:1064:MPIDI_OFI_handle_cq_error:Message too long - OK)

Describe settings used
I am using these environment variables:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export MPICH_GNI_MAX_EAGER_MSG_SIZE=131026
export MPICH_GNI_NUM_BUFS=80
export MPICH_GNI_NDREG_MAXSIZE=16777216
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=65536
export COMEX_MAX_NB_OUTSTANDING=6

At first I got this error after 3 minutes:

[191] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3337: _put_handler: Assertion `reg_entry' failed[191] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 191] [job id 8235628.0] [Fri Apr 28 12:44:46 2023] [nid005178] - Abort(-1) (rank 191 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 191

srun: error: nid005178: tasks 160,162,168,178,186: Exited with exit code 255
srun: Terminating StepId=8235628.0
[127] header operation not recognized: -431467067
[127] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[127] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 127] [job id 8235628.0] [Fri Apr 28 12:44:47 2023] [nid005146] - Abort(-1) (rank 127 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 127

With help from @lastephey, I added the environment variables below, which allowed the geometry optimization to run, but I then got the error described above when the vibrational frequency calculation started:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000

Report what operating system and distribution you are using.
SUSE Linux Enterprise Server 15 SP4

Attach log files
files.zip contains my submission script, nwchem input, starting geometry, and stdout/stderr

  • stdout/stderr of the NWChem execution
  • complete makefile log
  • $NWCHEM_TOP/src/tools/build/config.log
  • $NWCHEM_TOP/src/tools/build/comex/config.log
  • I could not find these config.log files, but I am using the Docker image ghcr.io/nwchemgit/nwchem-702.mpipr.nersc:latest
  • debugging stack

To Reproduce

  1. Steps to reproduce the behavior:
    Run NWChem using the attached input and environment variables with the Docker image
  2. Attach all the input files required to run.

Expected behavior
I expected the program to complete and calculate the energy and frequencies.

Additional context
I am running this on the Perlmutter cluster at the National Energy Research Scientific Computing Center (NERSC).

@lastephey

Thanks @danielpert. That new OFI error message is interesting. It looks like others have encountered it on Perlmutter and on Crusher at OLCF. On Perlmutter they suggested two fixes:

export MPICH_COLL_SYNC=MPI_Bcast

or

export FI_CXI_DEFAULT_CQ_SIZE=71680
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_UNIVERSE_SIZE=4096

Based on this comment it sounds like the first method was more reliable. Would you be willing to test?
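
For concreteness, a minimal sketch of where these would go in the submission script, before the srun line. The launch line below is a placeholder modeled on a typical shifter invocation, and input.nw stands in for your actual input file; it is not the exact line from the attached script:

# Option 1: synchronize collectives with an explicit Bcast
export MPICH_COLL_SYNC=MPI_Bcast

# Option 2 (alternative): enlarge the libfabric CXI completion queue and request buffers
# export FI_CXI_DEFAULT_CQ_SIZE=71680
# export FI_CXI_REQ_BUF_SIZE=12582912
# export FI_UNIVERSE_SIZE=4096

srun -n $SLURM_NTASKS shifter --module=mpich nwchem input.nw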

Relevant issues:
Crusher
Perlmutter

@jeffhammond
Collaborator

It would surprise me a lot if Bcast synchronization mattered to NWChem; it isn't used on the critical path anywhere in the code I've read.

With MPI-PR, you'll want to look at settings that impact send-receive flow control.
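
For illustration, the kinds of settings meant here are the eager/rendezvous thresholds and the cap on outstanding non-blocking operations. The values below are only examples, taken from settings that appear elsewhere in this thread, not recommended defaults:

# Flow-control knobs relevant to the ComEx MPI-PR port (illustrative values only)
export COMEX_MAX_NB_OUTSTANDING=6     # limit on outstanding non-blocking operations
export COMEX_EAGER_THRESHOLD=16384    # message size at which ComEx switches from eager to rendezvous
export FI_CXI_RDZV_THRESHOLD=16384    # matching libfabric CXI rendezvous threshold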

@lastephey

Thanks. I don't have enough knowledge to know whether

  1. this kind of behavior suggests that we just need to find the correct settings (and if so, I'd appreciate any pointers), or
  2. it could reflect a problem with our network. If it's the latter, it would be helpful to know as soon as possible so we can engage with our vendor.

@danielpert
Author

Sorry for the delay; the job waited a long time in the queue and then Perlmutter was down for a bit. I tested with export MPICH_COLL_SYNC=MPI_Bcast and it failed with one of the same errors I saw before:

[63] header operation not recognized: -212919191
[63] ../../ga-5.8.1/comex/src-mpi-pr/comex.c:3277: _progress_server: Assertion `0' failed[63] Received an Error in Communication: (-1) comex_assert_fail
MPICH ERROR [Rank 63] [job id 8441431.0] [Fri May  5 09:40:55 2023] [nid005664] - Abort(-1) (rank 63 in comm 496): application called MPI_Abort(comm=0x84000001, -1) - process 63

I will also test with the other method.

@edoapra
Collaborator

edoapra commented May 8, 2023

@danielpert
Could you try the following Slurm script that uses an NWChem 7.2.0 Shifter image (nacl16_1co.nw is the input file name in this example)?

#!/bin/bash
#SBATCH -C cpu
#SBATCH -t 0:29:00
#SBATCH -q debug
#SBATCH -N 8
#SBATCH -A XXXX
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-node=64
#SBATCH -J nacl16_1co
#SBATCH -o nacl16_1co.%j.out
#SBATCH -e nacl16_1co.%j.out
#SBATCH --image=ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:20230203_160345
echo image nwchemgit/nwchem-dev 20230203_160345
module purge
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list
export OMP_NUM_THREADS=1
export OMP_PROC_BIND=true
export COMEX_MAX_NB_OUTSTANDING=6
export FI_CXI_RX_MATCH_MODE=hybrid
export COMEX_EAGER_THRESHOLD=16384
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_OFLOW_BUF_COUNT=6
export MPICH_SMP_SINGLE_COPY_MODE=CMA
srun -N $SLURM_NNODES --cpu-bind=cores shifter --module=mpich nwchem nacl16_1co.nw

@danielpert
Author

I tried that. The job did not fail, but it just stalled and did nothing until it hit the wall time. I got this warning:

PE 191: MPICH WARNING: OFI is failing to make progress on posting a receive. MPICH suspects a hang due to completion queue exhaustion. Setting environment variable FI_CXI_DEFAULT_CQ_SIZE to a higher number might circumvent this scenario. OFI retry continuing...

I set FI_CXI_DEFAULT_CQ_SIZE=71680 but got the same issue.

@danielpert
Author

Should I try increasing it further, to 143360?
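
For reference, that would simply double the previously tried value in the batch script:

export FI_CXI_DEFAULT_CQ_SIZE=143360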

@danielpert
Author

danielpert commented May 9, 2023

I also got this message:

Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.[csh|sh]'.

I can try adding source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh to my script after module purge. This seems to work without any warnings when I run it in the terminal. I'm not sure if this is the issue, though.
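
Concretely, the top of the batch script would then look roughly like this (module list taken from the script posted earlier in this thread; the cpe version path is the one from the warning above):

module purge
source /opt/cray/pe/cpe/23.03/restore_lmod_system_defaults.sh
module load PrgEnv-gnu
module load cudatoolkit
module load cray-pmi
module list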

Update: I am still getting the same issue.

@edoapra edoapra changed the title NWChem docker image fails with MPI errors NWChem Shifter image fails with MPI errors May 9, 2023
@edoapra
Collaborator

edoapra commented May 10, 2023

@danielpert I have a fix for the poorly parallelized code that was causing the error posted in #775 (comment)

This fix is applied to the image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

Could you please try the same Slurm batch script I posted in #775 (comment) with the new shifter image?

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_14411

edoapra added a commit to edoapra/nwchem that referenced this issue May 10, 2023
@danielpert
Author

I cannot seem to use that image; when I submit the submission script I get this error:
sbatch: error: Failed to lookup image. Aborting.

@edoapra
Collaborator

edoapra commented May 10, 2023

Sorry about giving the wrong image name; I left off the final 1 of the tag. Here are the correct lines for the Slurm script:

#SBATCH --image=ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111
echo image ghcr.io/edoapra/nwchem-721.nersc.mpich4.mpi-pr:20230509_144111

@danielpert
Author

Yes, when I try that image my job runs successfully!

@edoapra
Collaborator

edoapra commented May 10, 2023

> Yes, when I try that image my job runs successfully!

Thank you very much for this feedback. Let me do more testing on this change just to be sure it does not break any other functionality.

@edoapra
Collaborator

edoapra commented May 20, 2023

This fix is now present in the default NERSC Shifter images
ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest
ghcr.io/nwchemgit/nwchem-dev.nersc.mpich4.mpi-pr:latest

@edoapra
Collaborator

edoapra commented May 23, 2023

The NERSC documentation for NWChem has been updated with the current Shifter instructions for Perlmutter:

https://docs.nersc.gov/applications/nwchem/#slurm-script-for-nwchem-shifter-image-on-perlmutter-cpus

@lastephey

Thanks @edoapra!

Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/

It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

@edoapra
Collaborator

edoapra commented May 23, 2023

> Thanks @edoapra!
>
> Just a heads up that we're working on a new container runtime called podman-hpc: https://docs.nersc.gov/development/podman-hpc/overview/
>
> It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required).

Is podman available for any user on Perlmutter at this point in time?

@lastephey

Yes, it's open to all users without any additional configuration required. Anyone can test today.
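
For anyone who wants to experiment, a rough, untested sketch of what an equivalent launch might look like under podman-hpc, based on the overview page linked above; the pull/run commands and the --mpi flag are assumptions that should be checked against the current NERSC docs:

# On a login node: pull the image (assumed to mirror shifterimg/docker pull)
podman-hpc pull ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest

# In the batch script: replace the shifter invocation; --mpi is assumed to
# play the role of shifter's --module=mpich
srun -N $SLURM_NNODES --cpu-bind=cores podman-hpc run --rm --mpi \
    ghcr.io/nwchemgit/nwchem-720.nersc.mpich4.mpi-pr:latest nwchem nacl16_1co.nw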

@edoapra edoapra closed this as completed Jun 13, 2023