NWChem Shifter image fails with MPI errors #775
Comments
Thanks @danielpert. That new OFI error message is interesting. It looks like others have encountered it on Perlmutter/Crusher at OLCF. On Perlmutter they suggested two fixes:
or
Based on this comment it sounds like the first method was more reliable. Would you be willing to test? Relevant issues: |
It would surprise me a lot if Bcast synchronization mattered to NWChem. It doesn't use it on the critical path anywhere I've read. With MPI-PR, you'll want to look at settings that impact send-receive flow control. |
Thanks. I don't have enough knowledge to know if
|
Sorry for the delay. The job waited in the queue for a long time, and then Perlmutter was also down for a bit. I tested with
I will also test with the other method |
@danielpert
|
I tried that. The job did not fail, but it stopped making progress and did nothing until it hit the wall time. I got this warning:
I set FI_CXI_DEFAULT_CQ_SIZE=71680 but got the same issue. |
Should I try increasing it further, to 143360? |
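For reference, the doubling proposed above (71680 to 143360) can be written directly in the batch script. This is only a minimal sketch assuming a bash job script; the starting value is the one quoted in this thread, and whether a larger completion queue actually resolves the hang is not established here.

```shell
# Value already tried in this thread.
FI_CXI_DEFAULT_CQ_SIZE=71680
# Double it for the next attempt and export it for the launched processes.
export FI_CXI_DEFAULT_CQ_SIZE=$(( FI_CXI_DEFAULT_CQ_SIZE * 2 ))
# Record the value in the job log so runs can be compared later.
echo "FI_CXI_DEFAULT_CQ_SIZE=${FI_CXI_DEFAULT_CQ_SIZE}"
```

The export has to appear before the line that launches NWChem so that the ranks inherit it.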
I also got this message:
I can try adding Update: I am still getting the same issue |
@danielpert I have a fix for the poorly parallelized code that was causing the error posted in #775 (comment). This fix is applied to the image Could you please try the same Slurm batch script I posted in #775 (comment) with the new Shifter image?
|
I cannot seem to use that image. When I submit the submission script, I get this error |
Sorry about giving the wrong image name. I missed one last
|
Yes, when I try that image, my job runs successfully! |
Thank you very much for this feedback. Let me do more testing on this change just to be sure it does not break any other functionality. |
This fix is now present in the default NERSC Shifter images |
The NERSC documentation for NWChem was updated with the current Shifter information for Perlmutter: https://docs.nersc.gov/applications/nwchem/#slurm-script-for-nwchem-shifter-image-on-perlmutter-cpus |
Thanks @edoapra! Just a heads up that we're working on a new container runtime called It's still in an early phase with several known issues, but I wanted to put it on your radar since we may eventually retire Shifter in favor of podman-hpc (timeframe ~years, so no urgent action required). |
Is podman available for any user on Perlmutter at this point in time? |
Yes, it's open to all users without any additional configuration required. Anyone can test today. |
Describe the bug
When I try to run a geometry optimization followed by DFT frequency calculation, the program fails after the geometry optimization. The last thing in the .out file is "Multipole analysis of the density". The error message I am getting is:
Describe settings used
I am using these environment variables:
export OMP_NUM_THREADS=2
export OMP_PROC_BIND=spread
export MPICH_GNI_MAX_EAGER_MSG_SIZE=131026
export MPICH_GNI_NUM_BUFS=80
export MPICH_GNI_NDREG_MAXSIZE=16777216
export MPICH_GNI_MBOX_PLACEMENT=nic
export MPICH_GNI_RDMA_THRESHOLD=65536
export COMEX_MAX_NB_OUTSTANDING=6
At first I got this error after 3 minutes:
With help from @lastephey, I added these environment variables. They allowed the geometry optimization to run, but then I got the error described above when the job started calculating the vibrational frequencies:
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000
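When comparing runs with different settings, it can help to print the variables into the job's stdout. The following is a minimal sketch assuming a bash batch script; the values are the ones listed above, and the `grep` pattern is only an illustrative way to select them.

```shell
export CXI_FORK_SAFE=1
export CXI_FORK_SAFE_HP=1
export FI_CXI_RX_MATCH_MODE=hybrid
export FI_CXI_DEFAULT_CQ_SIZE=128000
# Print every CXI-related variable so the settings are recorded in the job log.
env | grep -E '^(CXI_|FI_CXI_)' | sort
```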
Report what operating system and distribution you are using.
SUSE Linux Enterprise Server 15 SP4
Attach log files
files.zip contains my submission script, nwchem input, starting geometry, and stdout/stderr
To Reproduce
Run NWChem using the attached input and environment variables with the docker image
Expected behavior
I expected the program to complete and calculate the energy and frequencies.
Additional context
I am running this on the Perlmutter cluster at the National Energy Research Scientific Computing Center (NERSC).