NorESM2.5 crash seemingly related to PIO_STRIDE setting #593
Comments
@matsbn, do you have a |
The env_mach_pes.xml is pasted in below (could not attach xml-files) or did you actually mean the file pelayout in the case folder?
|
Sorry, I meant the output from the |
@jedwards4b - do you have any input about using a PIO_STRIDE that is greater than the number of tasks for a given component? This applies to the statement from @matsbn - |
There should be code that prevents this situation from occurring. ICE_PIO_STRIDE needs to be <= ICE_NTASKS |
@jedwards4b - thanks for confirming this. I don't think there is code in cime to prevent this. |
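A minimal sketch of what such a guard could look like, assuming the rule stated above (a component's PIO stride must not exceed its task count). The function name `check_pio_stride` and the example task counts are hypothetical and are not taken from CIME.

```python
def check_pio_stride(component: str, ntasks: int, pio_stride: int) -> None:
    """Raise ValueError if the PIO stride exceeds the component's task count.

    Hypothetical helper for illustration only; this is not existing CIME code.
    """
    if ntasks < 1:
        raise ValueError(f"{component}_NTASKS must be at least 1, got {ntasks}")
    # Assumption: a non-positive stride means "unset", so there is nothing to check.
    if pio_stride <= 0:
        return
    if pio_stride > ntasks:
        raise ValueError(
            f"{component}_PIO_STRIDE={pio_stride} exceeds "
            f"{component}_NTASKS={ntasks}; the stride must be <= the task count"
        )


# Example values are made up: 64 ICE tasks with the stride reduced to 8 passes.
check_pio_stride("ICE", ntasks=64, pio_stride=8)
```

Calling it with a stride of 128 and only 64 tasks would raise a `ValueError` before the run starts, instead of failing later inside an MPI call.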
Describe the bug
123: [b3165:425392] *** An error occurred in MPI_Gather
123: [b3165:425392] *** reported by process [23226195771392,123]
123: [b3165:425392] *** on communicator MPI COMMUNICATOR 49 SPLIT FROM 44
123: [b3165:425392] *** MPI_ERR_TRUNCATE: message truncated
123: [b3165:425392] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
123: [b3165:425392] *** and potentially your MPI job)
0: slurmstepd: error: *** STEP 1033510.0 ON b3164 CANCELLED AT 2024-11-19T15:13:33 ***
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The model stops unexpectedly during what I believe is an I/O operation. Reducing PIO_STRIDE in env_run.xml from $MAX_MPITASKS_PER_NODE (128 on Betzy) to 8 solves the problem. One hypothesis is that $MAX_MPITASKS_PER_NODE is larger than the CICE processor count in this configuration, which leads to the stop during I/O. The value 8 is rather arbitrary; the functioning and performance of other PIO_STRIDE settings have not been explored.
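To make the hypothesis concrete, here is a small illustration of how a stride maps onto a component's ranks. It assumes I/O ranks are chosen as root, root + stride, root + 2*stride, and so on, and uses a made-up CICE task count of 64, since the actual count is not given here. It does not attempt to reproduce the crash; it only shows how few ranks fit when the stride is larger than the task count.

```python
def io_ranks(ntasks: int, stride: int, root: int = 0) -> list[int]:
    """Return the component-local ranks selected as I/O tasks.

    Illustrative only: assumes ranks root, root + stride, ... below ntasks.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(root, ntasks, stride))


cice_ntasks = 64  # hypothetical value, not taken from the failing case

# Stride equal to MAX_MPITASKS_PER_NODE on Betzy: only the root rank fits.
print(io_ranks(cice_ntasks, 128))

# Stride reduced to 8: eight evenly spaced I/O ranks.
print(io_ranks(cice_ntasks, 8))
```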