This issue appears when running WRF in dm+sm (MPI + OpenMP) mode. It was reported on aarch64 (Graviton3, Neoverse V1). The symptom is that WRF calls `MPI_Abort` but doesn't print any message. Re-running the same input often succeeds, and failures only happen occasionally (typically on the first timestep).

Upon further investigation, it seems that a non-master thread is calling `wrf_error_fatal` from here: https://github.com/NCAR/noahmp/blob/release-v4.5-WRF/src/module_sf_noahmplsm.F#L1727. However, none of the messages are printed, because in `wrf_message` all output is guarded by an `!$OMP MASTER` block, and the error is being triggered from non-master threads.
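A minimal sketch of that guard pattern, with illustrative names rather than the actual WRF sources: a routine containing an orphaned `!$OMP MASTER` block only writes when it happens to be called by the master thread of the team, so a fatal message raised from any other thread is silently dropped.

```fortran
! Sketch only: shows how an !$OMP MASTER guard can swallow messages from
! non-master threads. Names are hypothetical, not the WRF implementation.
program master_guard_demo
  use omp_lib
  implicit none
!$omp parallel
  call guarded_message('fatal error from a worker thread')
!$omp end parallel
contains
  subroutine guarded_message(msg)
    character(len=*), intent(in) :: msg
    ! Orphaned MASTER construct: only the team's master thread executes it,
    ! so a call made from any other thread prints nothing at all.
!$omp master
    write(*,'(a,i0,2a)') 'thread ', omp_get_thread_num(), ': ', msg
!$omp end master
  end subroutine guarded_message
end program master_guard_demo
```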
With the print enabled, we found a few grid points would occasionally lose water on the order of >0.1 but <1 kg/m^2/dt. Investigation into the error cause showed that the scalar terms contributing to the water balance were identical between failing and successful runs; the primary difference was in the soil moisture. Diffing the output dataset showed no corrupt-looking data, only small differences induced by the stochastic energy flux methods.
Eventually I discovered what I believe to be the root cause: `calculate_soil` is assigned twice within `noahmplsm`. First it is set to `.false.`; then, if a modulo test is zero, it is set to `.true.`. However, the variable is scoped to the whole module, so all threads share the storage of `calculate_soil`. This leaves the potential for thread B to have already passed this initialization block and to read the value while thread A is between the `.false.` and `.true.` assignments, so thread B can observe an inconsistent value of `calculate_soil` during the subroutine execution.
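A minimal sketch of the race described above, with simplified, hypothetical names rather than the noahmp source: because the flag lives at module scope, every thread writes and reads the same storage.

```fortran
! Sketch only: a module-scope logical shared by all OpenMP threads is
! re-initialized on every call, so another thread can observe it between
! the .false. and .true. assignments. Names are illustrative.
module soil_flag_sketch
  implicit none
  logical :: calculate_soil = .false.   ! module scope => one copy shared by all threads
contains
  subroutine lsm_step(itime, soil_timestep_ratio, updated)
    integer, intent(in)  :: itime, soil_timestep_ratio
    logical, intent(out) :: updated
    calculate_soil = .false.                          ! thread A writes here...
    if (mod(itime, soil_timestep_ratio) == 0) then
      calculate_soil = .true.                         ! ...and possibly here
    end if
    ! Thread B, running lsm_step concurrently for another grid point, can read
    ! calculate_soil right now and see whichever value thread A last wrote.
    updated = calculate_soil
  end subroutine lsm_step
end module soil_flag_sketch
```

Making `calculate_soil` local to the subroutine, or declaring it `!$omp threadprivate`, would give each thread its own copy and remove the race; whether the linked commit takes exactly this approach is not confirmed here.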
lrbison added a commit to lrbison/noahmp that referenced this issue on Jul 17, 2024.