This issue appears when running WRF in dm+sm (MPI + OpenMP) mode. It was reported on aarch64 (Graviton3, Neoverse V1). The symptom is that WRF calls `MPI_Abort` but doesn't print any message. Re-running the same input often succeeds, and failures only happen occasionally (typically on the first timestep).

Upon further investigation, it seems that a non-master thread is calling `wrf_error_fatal` from here: https://github.com/NCAR/noahmp/blob/release-v4.5-WRF/src/module_sf_noahmplsm.F#L1727. However, none of the messages are printed, because in `wrf_message` all output is guarded by an `!$OMP MASTER` block, and the error is being triggered from non-master threads.
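A minimal sketch of that guard pattern, with illustrative names rather than the actual WRF sources: a routine containing an orphaned `!$OMP MASTER` block only writes when it happens to be called by the master thread of the team, so a fatal message raised from any other thread is silently dropped.

```fortran
! Sketch only: shows how an !$OMP MASTER guard can swallow messages from
! non-master threads. Names are hypothetical, not the WRF implementation.
program master_guard_demo
  use omp_lib
  implicit none
!$omp parallel
  call guarded_message('fatal error from a worker thread')
!$omp end parallel
contains
  subroutine guarded_message(msg)
    character(len=*), intent(in) :: msg
    ! Orphaned MASTER construct: only the team's master thread executes it,
    ! so a call made from any other thread prints nothing at all.
!$omp master
    write(*,'(a,i0,2a)') 'thread ', omp_get_thread_num(), ': ', msg
!$omp end master
  end subroutine guarded_message
end program master_guard_demo
```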
With the print enabled, we found a few grid points would occasionally lose water on the order of >0.1 but <1 kg/m^2/dt. Investigation into the error cause showed that the scalar terms contributing to the water balance were identical between failing and successful runs; the primary difference was in the soil moisture. Diffing the output dataset showed no corrupt-looking data, only small differences induced by the stochastic energy flux methods.
Eventually I discovered what I believe to be the root cause: `calculate_soil` is assigned twice within `noahmplsm`. First it is set to `.false.`; then, if a modulo test is zero, it is set to `.true.`. However, the variable is scoped to the whole module, so all threads share the storage of `calculate_soil`. This leaves the potential for thread B to have already passed this initialization block and to read the value while thread A is between the `.false.` and `.true.` assignments, so thread B can observe an inconsistent value of `calculate_soil` during the subroutine execution.
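A minimal sketch of the race described above, with simplified, hypothetical names rather than the noahmp source: because the flag lives at module scope, every thread writes and reads the same storage.

```fortran
! Sketch only: a module-scope logical shared by all OpenMP threads is
! re-initialized on every call, so another thread can observe it between
! the .false. and .true. assignments. Names are illustrative.
module soil_flag_sketch
  implicit none
  logical :: calculate_soil = .false.   ! module scope => one copy shared by all threads
contains
  subroutine lsm_step(itime, soil_timestep_ratio, updated)
    integer, intent(in)  :: itime, soil_timestep_ratio
    logical, intent(out) :: updated
    calculate_soil = .false.                          ! thread A writes here...
    if (mod(itime, soil_timestep_ratio) == 0) then
      calculate_soil = .true.                         ! ...and possibly here
    end if
    ! Thread B, running lsm_step concurrently for another grid point, can read
    ! calculate_soil right now and see whichever value thread A last wrote.
    updated = calculate_soil
  end subroutine lsm_step
end module soil_flag_sketch
```

Making `calculate_soil` local to the subroutine, or declaring it `!$omp threadprivate`, would give each thread its own copy and remove the race; whether the linked commit takes exactly this approach is not confirmed here.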
lrbison added a commit to lrbison/noahmp that referenced this issue on Jul 17, 2024.