-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enforcing system abort when there is negative channel storage #6313
Enforcing system abort when there is negative channel storage #6313
Conversation
The change looks okay for me. But I am not sure whether the e3sm_mosart_developer tests are sufficient. @bishtgautam Could you comment on it? |
Sorry, I missed the discussion on this PR. IMO, the code changes here are good to go. Thanks. |
@tanzeli1982 @hydrotian still needs review |
@liho745 The
|
I believe @donghuix created this test. Maybe he can comment on that. |
I am running |
@hydrotian @peterdschwartz @liho745
So, I suspect the negative channel water storage is due to the selected MOSART input file: |
@donghuix does this test use DW instead of KW? |
This test uses KW, the default routing method. |
@donghuix Did you turn on the inundation flag in your tests? Any other MOSART flags? |
@liho745 Yes, inundation was turned on and hypothetical elevation profile was used. |
Could you turn off inundation and then try again? If the negative channel storage error disappears by doing so, we can narrow down the likely cause to the inundation module or its associated hypothetical elevation profile. |
Actually, @peterdschwartz Did the other ELM tests pass at your side? If so, then we can be more certain that the error comes from the inundation module or elevation profile, since in most ELM tests MOSART is included with the KW option but any other flags off. |
@liho745 Yes, that test was the only one that failed |
@liho745 @hydrotian @peterdschwartz I made a mistake in previous tests that inundation scheme causes negative channel storage. Now, I confirm the negative channel storage is caused by two-way coupling. If two-way coupling is turned off, inundation scheme does not cause negative storage. The reason for the two-way coupling is because the one time step shift when ELM and MOSART are coupled. For example, the floodplain infiltration should be constrained by the floodplain inundation volume. But when the inundation volume is send back from ELM to MOSART, the infiltration on the floodplain can be larger than the floodplain inundation volume because they are not from the same period. To force mass balance, if the floodplain infiltration is larger than the floodplain inundation volume, I removed the additional infiltration from the main channel. Since there is no elevation profile in the selected MOSART input file, I used the hypothetical elevation profile option. This hypothetical elevation profile is relative mild, which can result in unrealistic larger floodplain inundation area than the main channel area. To fix this issue, I propose to change the MOSART input file in the land river two-way test. Hopefully, the floodplain inundation will be well constrained with a realistic elevation profile, which will not cause unrealistic floodplain infiltration. Please let me know if this makes sense. If you agree with the plan, I can go ahead to test with another MOSART input file with realistic elevation profile. |
@donghuix Your plan sounds good to me. If you want to take this opportunity and make the two-way coupling code more robust (i.e., dealing with some extreme situation like this hypothetical elevation profile), that'd be even better. |
@liho745 did you mean to close this PR ? |
This PR is still useful, as it can help prevent future negative storage issues when adding or modifying MOSART-related features. My suggestion is 1) hold this PR; 2) @donghuix fixes the issue in the two-way coupling module and issue his PR on that; 3) the current PR can then be merged into the master branch after Donghui's PR. |
@rljacob @liho745 @hydrotian @tanzeli1982 @peterdschwartz I propose to change If this works with everyone, I can issue a PR to change the |
Sounds good to me. I am traveling in May 2-22. If there is anything for me to work on, it'd be after May 22. |
I am reopening this PR and I will aim to merge it on the same day as PR #6388 or the day after. |
) ERS.f09_f09.IELM.elm-lnd_rof_2way in the test suits results in negative channel water storage in MOSART. This was raised in #6313. The reason is that f09_f09 uses different spatial resolutions in ELM and MOSART. This is problematic for land river two-way coupling, which requires the same grid in ELM and MOSART. This PR changed f09_f09 to r05_r05 for the land river two-way test, and there is no negative main channel storage. [BFB]
merged to next |
ERS.f09_f09.IELM.elm-lnd_rof_2way in the test suits results in negative channel water storage in MOSART. This was raised in PR #6313. The reason is that f09_f09 uses different spatial resolutions in ELM and MOSART. This is problematic for land river two-way coupling, which requires the same grid in ELM and MOSART. This PR changed f09_f09 to r05_r05 for the land river two-way test, and there is no negative main channel storage. [BFB]
@peterdschwartz -- as part of the coupled model team, we've hit this abort in at least two recent runs, one LR and one HR. Both times the negative value was extremely small, on the order of -1e-09. In order to get those runs to progress, we've played with reverting this PR in testing codebases. Do you think it might be reasonable to have a different abort threshold, just in cases there are negative values that might be more roundoff-ish like these? Or maybe a warning between 0 and some small negative number and then abort if the negative number is larger? |
@jonbob What MOSART modules are turned on in these two recent runs? ELM-MOSART two-way coupling? Inundation? Or just default options? The last suggestion will work as a temporary solution to get things going. Perhaps we can make this negative number as -10-8 or something like that? In the near future, a more thorough investigation into the code may be needed. |
@liho745 As far as I understand, there's no additional feature turned on in these runs and they were using KW. |
Thanks, @hydrotian. I will see whether I can either find a reasonable but large enough threshold for all cases, or make further changes in the code to avoid negative channel storage completely. |
Thanks @liho745, @hydrotian, @bishtgautam |
Just an FYI that I'm also seeing an IG simulation (I case w/ active Greenland component) killed on PM-cpu as a result of this error:
|
I was trying to rerun part of v3.LR.piControl with additional output (simulation part of the official v3.LR simulation campaign). But the rerun fails because of this PR. I verified that the simulation if BFB with the original one until the point when it stops:
For details, see I think we need to revert this PR since it is obviously not BFB as it prevents us from rerunning existing v3.LR simulations. |
With Jon's help, I have submitted a new PR, #6568, to fix this issue.
…On Mon, Sep 16, 2024 at 6:43 PM golaz ***@***.***> wrote:
I was trying to rerun part of v3.LR.piControl with additional output
(simulation part of the official v3.LR simulation campaign). But the rerun
fails because of this PR. I verified that the simulation if BFB with the
original one until the point when it stops:
956: Error: Negative channel storage found! -3.725290298461914E-009
956: ERROR: mosart: negative channel storage
956: Image PC Routine Line Source
956: libpnetcdf.so.3.0 000015554B93A0CA tracebackqq_ Unknown Unknown
956: e3sm.exe 0000000005EDF3D0 shr_abort_mod_mp_ 114 shr_abort_mod.F90
956: e3sm.exe 0000000005B00A82 mosart_physics_mo 686 MOSART_physics_mod.F90
956: e3sm.exe 00000000059DA141 rtmmod_mp_rtmrun_ 2603 RtmMod.F90
956: e3sm.exe 000000000594A27C rof_comp_mct_mp_r 472 rof_comp_mct.F90
956: e3sm.exe 00000000004685DB component_mod_mp_ 757 component_mod.F90
956: e3sm.exe 0000000000431E9E cime_comp_mod_mp_ 2975 cime_comp_mod.F90
956: e3sm.exe 0000000000468210 MAIN__ 153 cime_driver.F90
956: e3sm.exe 0000000000426222 Unknown Unknown Unknown
956: libc-2.28.so 0000155545051D85 __libc_start_main Unknown Unknown
956: e3sm.exe 000000000042612E Unknown Unknown Unknown
956: --------------------------------------------------------------------------
956: MPI_ABORT was invoked on rank 956 in communicator MPI_COMM_WORLD
956: with errorcode 1001.
For details, see
/lcrc/group/e3sm/ac.golaz/E3SMv3/v3.LR.piControl_bonus/run/rof.log.584407.240915-100856
I think we need to revert this PR since it is obviously not BFB as it
prevents us from rerunning existing v3.LR simulations.
—
Reply to this email directly, view it on GitHub
<#6313 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AC5QWBVGV6OF4AZOYVCANJDZW5UJJAVCNFSM6AAAAABFN336T6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNJUGIYTSNRWG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
…#6623) Revert PR #6313 aborting run on negative channel storage The enforced abort whenever MOSART has negative channel storage may prevent v3.LR simulations from finishing, if the code base that includes to ensure new v3.LR simulations that need to use the latest commits to master are free from this issue. Fixes #6622. [BFB]
Revert PR #6313 aborting run on negative channel storage The enforced abort whenever MOSART has negative channel storage may prevent v3.LR simulations from finishing, if the code base that includes to ensure new v3.LR simulations that need to use the latest commits to master are free from this issue. Fixes #6622. [BFB]
…uild * upstream/master: (194 commits) Everything working now Fixes for mpi-serial on mappy Revert "Merge remote-tracking branch 'liho/liho745/river/bug-fix-7792c63' (PR #6313)" Get mappy working again with RHEL9 Bump DavidAnson/markdownlint-cli2-action from 16 to 17 Log PIO buffer size limit for default case Update sim_year_range in 20thC_transient.xml Updates cprnc version to match the new compiler version Add LND_FRC_DUST_MBL to elm.h0 for wcprod mods Reset sim_year_range to 1850-2015 in namelist_defaults for flanduse_timeseries Updates the long_name adjust artifacts to reflect recent edits update ghci container to restore gh/ci Enables output of land fraction used in dust mobilization update compy intel compiler Adjust PEs for MPAS dev-tests on Anvil Reduce test lengths Update CIME submodule with a new fix commit bug fix for N balance error due to spval value for land use simulation Explicitly set dust_emis_scheme=2 for MMF ...
To summarize, this PR was reverted so that v3.0 cases would not change behavior. The fixes in this PR were later put back on master (after the maint-3.0 branch was created) in #6568. |
Enforces MOSART to stop when there is negative channel storage for any tracer. The e3sm_mosart_developer tests all passed at Compy BFB.
Fixes #6302
[BFB]