Enforcing system abort when there is negative channel storage #6313

Merged

Conversation

liho745
Contributor

@liho745 liho745 commented Mar 29, 2024

Enforces MOSART to stop when there is negative channel storage for any tracer. The e3sm_mosart_developer tests all passed at Compy BFB.

Fixes #6302
[BFB]
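
As a rough illustration of the behavior described above (not the actual change, which lives in MOSART's Fortran source), the check amounts to scanning the channel storage of every tracer and stopping on the first negative value. The function name, the exception type, the tracer names, and the dictionary layout below are all assumptions made for this sketch.

```python
class NegativeChannelStorageError(RuntimeError):
    """Raised when any tracer has negative channel storage (hypothetical type)."""


def check_channel_storage(storage_by_tracer):
    """Abort on the first negative channel storage value.

    storage_by_tracer maps a tracer name to a list of per-cell storage
    values. This layout is an assumption for illustration only.
    """
    for tracer, values in storage_by_tracer.items():
        for cell, value in enumerate(values):
            if value < 0.0:
                raise NegativeChannelStorageError(
                    f"Negative channel storage found! {value} "
                    f"(tracer={tracer}, cell={cell})"
                )


if __name__ == "__main__":
    # Hypothetical tracer names and values; all non-negative, so this passes.
    check_channel_storage({"liquid": [12.5, 0.0], "ice": [3.0, 1.0]})
    try:
        check_channel_storage({"liquid": [12.5, -1.0]})
    except NegativeChannelStorageError as err:
        print(f"ERROR: mosart: {err}")
```

In this sketch any negative value, however small, stops the run; there is no tolerance.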

@tanzeli1982
Contributor

The change looks okay to me, but I am not sure whether the e3sm_mosart_developer tests are sufficient. @bishtgautam Could you comment on it?

@bishtgautam
Contributor

Sorry, I missed the discussion on this PR. IMO, the code changes here are good to go. Thanks.

@peterdschwartz
Contributor

@tanzeli1982 @hydrotian still needs review

@peterdschwartz
Contributor

@liho745 The ERS.f09_f09.IELM.pm-cpu_intel.elm-lnd_rof_2way fails due to negative channel storage

0:  ERROR: mosart: negative channel storage
  1:  Error: Negative channel storage found!   -261987.303320042
  1:  ERROR: mosart: negative channel storage
126:  Error: Negative channel storage found!   -121399.288355927
126:  ERROR: mosart: negative channel storage

@hydrotian
Contributor

I believe @donghuix created this test. Maybe he can comment on that.

@donghuix
Contributor

I am running ERS.f09_f09.IELM.pm-cpu_intel.elm-lnd_rof_2way to see why the negative storage occurs.

@donghuix
Contributor

@hydrotian @peterdschwartz @liho745

  1. I reproduced the negative channel water storage warning with ERS.f09_f09.IELM.pm-cpu_intel.elm-lnd_rof_2way.
  2. I reran this test with lnd_rof_two_way turned off to see whether the two-way coupling causes the negative water storage. This simulation also produced negative channel water storage.

So, I suspect the negative channel water storage is due to the selected MOSART input file: /global/cfs/cdirs/e3sm/inputdata/rof/mosart/MOSART_global_half_20180721a.nc.

@hydrotian
Contributor

@donghuix does this test use DW instead of KW?

@donghuix
Contributor

This test uses KW, the default routing method.

@liho745
Contributor Author

liho745 commented Apr 27, 2024

@donghuix Did you turn on the inundation flag in your tests? Any other MOSART flags?

@donghuix
Contributor

@liho745 Yes, inundation was turned on and the hypothetical elevation profile was used.

@liho745
Contributor Author

liho745 commented Apr 27, 2024

Could you turn off inundation and then try again? If the negative channel storage error disappears by doing so, we can narrow down the likely cause to the inundation module or its associated hypothetical elevation profile.

@liho745
Contributor Author

liho745 commented Apr 27, 2024

Actually, @peterdschwartz, did the other ELM tests pass on your side? If so, we can be more certain that the error comes from the inundation module or the elevation profile, since most ELM tests include MOSART with the KW option and all other flags off.

@peterdschwartz
Contributor

@liho745 Yes, that test was the only one that failed

@donghuix
Contributor

donghuix commented May 1, 2024

@liho745 @hydrotian @peterdschwartz

I made a mistake in my previous tests when I concluded that the inundation scheme causes the negative channel storage. I can now confirm that the negative channel storage is caused by two-way coupling. If two-way coupling is turned off, the inundation scheme does not cause negative storage.

Two-way coupling causes this because of the one-time-step shift between ELM and MOSART when they are coupled. For example, the floodplain infiltration should be constrained by the floodplain inundation volume. But when the inundation volume is sent back from ELM to MOSART, the infiltration on the floodplain can be larger than the floodplain inundation volume because the two are not from the same period.

To enforce mass balance, if the floodplain infiltration is larger than the floodplain inundation volume, I removed the additional infiltration from the main channel. Since there is no elevation profile in the selected MOSART input file, I used the hypothetical elevation profile option. This hypothetical elevation profile is relatively mild, which can result in a floodplain inundation area that is unrealistically larger than the main channel area.
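
The mass-balance step described above can be sketched as follows. This is a minimal illustration, not the ELM/MOSART code: the function name, the argument names, and the numbers in the example are hypothetical. It shows how an infiltration value computed against an earlier, larger inundation volume can push main channel storage below zero once the excess is charged to the channel.

```python
def remove_floodplain_infiltration(channel_storage, floodplain_volume, infiltration):
    """Take infiltration from the floodplain first, then from the main channel.

    All quantities are volumes in the same (arbitrary) unit. Returns the
    updated (channel_storage, floodplain_volume) pair.
    """
    from_floodplain = min(infiltration, floodplain_volume)
    excess = infiltration - from_floodplain  # part the floodplain cannot supply
    return channel_storage - excess, floodplain_volume - from_floodplain


if __name__ == "__main__":
    # Infiltration was computed against an earlier, larger inundation volume,
    # so it exceeds what the floodplain and the channel hold now.
    channel, floodplain = remove_floodplain_infiltration(
        channel_storage=100.0, floodplain_volume=50.0, infiltration=200.0
    )
    print(channel, floodplain)  # -50.0 0.0
```

When the infiltration fits within the floodplain volume, the channel is left untouched; the negative value only appears when the excess term is larger than the channel storage.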

To fix this issue, I propose to change the MOSART input file in the land river two-way test. Hopefully, the floodplain inundation will be well constrained with a realistic elevation profile, which will not cause unrealistic floodplain infiltration.

Please let me know if this makes sense. If you agree with the plan, I can go ahead and test with another MOSART input file that has a realistic elevation profile.

@liho745 liho745 closed this May 1, 2024
@liho745
Contributor Author

liho745 commented May 1, 2024

@donghuix Your plan sounds good to me. If you want to take this opportunity and make the two-way coupling code more robust (i.e., dealing with some extreme situation like this hypothetical elevation profile), that'd be even better.

@E3SM-Project E3SM-Project deleted a comment from liho745 May 1, 2024
@rljacob
Member

rljacob commented May 1, 2024

@liho745 did you mean to close this PR?

@liho745
Contributor Author

liho745 commented May 1, 2024

This PR is still useful, as it can help prevent future negative storage issues when adding or modifying MOSART-related features. My suggestion is: 1) hold this PR; 2) @donghuix fixes the issue in the two-way coupling module and opens a PR for it; 3) the current PR can then be merged into the master branch after Donghui's PR.

@donghuix
Contributor

donghuix commented May 2, 2024

@rljacob @liho745 @hydrotian @tanzeli1982 @peterdschwartz
With more tests, I found that the negative storage is caused by the resolution mismatch between the ELM and MOSART domains. For example, ERS.f09_f09.IELM.pm-cpu_intel.elm-lnd_rof_2way uses the f09_f09 grid, in which ELM is at 0.9x1.25 but MOSART is at r05.

I propose changing ERS.f09_f09.IELM.pm-cpu_intel.elm-lnd_rof_2way to ERS.r05_r05.IELM.pm-cpu_intel.elm-lnd_rof_2way in the test suite. I ran the test with the r05_r05 grid and there is no negative storage in MOSART.

If this works for everyone, I can open a PR to change the lnd_rof_2way test.

@liho745
Contributor Author

liho745 commented May 2, 2024

Sounds good to me. I am traveling May 2-22. If there is anything for me to work on, it'd be after May 22.

@peterdschwartz
Contributor

I am reopening this PR and I will aim to merge it on the same day as PR #6388 or the day after.

@peterdschwartz peterdschwartz reopened this May 2, 2024
peterdschwartz added a commit that referenced this pull request May 9, 2024
)

ERS.f09_f09.IELM.elm-lnd_rof_2way in the test suite results in negative channel water storage in MOSART.
This was raised in #6313.
The reason is that f09_f09 uses different spatial resolutions in ELM and MOSART.
This is problematic for land river two-way coupling, which requires the same grid in ELM and MOSART.

This PR changed f09_f09 to r05_r05 for the land river two-way test, and there is no negative main channel storage.

[BFB]
peterdschwartz added a commit that referenced this pull request May 9, 2024
…o next(PR #6313)

Enforces MOSART to stop when there is negative channel storage for any tracer. The e3sm_mosart_developer tests all passed at Compy BFB.

Fixes #6302
[BFB]
@peterdschwartz
Contributor

merged to next

peterdschwartz added a commit that referenced this pull request May 10, 2024
ERS.f09_f09.IELM.elm-lnd_rof_2way in the test suite results in negative channel water storage in MOSART.
This was raised in PR #6313.
The reason is that f09_f09 uses different spatial resolutions in ELM and MOSART.
This is problematic for land river two-way coupling, which requires the same grid in ELM and MOSART.

This PR changed f09_f09 to r05_r05 for the land river two-way test, and there is no negative main channel storage.

[BFB]
@peterdschwartz peterdschwartz merged commit 427b860 into E3SM-Project:master May 10, 2024
21 checks passed
@jonbob
Contributor

jonbob commented Jun 11, 2024

@peterdschwartz -- as part of the coupled model team, we've hit this abort in at least two recent runs, one LR and one HR. Both times the negative value was extremely small, on the order of -1e-09. In order to get those runs to progress, we've played with reverting this PR in testing codebases. Do you think it might be reasonable to have a different abort threshold, just in case there are negative values that might be more roundoff-ish like these? Or maybe a warning between 0 and some small negative number and then abort if the negative number is larger?
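
A tiered check along the lines suggested here might look like the sketch below. It is only an illustration under stated assumptions: the function name and the tolerance argument are hypothetical, and no particular tolerance value is endorsed by this thread.

```python
def classify_channel_storage(value, tolerance):
    """Classify a storage value as 'ok', 'warn', or 'abort'.

    tolerance is a small non-negative number (hypothetical parameter): values
    in [-tolerance, 0) are treated as roundoff and only warned about, while
    anything more negative aborts.
    """
    if tolerance < 0.0:
        raise ValueError("tolerance must be non-negative")
    if value >= 0.0:
        return "ok"
    if value >= -tolerance:
        return "warn"
    return "abort"


if __name__ == "__main__":
    tol = 1.0e-6  # arbitrary placeholder, not a recommended value
    for v in (5.0, -1.0e-9, -2.0):
        print(v, classify_channel_storage(v, tol))
```

Keeping the tolerance as an explicit argument makes the trade-off visible: a tolerance of zero reproduces the strict abort, while a larger one lets roundoff-sized values through with a warning.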

@bishtgautam
Contributor

@liho745, please see @jonbob's comment/question above.

@liho745
Contributor Author

liho745 commented Jun 11, 2024

@jonbob What MOSART modules are turned on in these two recent runs? ELM-MOSART two-way coupling? Inundation? Or just default options? The last suggestion will work as a temporary solution to get things going. Perhaps we can set this negative threshold to -1e-08 or something like that? In the near future, a more thorough investigation into the code may be needed.

@hydrotian
Contributor

@liho745 As far as I understand, there's no additional feature turned on in these runs and they were using KW.

@liho745
Contributor Author

liho745 commented Jun 12, 2024

Thanks, @hydrotian. I will see whether I can either find a reasonable but large enough threshold for all cases, or make further changes in the code to avoid negative channel storage completely.

@jonbob
Contributor

jonbob commented Jun 12, 2024

Thanks @liho745, @hydrotian, @bishtgautam

@stephenprice
Contributor

Just an FYI that I'm also seeing an IG simulation (I case w/ active Greenland component) killed on PM-cpu as a result of this error:

Error: Negative channel storage found!  -1.4901161193847656E-008
ERROR: mosart: negative channel storage

@golaz
Contributor

golaz commented Sep 16, 2024

I was trying to rerun part of v3.LR.piControl with additional output (this simulation is part of the official v3.LR simulation campaign). But the rerun fails because of this PR. I verified that the simulation is BFB with the original one until the point where it stops:

956:  Error: Negative channel storage found!  -3.725290298461914E-009
956:  ERROR: mosart: negative channel storage
956: Image              PC                Routine            Line        Source             
956: libpnetcdf.so.3.0  000015554B93A0CA  tracebackqq_          Unknown  Unknown
956: e3sm.exe           0000000005EDF3D0  shr_abort_mod_mp_         114  shr_abort_mod.F90
956: e3sm.exe           0000000005B00A82  mosart_physics_mo         686  MOSART_physics_mod.F90
956: e3sm.exe           00000000059DA141  rtmmod_mp_rtmrun_        2603  RtmMod.F90
956: e3sm.exe           000000000594A27C  rof_comp_mct_mp_r         472  rof_comp_mct.F90
956: e3sm.exe           00000000004685DB  component_mod_mp_         757  component_mod.F90
956: e3sm.exe           0000000000431E9E  cime_comp_mod_mp_        2975  cime_comp_mod.F90
956: e3sm.exe           0000000000468210  MAIN__                    153  cime_driver.F90
956: e3sm.exe           0000000000426222  Unknown               Unknown  Unknown
956: libc-2.28.so       0000155545051D85  __libc_start_main     Unknown  Unknown
956: e3sm.exe           000000000042612E  Unknown               Unknown  Unknown
956: --------------------------------------------------------------------------
956: MPI_ABORT was invoked on rank 956 in communicator MPI_COMM_WORLD
956: with errorcode 1001.

For details, see
/lcrc/group/e3sm/ac.golaz/E3SMv3/v3.LR.piControl_bonus/run/rof.log.584407.240915-100856

I think we need to revert this PR since it is obviously not BFB as it prevents us from rerunning existing v3.LR simulations.
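
As a side observation on the small values reported in this thread (-1.4901161193847656E-008 and -3.725290298461914E-009): both are exact negative powers of two, which is consistent with floating-point roundoff rather than a physical deficit. That reading is an inference, not something established in the discussion. The helper below is a hypothetical sketch for checking it.

```python
import math


def power_of_two_exponent(x):
    """Return n if abs(x) == 2**n exactly, otherwise None."""
    if x == 0.0:
        return None
    mantissa, exponent = math.frexp(abs(x))  # abs(x) == mantissa * 2**exponent
    # frexp returns a mantissa in [0.5, 1), so an exact power of two has 0.5.
    return exponent - 1 if mantissa == 0.5 else None


if __name__ == "__main__":
    for value in (-1.4901161193847656e-08, -3.725290298461914e-09):
        print(value, power_of_two_exponent(value))
```

The two values evaluate to exponents -26 and -28. Note that the first one is larger in magnitude than 1e-08, so a threshold of that size would still abort on it.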

@liho745
Contributor Author

liho745 commented Sep 17, 2024 via email

wlin7 added a commit that referenced this pull request Sep 17, 2024
…c63' (PR #6313)"

This reverts commit 427b860, reversing
changes made to cf32a25.
wlin7 added a commit that referenced this pull request Sep 17, 2024
…#6623)

Revert PR #6313 aborting run on negative channel storage

The enforced abort whenever MOSART has negative channel storage
may prevent v3.LR simulations from finishing, if the code base that includes
to ensure new v3.LR simulations that need to use the latest commits to master
are free from this issue.

Fixes #6622.

[BFB]
wlin7 added a commit that referenced this pull request Sep 18, 2024
Revert PR #6313 aborting run on negative channel storage

The enforced abort whenever MOSART has negative channel storage
may prevent v3.LR simulations from finishing, if the code base that includes
to ensure new v3.LR simulations that need to use the latest commits to master
are free from this issue.

Fixes #6622.

[BFB]
jgfouca added a commit that referenced this pull request Oct 9, 2024
…uild

* upstream/master: (194 commits)
  Everything working now
  Fixes for mpi-serial on mappy
  Revert "Merge remote-tracking branch 'liho/liho745/river/bug-fix-7792c63' (PR #6313)"
  Get mappy working again with RHEL9
  Bump DavidAnson/markdownlint-cli2-action from 16 to 17
  Log PIO buffer size limit for default case
  Update sim_year_range in 20thC_transient.xml
  Updates cprnc version to match the new compiler version
  Add LND_FRC_DUST_MBL to elm.h0 for wcprod mods
  Reset sim_year_range to 1850-2015 in namelist_defaults for flanduse_timeseries
  Updates the long_name
  adjust artifacts to reflect recent edits
  update ghci container to restore gh/ci
  Enables output of land fraction used in dust mobilization
  update compy intel compiler
  Adjust PEs for MPAS dev-tests on Anvil
  Reduce test lengths
  Update CIME submodule with a new fix commit
  bug fix for N balance error due to spval value for land use simulation
  Explicitly set dust_emis_scheme=2 for MMF
  ...
@rljacob
Member

rljacob commented Nov 12, 2024

To summarize, this PR was reverted so that v3.0 cases would not change behavior. The fixes in this PR were later put back on master (after the maint-3.0 branch was created) in #6568.

Labels
BFB (PR leaves answers BFB), bug fix PR, River
Development

Successfully merging this pull request may close these issues.

MOSART is not stopping on "Negative channel storage found! "
10 participants