Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clean up logging #792

Merged
merged 14 commits into from
Apr 26, 2022
Merged

clean up logging #792

merged 14 commits into from
Apr 26, 2022

Conversation

rgknox
Copy link
Contributor

@rgknox rgknox commented Oct 6, 2021

Description:

This set of changes seeks to ensure that all write statements to logs, that are intended to "warn", are either:

  1. encapsulated in some type of debug clause, where the user has to set the debug level to verbose, or
  2. is followed by an endrun
  3. uses the FatesWarn() wrapper

FatesWarn() has a feature where each warning message is associated with an index. Once the specific warning has been written 100 times (or other desired value), it will stop reporting the warning. It will continue to count how many times the warning trips. It maintains these counts at the node level.

In some situations, such as the interface, we had also been requiring verbose output prior to an endrun. If the model hits an endrun prematurely, it is a failure, and thus all logging information is welcomed. So the verbose requirement was removed.

In a few situations we had been reporting errors that were fairly catastrophic without failing (such as encountering negative patch areas). So, endrun() was added to a few locations.

Fixes: #772
Fixes: #236
Fixes: #191
Fixes: #793

Collaborators:

@jkshuman

Expectation of Answer Changes:

This should not change answers

Checklist:

  • My change requires a change to the documentation.
  • I have updated the in-code documentation .AND. (the technical note .OR. the wiki) accordingly.
  • I have read the CONTRIBUTING document.
  • FATES PASS/FAIL regression tests were run
  • If answers were expected to change, evaluation was performed and provided

Test Results:

CTSM (or) E3SM (specify which) test hash-tag:

CTSM (or) E3SM (specify which) baseline hash-tag:

FATES baseline hash-tag:

Test Output:

… clause, or ensuring there is an endrun following.
@rgknox
Copy link
Contributor Author

rgknox commented Oct 6, 2021

Updated the metadata of the decompmicc parameter in the parameter file. See: ef9126a

Copy link
Contributor

@glemieux glemieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good amount of cleanup; really nice work.

I couldn't tag this directly to the line in github, but was the following overlooked or intentionally left without a debug flag check since it's only writing to the lead node?

if(logging_time .and. (is_master.eq.itrue) ) then
write(fates_log(),fmt) 'Logging Event Enacted on date: ', &
hlm_current_month,'-',hlm_current_day,'-',hlm_current_year
end if

biogeochem/EDCohortDynamicsMod.F90 Show resolved Hide resolved
@rgknox
Copy link
Contributor Author

rgknox commented Oct 7, 2021

I did intentionally leave the logging event print-outs. They should be fairly sparse, and I figure it is useful for people using the logging module to have that verification of their events. Again, don't feel strongly here, I can also ecapsulate this in a debug.

@jkshuman
Copy link
Contributor

jkshuman commented Oct 13, 2021

@rgknox this looks good and will certainly help reduce the logging event print-outs. After I went through this I looked at my logs to confirm a reduction, and this will capture some things. But there is one that is not addressed. In the CESM.log and LND.log I get a consistent "BalanceCheck, solar radiation balance error" , but from what I can tell this is outside of FATES here:
src/biogeophys/BalanceCheckMod.F90: write(iulog,*)'WARNING:: BalanceCheck, solar radiation balance error (W/m2)'

So, I guess this BalanceCheck solar radiation error warning is a separate issue then, and I can approve this PR? Should I open a separate issue for that, or would this be on the HLM side?

Example from recent LND.log:
clm: completed timestep 4397444
(shr_dmodel_readstrm) file ub: /glade/p/cesmdata/cseg/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_c160825.nc 4363
clm: completed timestep 4397445
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
nstep = 4397446
errsol = -1.125070525631600E-007
clm: completed timestep 4397446
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
nstep = 4397447
errsol = -1.602288648427930E-007
clm: completed timestep 4397447
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
nstep = 4397448
errsol = -1.805403826438123E-007
clm: completed timestep 4397448
(shr_dmodel_readstrm) file ub: /glade/p/cesmdata/cseg/inputdata/atm/datm7/NASA_LIS/clmforc.Li_2016_climo1995-2013.360x720.lnfm_Total_c160825.nc 4364
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
nstep = 4397449
errsol = -1.813866674638120E-007
clm: completed timestep 4397449
WARNING:: BalanceCheck, solar radiation balance error (W/m2)
nstep = 4397450
errsol = -2.600071979941276E-007
clm: completed timestep 4397450

Example of some of that reporting from a recent CESM.log:
107: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
107: nstep = 4397517
107: errsol = -2.104200120811583E-007
265: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
265: nstep = 4397517
265: errsol = -1.050174205374788E-007
14: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
14: nstep = 4397517
14: errsol = -1.005038257062552E-007
119: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
119: nstep = 4397517
119: errsol = -1.116209489282483E-007
280: WARNING:: BalanceCheck, solar radiation balance error (W/m2)
280: nstep = 4397517
280: errsol = -1.113486405301956E-007

@jkshuman
Copy link
Contributor

updated this previous comment to reflect that the BalanceCheck solar radiation warning happens in CESM.log and LND.log

@rgknox
Copy link
Contributor Author

rgknox commented Oct 13, 2021

I see that this messaging is like you say, on the CTSM side of the code. Those logging messages may be outside the scope of this PR, but this sounds like a potential discussion topic to bring up at the CTSM software meeting?
@ekluzek @billsacks @glemieux

@glemieux
Copy link
Contributor

That sounds like a good idea @rgknox. I added it to the agenda of tomorrow's ctsm software meeting.

@jkshuman
Copy link
Contributor

@rgknox and @glemieux sounds good. It is worth noting that this radiation imbalance is being ignored on the FATES side, so perhaps we should discuss the validity of this error tolerance on the FATES side as well. (Though I am not totally sure these are the same balance checks...)

!here we are adding a within-ED radiation scheme tolerance, and then adding the diffrence onto the albedo
!it is important that the lower boundary for this is ~1000 times smaller than the tolerance in surface albedo.
if (abs(error) > 1.e-9_r8 .and. abs(error) < 0.15_r8)then
albd_parb_out(ib) = albd_parb_out(ib) + error
!this terms adds the error back on to the albedo. While this is partly inexcusable, it is
! in the medium term a solution that
! prevents the model from crashing with small and occasional energy balances issues.
! These are extremely difficult to debug, many have been solved already, leading
! to the complexity of this code, but where the system generates occasional errors, we
! will deal with them for now.
end if
if (abs(error) > 0.15_r8)then
write(fates_log(),*) 'Large Dir Radn consvn error',error ,ib
write(fates_log(),*) 'diags', albd_parb_out(ib), ftdd_parb_out(ib), &
ftid_parb_out(ib), fabd_parb_out(ib)
write(fates_log(),*) 'lai_change',lai_change(currentpatch%ncl_p,1:numpft,1:diag_nlevleaf)
write(fates_log(),*) 'elai',currentpatch%elai_profile(currentpatch%ncl_p,1:numpft,1:diag_nlevleaf)
write(fates_log(),*) 'esai',currentpatch%esai_profile(currentpatch%ncl_p,1:numpft,1:diag_nlevleaf)
write(fates_log(),*) 'ftweight',ftweight(1,1:numpft,1:diag_nlevleaf)
write(fates_log(),*) 'cp',currentPatch%area, currentPatch%patchno
write(fates_log(),*) 'ground albedo diffuse (ib)', currentPatch%gnd_alb_dir(ib)

Copy link
Contributor

@glemieux glemieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I have one suggested change and two questions.

I also noticed that there are three current FatesWarn calls and the first index used is 2 and not 1. I'm assuming the FatesWarn call that used index=1 was removed?

main/FatesGlobals.F90 Outdated Show resolved Hide resolved
main/FatesInterfaceMod.F90 Show resolved Hide resolved
end if
! end_run
write(fates_log(), *) 'fates NL tag not recognized:',trim(tag)
!! call endrun(msg=errMsg(sourcefile, __LINE__))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to end the run if the tag isn't recognized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing 'sf_anthro_suppression_def' was added at one point with the expectation it would be used, but the usage was never implemented. Lets remove this argument pass in the next API change on ctsm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rgknox I am guessing this is a tag associated with anthropogenic fire suppression. You are correct that it is not yet in use. I think @glemieux or @slevisconsulting implemented this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup that was me 🤦. As Jackie noted its part of ESCOMP/CTSM#1536. Since this was a ctsm-side only update to prep for future work I guess should have commented out those lines and the lines later on down the file to avoid this issue?

https://github.com/ESCOMP/CTSM/blob/449345ee9a534316cba0d6b4fd696b8d57e60428/src/utils/clmfates_interfaceMod.F90#L822-L829

and

https://github.com/ESCOMP/CTSM/blob/449345ee9a534316cba0d6b4fd696b8d57e60428/src/utils/clmfates_interfaceMod.F90#L846-L850

Co-authored-by: Gregory Lemieux <7565064+glemieux@users.noreply.github.com>
@rgknox
Copy link
Contributor Author

rgknox commented Apr 22, 2022

Good point about the indexing @glemieux . I might had started using 2, assuming 1 would be a "universal" or "generic" code, but that is zero. I can change this to start with one. Also, I need to change the loop in the finalize routine where the totals are printed, to start with zero and not 1.

@rgknox
Copy link
Contributor Author

rgknox commented Apr 26, 2022

I believe all tests pass as expected.

rgknox/tests_0426-152504ch> ./cs.status.fails
0426-152504ch_gnu: 2 tests

 
0426-152504ch_int: 27 tests
    FAIL ERS_Lm12.1x1_brazil.I2000Clm50FatesCruRsGs.cheyenne_intel.clm-FatesFireLightningPopDens COMPARE_base_rest

@rgknox rgknox merged commit 55d40f1 into NGEET:master Apr 26, 2022
@rgknox rgknox mentioned this pull request Jan 27, 2023
@rgknox rgknox deleted the default-off-logs branch October 31, 2023 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants