Stop running 0th time step #2084

Open
wants to merge 37 commits into
base: master

olyson
Contributor

@olyson olyson commented Aug 1, 2023

Description of changes

Stop running 0th time step.

Specific notes

See discussion in #925
DONE Note that "!KO" comments are still to be removed.

Contributors other than yourself, if any: @billsacks

CTSM Issues Fixed (include github issue #): #925

Are answers expected to change (and if so in what way)? Yes, more than roundoff.

Any User Interface Changes (namelist or namelist defaults changes)? No

Testing performed, if any:
See discussion in #925

@billsacks
Member

Thanks a lot for this work, @olyson !

Are there any pieces of this that you feel warrant a careful review, i.e., that you'd like someone else to think about? If so, I would probably ask someone else to do that so that I can stay focused on some ESMF work for now. Or are you comfortable enough with the changes you made that a quick look-over should be sufficient?

Can you please go ahead and remove the !KO comments when you get a chance?

@billsacks billsacks requested review from ekluzek and removed request for billsacks August 3, 2023 16:23
@billsacks billsacks assigned olyson and unassigned billsacks Aug 3, 2023
@billsacks
Member

From discussion today: @olyson feels reasonably confident that these changes are correct.

@samsrabin would like us to run ctsm_sci - or at least his new test - to make sure that this change doesn't break that.

@billsacks billsacks assigned ekluzek and unassigned olyson Aug 3, 2023
@olyson
Contributor Author

olyson commented Aug 3, 2023

We agreed to leave my comments in for now to help @ekluzek perform his review and understand the reasoning behind the changes, then remove them after that.

@olyson
Contributor Author

olyson commented Aug 3, 2023

@olyson will run the full test suite on this before @ekluzek 's review.

Member

@billsacks billsacks left a comment

I just skimmed through some pieces of this (not a full review, but just some selective checks). Thank you very much for your careful work on this @olyson and especially for the comments describing your reasoning!

I did a quick check that the changes here are consistent with some specific things I raised in #925. One question from skimming back through the comments in #925: do you remember whether you did the searches through the code that I mentioned in #925 (comment)? Quoting that comment:

I think when I did my earlier search I looked for nstep == 0 but didn't look for similar things like nstep /= 0, nstep > 0, nstep >= 1, etc. We'll need to check for uses like that.
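A minimal sketch of the kind of search described in the quote above, written in Python for illustration. The pattern, the function name `find_nstep_comparisons`, and the sample lines are all hypothetical and not taken from the CTSM source; the comment stripping is naive and ignores `!` inside string literals.

```python
import re

# Matches comparisons of nstep against the literals 0 or 1, in both the
# symbolic (==, /=, >, >=, <, <=) and legacy (.eq., .ne., ...) Fortran forms.
NSTEP_COMPARISON = re.compile(
    r"\bnstep\s*(==|/=|>=|<=|>|<|\.(?:eq|ne|gt|ge|lt|le)\.)\s*[01]\b",
    re.IGNORECASE,
)

def find_nstep_comparisons(source_text):
    """Return (line_number, line) pairs whose code part compares nstep to 0 or 1."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        code = line.split("!", 1)[0]  # drop trailing Fortran comments (naive)
        if NSTEP_COMPARISON.search(code):
            hits.append((number, line.strip()))
    return hits

# Hypothetical sample lines, for illustration only.
sample = """\
if (nstep == 0) return
if (nstep /= 0) call update()
if (nstep > 0 .and. is_end_curr_day()) then
if (nstep .ge. 1) then
nstep = get_nstep()   ! if nstep == 0 this used to be skipped
"""
matches = find_nstep_comparisons(sample)
for number, line in matches:
    print(number, line)
```

Because of the leading word boundary, this pattern does not match derived names such as `effective_nstep`; those would need their own search.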

Review thread: src/biogeochem/CNPhenologyMod.F90
@olyson
Contributor Author

olyson commented Aug 4, 2023

Full aux_clm test suite results (output of ./parse_cime.cs.status tests_0803-171434ch/cs.status -s):

Test summary
176 Total tests
174 Tests passed
172 Tests compare different to baseline
0 Tests are new where there is no baseline
1 Tests pending
1 Tests failed

The pending test is EXPECTED as is the failed test.
The two tests that did not compare different to baseline are:
FUNITCTSM_P1x1.f10_f10_mg37.I2000Clm50Sp.cheyenne_intel
PFS_Ld10_PS.f19_g17.I2000Clm50BgcCrop.cheyenne_intel

I'm not familiar with those two tests. Maybe someone will know why those might pass? Otherwise, I will look into it.

@billsacks
Member

Thanks @olyson !

It makes sense that the FUNIT test doesn't show BASELINE comparison failures: that is a weird test that just runs the unit tests, so doesn't actually do baseline comparisons.

As for the PFS test: I think it might not actually produce any history files (by design)... so again, it makes sense that there are no BASELINE failures for that since there is nothing to compare.

@olyson
Contributor Author

olyson commented Aug 4, 2023

Yep, I don't see any history files for the PFS test, thanks.

Collaborator

@ekluzek ekluzek left a comment

My main thought here is that the commented-out code and the !KO-type comments can be removed now.

There are also comments such as "I don't think this is needed anymore" or "I think this is still needed". I think it would be worth doing some testing to check whether those are true. We can use the code as it is now as the baseline and make sure it gives identical answers when some of these things are changed. @slevis-lmwg and @olyson, thoughts on doing that?

Review threads (all but the lnd_comp_esmf.F90 thread are marked outdated):
src/biogeochem/CNPhenologyMod.F90
src/biogeochem/CNVegetationFacade.F90
src/biogeochem/CropType.F90
src/biogeophys/WaterStateType.F90
src/cpl/lilac/lnd_comp_esmf.F90
src/main/accumulMod.F90
src/main/accumulMod.F90
src/main/clm_driver.F90
Collaborator

@ekluzek ekluzek left a comment

Ahhh, @slevis-lmwg already took care of the !KO comments. So the only remaining question from my review is whether we should validate some of those comments along the lines of "I think this is still needed..." or "I think this isn't needed"...

@slevis-lmwg slevis-lmwg removed the blocked: dependency Wait to work on this until dependency is resolved label Nov 13, 2024
@slevis-lmwg
Contributor

slevis-lmwg commented Nov 13, 2024

Keith recommends:

  • Generating new baselines and rerunning the mosart/rtm test suites, because he expects no diffs from the code changes to those components.
  • An F2000 simulation to confirm that all of this works correctly in coupled mode now that CAM has also eliminated the 0th time step:
    Updated my cesm to cesm3_0_alpha05a.
    ./create_newcase --compset 2000_CAM70_CLM60%SP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV --res ne30pg3_t232 --case /glade/u/home/slevis/cases_LMWG_dev/f2000.ne30_t232.SP --run-unsupported
    ./case.build currently returns this error (even after running conda activate ctsm_pylib, as the Forums suggested):
    ERROR: Cannot modify case, read_only. Case must be opened with read_only=False and can only be modified within a context manager
    Creating the case from Cecile's checkout of cesm3_0_beta04 WORKED, so now I need to use the case's /SourceMods.
    BUT it complicates things that this tag points to ctsm5.3.007, mosart1.1.02, rtm1_0_80.
    I have now cloned my own checkout of cesm3.0-alphabranch, which points to the same component versions as Cecile's.

@@ -595,7 +595,7 @@ subroutine update_accum_field_timeavg(this, level, nstep, field)

     do k = begi,endi
        effective_nstep = nstep - this%ndays_reset_shifted(k,level)
-       time_to_reset = (mod(effective_nstep,this%period) == 1 .or. this%period == 1) .and. effective_nstep /= 0
+       time_to_reset = mod(effective_nstep,this%period) == 1 .or. this%period == 1
Contributor

@samsrabin this is how I handled the new version of the code, as I mentioned in my question to you.

Collaborator

@samsrabin samsrabin Nov 13, 2024

  • I'm pretty sure this change should be reverted. effective_nstep refers not to the number of timesteps since run start/restart, but rather to the number of timesteps since the accumulator field was initialized/reset.

However, before you revert the change, could you check whether the unit tests in test_accumul.pf pass? If so, I think that indicates a gap in the testing coverage.
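To make the stakes of this one-line change concrete, here is a small Python sketch comparing the two variants of the `time_to_reset` condition from the diff: the one with the `effective_nstep /= 0` guard and the one without. It is an illustrative translation, not the CTSM code. The function names are hypothetical, `fortran_mod` mimics Fortran's `MOD` (result takes the sign of the dividend, unlike Python's `%`), and the scanned ranges are arbitrary.

```python
def fortran_mod(a, p):
    """Integer remainder with the sign of the dividend, like Fortran's MOD.
    Python's % takes the sign of the divisor, so it is not used directly."""
    r = abs(a) % abs(p)
    return r if a >= 0 else -r

def time_to_reset_with_guard(effective_nstep, period):
    # Variant that keeps the effective_nstep /= 0 check.
    return (fortran_mod(effective_nstep, period) == 1 or period == 1) and effective_nstep != 0

def time_to_reset_without_guard(effective_nstep, period):
    # Variant with the effective_nstep /= 0 check dropped.
    return fortran_mod(effective_nstep, period) == 1 or period == 1

# Scan a small, arbitrary range of inputs and collect every case
# where the two variants disagree.
differences = [
    (e, p)
    for p in range(1, 11)
    for e in range(-5, 51)
    if time_to_reset_with_guard(e, p) != time_to_reset_without_guard(e, p)
]
print(differences)  # -> [(0, 1)]
```

Within the scanned range, the variants disagree only when `effective_nstep` is 0 and the period is 1. Whether that case can occur depends on `effective_nstep` being counted from the accumulator's own reset rather than from run start, which is the point raised above.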

Contributor

Thank you @samsrabin I will revert the change.
I need to request a tutorial on running test_accumul.pf.

Collaborator

Oh, it's just run as part of the unit tests. So if those are passing then I need to refine them.

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 16, 2024

derecho testing
PASS python testing
PASS build-namelist_test.pl

OK ./run_sys_tests -s mosart -c mosart1.1.04-ctsm5.3.012_hist_time_mid -g mosart1.1.05-ctsm5.3.012_zerothtstep
OK ./run_sys_tests -s rtm -c rtm1_0_82-ctsm5.3.012_hist_time_mid -g rtm1_0_83-ctsm5.3.012_zerothtstep
OK? ./run_sys_tests -s aux_clm -c ctsm5.3.012_hist_time_mid -g ctsm5.3.012_zerothtstep

izumi testing

OK ./run_sys_tests -s mosart -c mosart1.1.04_ctsm5.3.012_hist_time_mid -g mosart1.1.05_ctsm5.3.012_zerothtstep
OK ./run_sys_tests -s aux_clm -c ctsm5.3.012_hist_time_mid -g ctsm5.3.012_zerothtstep

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 18, 2024

A mosart nvhpc test was introduced in mosart1.1.03 and has been failing ever since in the build phase:
SMS_D_Ld5.f10_f10_mg37.I1850Clm60Sp.derecho_nvhpc.mosart-default SHAREDLIB_BUILD
I have attributed the failure to issue #1733 and have ignored it. I should probably add EXPECTED FAILURE to it.

DPFCT_ROCK is different in the base.cprnc.out file of
ERI_D.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam7LndTuningModeLDust.GC.1115-165932de_int
The formula and comments in the code suggest that this is a constant:

this%dpfct_rock_patch(p) = 1.0_r8 - ( log(this%prigent_roughness_stream%prigent_rghn(g)*0.01_r8/z0s) &
                            / log(b1 * (X/z0s)**b2 ) )

Nothing else is different, so I'm guessing this constant needs a correction at initialization.
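As a rough illustration of why the quoted expression looks like a constant, here is a Python transcription of it as a pure function. The function name and argument names are assumptions that mirror the Fortran symbols, and the numeric inputs below are hypothetical placeholders, not CTSM values.

```python
import math

def dpfct_rock(prigent_rghn, z0s, b1, b2, x):
    """Transcription of the quoted expression; 0.01 is the factor that
    multiplies prigent_rghn in the quoted Fortran line."""
    return 1.0 - math.log(prigent_rghn * 0.01 / z0s) / math.log(b1 * (x / z0s) ** b2)

# Hypothetical inputs, chosen only so that both logarithms are defined.
inputs = dict(prigent_rghn=0.5, z0s=1.0e-4, b1=0.7, b2=0.8, x=10.0)

first = dpfct_rock(**inputs)
second = dpfct_rock(**inputs)
print(first == second)

# Changing the roughness input changes the result.
perturbed = dpfct_rock(**{**inputs, "prigent_rghn": 0.6})
print(perturbed != first)
```

The expression depends only on its inputs. If those are time-invariant, a difference in the output would have to come either from a difference in one of the inputs or from the stored value not having been produced by this computation at that point.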

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 20, 2024

Following up on the DPFCT_ROCK difference in the base.cprnc.out file.

ncdump ERI_D.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam7LndTuningModeLDust.GC.1115-165932de_int.clm2.h0.0003-01-20-00000.nc > ERI_D.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam7LndTuningModeLDust.GC.1115-165932de_int.clm2.h0.0003-01-20-00000.asc

ncdump ERI_D.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam7LndTuningModeLDust.GC.1115-165932de_int.clm2.h0.0003-01-20-00000.nc.base > ERI_D.ne30pg3_t232.I1850Clm60BgcCropG.derecho_intel.clm-clm60cam7LndTuningModeLDust.GC.1115-165932de_int.clm2.h0.0003-01-20-00000.asc.base

diff ...asc.base ...asc > dif.out
vi dif.out
239589c239589
<     0.486517470277726, 0.435177136876831, 0.411865850359667,
---
>     0.463483699076243, 0.435177136876831, 0.411865850359667,

Same diff when I compare ...asc.base to ...asc.rest (because the ...asc and ...asc.rest files are the same)
No diff when I compare ...asc.base to ...asc.hybrid (I think this means that startup and hybrid work the same)
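For reference, the two differing lines from dif.out can be compared value by value. This is a minimal Python sketch using the numbers quoted above; the helper names are hypothetical, and it assumes the `<` line comes from the .asc.base file and the `>` line from the .asc file, following the argument order of the diff command.

```python
# Lines copied from the diff output above.
base_line = "0.486517470277726, 0.435177136876831, 0.411865850359667,"
test_line = "0.463483699076243, 0.435177136876831, 0.411865850359667,"

def parse_values(line):
    """Split a comma-separated ncdump data line into floats."""
    return [float(tok) for tok in line.split(",") if tok.strip()]

def differing_positions(base, test):
    """Indices at which the two value lists are not bit-identical."""
    return [i for i, (b, t) in enumerate(zip(base, test)) if b != t]

base_vals = parse_values(base_line)
test_vals = parse_values(test_line)
positions = differing_positions(base_vals, test_vals)
for i in positions:
    print(i, base_vals[i], test_vals[i], base_vals[i] - test_vals[i])
```

Only the first value on the line differs, by roughly 0.023; the other two values are identical.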

@slevis-lmwg
Contributor

slevis-lmwg commented Nov 21, 2024

In the latest aux_clm run for #2052, which includes this branch for cumulative testing, the same test passes as far as I can tell. I posted my interpretation in that PR.

I'm starting manual restart testing here to try and troubleshoot, and to assess whether I'm misinterpreting this test failure:

  1. 2-day versus 1+1-day simulations are b4b.
  2. branch?

I have not reproduced the problem yet. Summary of my understanding of the test failure thus far:

  • Answers changed from the baseline due to removal of 0th time step (expected)
  • Answers also changed in restart (or branch) but just for one field described as constant
  • Answers in the next PR, which includes the code from this PR, didn't change and restart passed
  • So I'm inclined to ignore the answer change in restart in this PR for now
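The "2-day versus 1+1-day" check in item 1 can be sketched with a toy model in Python. Everything here is hypothetical: the update rule, the 48 steps per day, and the function names are for illustration only and have nothing to do with CTSM internals. The point is only the shape of the test: a continuous run and a run split at a restart should give bit-for-bit (b4b) identical state.

```python
def step(state, nstep):
    """Advance a toy model by one step; the update rule is purely illustrative."""
    return 0.9 * state + 0.1 * (nstep % 7)

def run(state, first_step, n_steps):
    """Run n_steps steps starting at step number first_step; return the final state."""
    for nstep in range(first_step, first_step + n_steps):
        state = step(state, nstep)
    return state

steps_per_day = 48        # hypothetical
initial_state = 1.0

# Continuous 2-day run, starting at step 1 (no 0th step).
continuous = run(initial_state, first_step=1, n_steps=2 * steps_per_day)

# 1-day run, "write restart", then a second 1-day run from that state.
checkpoint = run(initial_state, first_step=1, n_steps=steps_per_day)
restarted = run(checkpoint, first_step=steps_per_day + 1, n_steps=steps_per_day)

print(continuous == restarted)
```

In this toy, exact equality holds because the restart carries the full state and the step counter resumes where it left off. A field that is recomputed or reinitialized differently on restart would break the equality for that field alone.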

Labels
external: issue needs to be addressed elsewhere (submodule); issue here for the sake of project tracking
PR status: ready: PR: this is ready to merge in, with all tests satisfactory and reviews complete
Projects
Status: On the grill (work in progress)
Status: In progress - master/b4b-dev
Status: In Progress
6 participants