Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop running 0th time step if possible, or at least clean up its handling in history output #925

Closed
billsacks opened this issue Feb 12, 2020 · 16 comments · Fixed by #2084
Closed
Assignees
Labels
bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations

Comments

@billsacks
Copy link
Member

billsacks commented Feb 12, 2020

Within CESM, we run CTSM for an extra 0th timestep at the start of a simulation. If I remember correctly, this was done because of some need of CAM's that I either never understood or have forgotten.

I'm thinking that we should not do this when running CTSM via LILAC. This could be at least slightly important if you're trying to start at a very exact time for a short weather prediction run.

In fact, it's questionable whether we should still be doing this in CESM.

@mvertens do you agree that we should ideally change this behavior to avoid the 0th timestep when running with LILAC?

[(2020-07-14) I have removed this from the LILAC project because it is potentially a broader issue than just LILAC, but I am keeping a reference to it in that project.]

@billsacks billsacks added the bug something is working incorrectly label Feb 12, 2020
@mvertens
Copy link

mvertens commented Feb 13, 2020 via email

@billsacks billsacks changed the title Do not run 0th timestep when running CTSM via LILAC Do not run 0th timestep when running CTSM via LILAC - and maybe always Jul 14, 2020
@billsacks
Copy link
Member Author

billsacks commented Jul 14, 2020

From a discussion yesterday - @mvertens thinks that the whole 0th timestep thing may be a historical artifact that should be removed even within CESM – but this probably needs a bit more analysis.

@billsacks
Copy link
Member Author

If we don't completely remove the 0th time step, then we should at least fix history averaging / output to avoid the messy off-by-one issues that arise when doing non-monthly output: this 0th time step shifts all other history output and so makes the aggregation of history times non-intuitive.

@billsacks billsacks changed the title Do not run 0th timestep when running CTSM via LILAC - and maybe always Stop running 0th time step if possible, or at least clean up its handling in history output Nov 3, 2021
@billsacks billsacks moved this from Needs prioritization to Todo in CESM: infrastructure / cross-component SE priorities Nov 3, 2021
@billsacks
Copy link
Member Author

This came up again, and @mvertens expressed more certainty that the 0th time step is a historical artifact that should just be removed: it is a relic from when CLM was a subroutine call within CAM and CAM ran this 0th time step so CLM needed to as well. They kept it in place as they migrated CLM to its own component so that answers would stay the same across that migration, but that is far enough in the past that she feels it can (and should) be removed. It's unclear whether CAM needs to keep it or if the 0th time step can be removed on the CAM side, too, but it feels pretty clear that it can be removed on the CTSM side now.

@billsacks
Copy link
Member Author

billsacks commented Feb 2, 2022

We have a number of uses of the function is_first_step, or equivalent direct checks of nstep == 0. From a quick look through the code, it looks like most of those checks should be removed once we tackle this issue, because they seem intended to check if it is this special time step 0. However, the following two uses seem to intend to check if this is the first time step of the run – and so should be changed to check if nstep == 1 (probably by changing the implementation of is_first_step):

  • WaterStateType.F90: Restart subroutine
  • clm_driver.F90: need_glacier_initialization

Update (2022-02-03): I just added one more check of is_first_step that should be backed out when we tackle this issue: see the changes in CNPhenology in dadbc62.

Update (2022-07-13) I noticed a check of nstep /= 0 in accumulMod that will also need some thought. I think when I did my earlier search I looked for nstep == 0 but didn't look for similar things like nstep /= 0, nstep > 0, nstep >= 1, etc. We'll need to check for uses like that.

@billsacks billsacks added next this should get some attention in the next week or two. Normally each Thursday SE meeting. priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations labels Mar 29, 2022
@billsacks
Copy link
Member Author

billsacks commented Mar 29, 2022

Some other things we should do / check:

  • Make sure nstep now starts at 1 rather than 0
  • Check time_bounds on history files: monthly and non-monthly; first non-monthly and some later one(s): make sure they look correct now

Repository owner moved this from Todo ~ months to Done in CESM: infrastructure / cross-component SE priorities Mar 29, 2022
@billsacks billsacks reopened this Mar 29, 2022
Repository owner moved this from Done to Needs Prioritization in CESM: infrastructure / cross-component SE priorities Mar 29, 2022
@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 31, 2022
@slevis-lmwg
Copy link
Contributor

I'm tentatively assigning this issue to @olyson based on this morning's conversation. Assignment may change after we meet with @billsacks

@billsacks
Copy link
Member Author

billsacks commented May 23, 2023

How to start with time step 1 instead of 0? A straightforward way is probably to add a call to advance_timestep at the end of timemgr_init.

However, need to investigate logic for setting tm_first_restart_step = .false.: we may not want to do that in this first call. (We could introduce an optional argument that says to skip that if needed.)

@billsacks
Copy link
Member Author

billsacks commented May 23, 2023

How to test to make sure we have this right? @slevis-lmwg suggests: after this change, we think that an initial (non-cold-start) run should be identical to a restart / continue_run.

However, we might need to temporarily remove checks of is_first_step (or nstep == 1) to get identical results.

@billsacks
Copy link
Member Author

@olyson points out: will want to do a short test in coupled mode to make sure nothing is messed up there.

@olyson
Copy link
Contributor

olyson commented Jun 1, 2023

Created a branch for this work:

https://github.com/olyson/ctsm/tree/zerothtstep

Modified the code to start at nstep=1 by adding a call to advance_timestep at the end of timemgr_init in the clm time manager.
Changed the implementation of is_first_step to key off of nstep=1 instead of nstep=0.
Reviewed the code for instances of nstep == 0, nstep == 1, nstep /= 0, nstep > 0, etc (with and without spaces), and is_first_step, and made changes as required.
Verified that nstep now starts at 1 rather than 0 (evident in the lnd log).
Ran the clm_short test suite, plus one CLM51 test. All pass except for baseline comparison as expected.
I checked the time_bounds on monthly, daily, and timestep (1/2 hourly) history files:

File Type Old New
Monthly 0, 31 0, 31
Daily(1) 0, 0 0, 1
Daily(2) 0, 1 1, 2
Timestep(1) 0, 0 0, 0.0208
Timestep(2) 0, 0.0208 0.0208, 0.0417

The changes to time_bounds for daily and timestep files look appropriate. The monthly file doesn't change. I think this is ok because the old monthly files didn't include nstep=0 in the average because of this code in hist_htapes_wrapup:

! Skip nstep=0 if monthly average

if(nstep==0 .and. tape(t)%nhtfrq==0) then
cycle
end if

Note that the daily and timestep file names also change, e.g., for the first file in a run:

File Type Old New
Daily 2000-01-01-00000 2000-01-02-00000
Timestep 2000-01-01-00000 2000-01-01-01800

Per @slevis-lmwg suggestion that an initial run should now be bfb with a restart when using the same initial/restart file, I set up a pair of simulations to demonstrate that. These were not bfb initially. Per @billsacks suggestion that we might need to temporarily remove checks of is_first_step (or nstep==1) to get identical results, I found that two SourceMods were required for the startup/initial simulation, one in lnd_comp_nuopc to remove an nstep==1 statement related to doalb and one in WaterStateType to remove a is_first_step check for soil water bounds adjustment (actually accomplished by changing this to .not.is_first_step). Initially this resulted in bfb for all monthly average history fields except for methane related variables, and TWS, VOLR, VOLRMCH. However, when I prescribed the MOSART initial file to be the same in both runs, these variables were now bfb. Makes sense since TWS includes river water and the methane variables depend on TWS (through the inundated fraction I believe).

Some open issues/questions we can discuss at some point:

  1. Some testing of this in coupled (CAM/CLM) mode is required.
  2. RTM and MOSART have separate time managers that require the same changes made to the clm time manager. I've made these changes in my code base for my simulations, but I assume RTM and MOSART branches will need to be created and the changes committed to those branches. It's not obvious to me what time manager mizuRoute uses.
  3. So far I haven't touched lilac, unit tests, and mct. I assume we won't bother with mct, but it looks like some changes will be required for lilac and unit tests.
  4. My commit to my branch includes comments about my rationale for why I thought changes were or weren't necessary, these should be removed at some point.
  5. There was a question about whether we should be setting tm_first_restart_step = .false.. I don't quite understand the potential issue there so I haven't made any changes to that.

@olyson
Copy link
Contributor

olyson commented Jun 15, 2023

MOSART and RTM branches:

https://github.com/olyson/MOSART/tree/zerothtstep
https://github.com/olyson/RTM/tree/zerothtstep

@billsacks
Copy link
Member Author

@olyson - just catching up on some stuff from the last couple of weeks. This is amazing - thank you very much for doing this!!! Awesome work and I'm really happy to hear that things went smoothly according to @slevis-lmwg 's suggestion for testing!

I'll be out again next week, but would be happy to be part of a discussion after that to go through the final questions. Wow, this just makes me so happy :-)

@olyson olyson moved this from Todo to In Progress in CTSM tasks for supporting coupled simulations Jul 10, 2023
@billsacks billsacks moved this from Todo ~ months to In Progress in CESM: infrastructure / cross-component SE priorities Jul 25, 2023
@billsacks
Copy link
Member Author

However, need to investigate logic for setting tm_first_restart_step = .false.: we may not want to do that in this first call. (We could introduce an optional argument that says to skip that if needed.)

@olyson and I looked into this today. It looks like we do not need to make any changes to the logic for tm_first_restart_step: I was concerned that, with the changes @olyson is making, tm_first_restart_step will now be set to false in initialization rather than waiting until the end of the first time step to do so. But looking more closely at call sequences, @olyson 's changes (to call advance_timestep from timemgr_init) only apply in a startup run (because that's the only time when timemgr_init is called), and tm_first_restart_step is always false in a startup run anyway. i.e., the changes in @olyson 's branch should not interfere with the logic for tm_first_restart_step.

@olyson
Copy link
Contributor

olyson commented Aug 1, 2023

@billsacks and I met on 7-25-23.

  1. As noted above, no changes needed to tm_first_restart_step.
  2. Some changes to unit testing were needed. @billsacks made those changes and I merge them in.
  3. I made changes to lilac in src/cpl/lilac/lnd_comp_esmf.F90, similar to src/cpl/nuopc/lnd_comp_esmf.F90, and ran LILACSMOKE_D_Ld2.f10_f10_mg37.I2000Ctsm50NwpSpAsRs.cheyenne_intel.clm-lilac. PASS except comparison to baseline as expected.
  4. I checked out cam6_3_119 and updated to my zerothtstep branches for CTSM and MOSART. Repeated analysis I did for the CAM radiation reordering investigation (Investigate effects of CAM radiation reordering on CAM/CLM radiation coupling #1782). As expected, by removing the 0th timestep, the number of time steps where solar radiation and albedo are not synched up is reduced by 1. Solar imbalance is reduced by 1 timestep as well. I extended these simulations (0th timestep and control) for a few years and ran diagnostics. Differences are fairly small:

https://webext.cgd.ucar.edu/FHIST/cam6ctsm51sp_cam6_3_119zerothtstep_1deg_hist/lnd/cam6ctsm51sp_cam6_3_119zerothtstep_1deg_hist.1980_1984-cam6ctsm51sp_cam6_3_119_1deg_hist.1980_1984/setsIndex.html

On the other hand, the atm log file indicates that CAM is still running the 0th time step, so similar changes will need to be made to CAM code, e.g., src/utils/time_manager.F90

@billsacks
Copy link
Member Author

Awesome - this is all great, @olyson - thank you!!!

Can you please open a PR for this? Then I think I said I can do the final testing - for MOSART and RTM as well as CTSM - and I'm still happy to do that.

@ekluzek ekluzek moved this to In progress in CTSM: High priority Aug 15, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done (or no longer holding things up) in CESM: infrastructure / cross-component SE priorities Jan 14, 2025
slevis-lmwg added a commit to samsrabin/CTSM that referenced this issue Jan 14, 2025
Stop running 0th time step

For consistency with CAM.
Fixes ESCOMP#925
Changes answers more than roundoff, same climate.

slevis resolved conflicts:
src/main/histFileMod.F90
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations
Projects
Status: Done (or no longer holding things up)
Status: In progress
Development

Successfully merging a pull request may close this issue.

4 participants