Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post data-assimilation processing broken with multi-driver #389

Closed
gold2718 opened this issue May 17, 2018 · 15 comments
Closed

Post data-assimilation processing broken with multi-driver #389

gold2718 opened this issue May 17, 2018 · 15 comments
Assignees
Labels
bug something is working incorrectly
Milestone

Comments

@gold2718
Copy link

While I have not tested this in CTSM, CAM has a similar issue. When using multiple instances without multi-driver, num_inst_lnd == NINST_LND and each instance has a unique instance number, inst_index. However, when multi-driver is on, num_inst_lnd == 1 but each instance still has a unique inst_index.
This becomes a problem in line 686 of lnd_comp_mct.F90:
https://github.com/ESCOMP/ctsm/blob/e739c2c08ca84dcc71d5b9da58051ac4c07ccf59/src/cpl/lnd_comp_mct.F90#L686
which walks off the end of the lnd_resume array. I suggest making the following fix:

 if ( len_trim(lnd_resume(MIN(num_inst_lnd,inst_index))) == 0 )then
@billsacks billsacks added the bug something is working incorrectly label May 18, 2018
@billsacks
Copy link
Member

@gold2718 does this need to be fixed for the cesm2 release?

@gold2718
Copy link
Author

@billsacks, It is needed to support usage of DART with CLM.
The code is also incorrect for multi-instance, multi-driver runs in that it is making an invalid array reference (e.g., lnd_resume(3) when size(lnd_resume) == 1). This should cause any debug run to crash along with unpredictable behavior in non-debug runs.
It is your call but I suggest fixing it in the release branch.

@ekluzek
Copy link
Collaborator

ekluzek commented May 18, 2018

@gold2718 can you point me to a DART case for CESM2.0? Or let me know what it would look like? My current test case is here..

cime_config/testdefs/testmods_dirs/clm/pauseResume

and it does the following...

# Test PAUSE/RESUME functionality which is used for Data Assimulation (DA)
./xmlchange LND_PAUSE_ACTIVE=TRUE
./xmlchange LND_DATA_ASSIMILATION=TRUE
./xmlchange CPL_PAUSE_ACTIVE=TRUE
./xmlchange CPL_DATA_ASSIMILATION=TRUE
./xmlchange PAUSE_OPTION=nhours
./xmlchange PAUSE_N=6
# Set calendar, just because Gregorian is ussually used for DA
./xmlchange CALENDAR=GREGORIAN

The test I run is.

SMS_N2_D.09_g16.I2000Clm50SpGs.cheyenne_intel.clm/pauseResume

It looks like you are suggesting I add this setting?

MULTI_DRIVER=TRUE

Is that correct?

@gold2718
Copy link
Author

I think that there is a case in /glade/p/work/raeder/Exp/No_sourcemods_test1 but it is an atmosphere test. You would have to ask someone else for a DART/CLM test case.
As for your settings, do not use any pause settings and I do not see a need for any CPL settings.

To test this, there are three types of cases:

  1. NINST_LND = 1: num_inst_lnd = 1, and inst_index = 1
  2. NINST_LND = n, MULTI_DRIVER = FALSE: num_inst_lnd = n, and 1 <= inst_index <= n
  3. NINST_LND = n, MULTI_DRIVER = TRUE: num_inst_lnd = 1, and 1 <= inst_index <= n

You can run these cases without any DART involvement. If you want to run data assimilation cycles, use scripts/data_assimilation/da_no_data_mod.sh:

./xmlchange DATA_ASSIMILATION_SCRIPT=<cime_root>/scripts/data_assimilation/da_no_data_mod.sh
./xmlchange DATA_ASSIMILATION_CYCLES=2

You can also copy da_no_data_mod.sh and modify it DWAV ==> CLM which would set LND_DATA_ASSIMILATION=TRUE after the first cycle.

Again, do NOT use any PAUSE settings.

@ekluzek ekluzek added this to the cesm2 milestone May 22, 2018
@ekluzek
Copy link
Collaborator

ekluzek commented May 24, 2018

OK, I have a test case that correctly fails as expected as follows...

2407:forrtl: severe (408): fort: (2): Subscript #1 of the array LND_RESUME has value 2 which is greater than the upper bound of 1
2407:
2407:Image              PC                Routine            Line        Source
2407:cesm.exe           00000000041584B6  Unknown               Unknown  Unknown
2407:cesm.exe           000000000084AC2D  lnd_comp_mct_mp_l         686  lnd_comp_mct.F90
2407:cesm.exe           0000000000844747  lnd_comp_mct_mp_l         262  lnd_comp_mct.F90
2407:cesm.exe           000000000045362E  component_mod_mp_         267  component_mod.F90
2407:cesm.exe           0000000000419BB1  cime_comp_mod_mp_        1181  cime_comp_mod.F90
2407:cesm.exe           0000000000449583  MAIN__                     92  cime_driver.F90
2407:cesm.exe           000000000040819E  Unknown               Unknown  Unknown
2407:libc-2.19.so       00002AAAB04DCB25  __libc_start_main     Unknown  Unknown
2407:cesm.exe           00000000004080A9  Unknown               Unknown  Unknown

@ekluzek ekluzek self-assigned this May 24, 2018
@ekluzek
Copy link
Collaborator

ekluzek commented May 24, 2018

The proposed fix now works, and I have a test case that does a good job of exercising this...

DAE_N2_D.f09_g16.I2000Clm50BgcCropGs.cheyenne_intel.clm-DA_multidrv

PASS DAE_N2_D.f09_g16.I2000Clm50BgcCropGs.cheyenne_intel.clm-DA_multidrv RUN time=630
PASS DAE_N2_D.f09_g16.I2000Clm50BgcCropGs.cheyenne_intel.clm-DA_multidrv COMPARE_base_da

@ekluzek
Copy link
Collaborator

ekluzek commented May 24, 2018

There's another missing check that I needed to add as well. Here's the difference with that added in.

diff --git a/src/biogeochem/CNVegetationFacade.F90 b/src/biogeochem/CNVegetationFacade.F90
index 10c8164..9f555f6 100644
--- a/src/biogeochem/CNVegetationFacade.F90
+++ b/src/biogeochem/CNVegetationFacade.F90
@@ -985,6 +985,7 @@ subroutine BalanceCheck(this, bounds, num_soilc, filter_soilc, &
     ! Should only be called if use_cn is true
     !
     ! !USES:
+    use clm_time_manager   , only : get_nstep_since_startup_or_lastDA_restart_or_pause
     !
     ! !ARGUMENTS:
     class(cn_vegetation_type)               , intent(inout) :: this
@@ -995,15 +996,15 @@ subroutine BalanceCheck(this, bounds, num_soilc, filter_soilc, &
     type(soilbiogeochem_nitrogenflux_type)  , intent(inout) :: soilbiogeochem_nitrogenflux_inst
     !
     ! !LOCAL VARIABLES:
-    integer              :: nstep                   ! time step number
+    integer              :: DA_nstep                   ! time step number
 
     character(len=*), parameter :: subname = 'BalanceCheck'
     !-----------------------------------------------------------------------
 
-    nstep = get_nstep()
-    if (nstep < 2 )then
+    DA_nstep = get_nstep_since_startup_or_lastDA_restart_or_pause()
+    if (DA_nstep < 2 )then
        if (masterproc) then
-          write(iulog,*) '--WARNING-- skipping CN balance check for first timestep'
+          write(iulog,*) '--WARNING-- skipping CN balance check for first timesteps after startup or data assimilation'
        end if
     else
 

@timhoar
Copy link

timhoar commented May 25, 2018

My experience with hash e739c2c
is that for a RUN_TYPE = hybrid and
${cesmroot}/cime/scripts/create_newcase (snip unrelated stuff) --ninst 80 --multi-driver

the lnd_in:finidat is getting created without the instance string. I am not running DART in any capacity,
just trying to continue a free run I had in hand.
The case directory is: /glade/p/work/thoar/cases/ctsm/clm5_f09_spinup80
The run directory is: /glade2/scratch2/thoar/ctsm/clm5_f09_spinup80/run

The script that created the case is called /glade/p/work/thoar/cases/ctsm/clm5_f09_spinup80/CLM5_spinup_hybrid.original

@ekluzek
Copy link
Collaborator

ekluzek commented May 25, 2018

@timhoar it looks like the REFDATE is wrong in the script (2001-07-01 instead of 2001-01-01) that created the softlinks to the files, hence the files don't actually exist. I verified that there is code in place that searches for files with the instance name first, and then uses it without if not. So it should work if the REFDATE is correct.

@timhoar
Copy link

timhoar commented May 25, 2018

While that may be, I do not believe that is the problem.

The refdate should be 2001-07-01, the links to the files should be correct - I will check into why the ultimate targets are nonexistent. However:
The problem is that the every lnd_in_xxxx specification of finidat is 'finidat = 'clm5_f09_spinup80.clm2.r.2001-07-01-00000.nc' which does not have an instance number in it.

@timhoar
Copy link

timhoar commented May 25, 2018

I have no idea what is going on with that case ... I pared it back to 4 instances and added the missing
part of the link target and now the lnd_in_xxxx filenames are correct. Question: if the 'expected' finidat filename does not exist (as in the filename is wrong) - does a default filename get used? One without an instance number, perhaps?

The new, tiny case is: /glade/p/work/thoar/cases/ctsm/clm5_f09_freerun4

@ekluzek
Copy link
Collaborator

ekluzek commented May 25, 2018

@timhoar yes exactly as it's coded now it will only add an instance number in, if it can find a file in the RUNDIR that matches the REFCASE,REFDATE, and instance number, otherwise it will default to the form without the instance number (even if that file doesn't currently exist). So those files have to be in place and correctly named for it to work.

@timhoar
Copy link

timhoar commented May 29, 2018 via email

@ekluzek
Copy link
Collaborator

ekluzek commented May 29, 2018

@timhoar it makes sense that this strategy would be awkward for you. The same strategy is used in other components (and I think all other components, but know for sure, cam, mosart, and rtm) The way around this would be to have another flag that triggers what form should be used (multi-instance or single instance form). But, that would require some extra infrastructure. And you'd probably want it across all components. That's not something we can do now, but it you'd like it for the future, you could open a request about it. And we could discuss adding it in. I just think that if we do we should coordinate it across all components.

@ekluzek
Copy link
Collaborator

ekluzek commented Jun 12, 2018

Fixed in clm5.0.dev013

@ekluzek ekluzek closed this as completed Jun 12, 2018
billsacks pushed a commit to billsacks/ctsm that referenced this issue Feb 22, 2019
AGonzalezNicolas pushed a commit to HPSCTerrSys/clm5_0 that referenced this issue Jun 27, 2024
AGonzalezNicolas pushed a commit to HPSCTerrSys/clm5_0 that referenced this issue Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly
Projects
None yet
Development

No branches or pull requests

4 participants