Exact restart problem with Fates #667

Closed
ekluzek opened this issue Mar 24, 2019 · 7 comments · Fixed by #2199
Labels
bug something is working incorrectly FATES API update Changes to the FATES version that also REQUIRE an API change in CTSM

Comments

@ekluzek
Collaborator

ekluzek commented Mar 24, 2019

Brief summary of bug

Activating planthydro in FATES on fates_next_api appears to break exact restart.

General bug information

CTSM version you are using: clm5.0.dev008-90-g04e931e

Does this bug cause significantly incorrect results in the model's science? No?

Configurations affected:

ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro

The test passes when use_fates_planthydro=.false.
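For readers unfamiliar with the test names used in this issue, here is a minimal sketch that splits one into its dot-separated parts. It assumes the usual CIME layout of test, grid, compset, machine_compiler and testmods; the function name `parse_test_name` and the dictionary keys are hypothetical and are not part of CIME.

```python
def parse_test_name(name):
    """Split a dot-separated test name into labeled fields.

    Assumes the layout TEST[_MODIFIERS].GRID.COMPSET.MACHINE_COMPILER.TESTMODS.
    The keys below are illustrative labels only.
    """
    keys = ["test", "grid", "compset", "machine_compiler", "testmods"]
    fields = dict(zip(keys, name.split(".")))
    # The first field holds the test type followed by underscore-separated
    # modifiers, e.g. ERS (exact restart), D (debug build), Ld5 (5-day run).
    test_type, *modifiers = fields["test"].split("_")
    fields["test_type"] = test_type
    fields["modifiers"] = modifiers
    return fields


info = parse_test_name(
    "ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro"
)
print(info["test_type"], info["modifiers"], info["grid"], info["testmods"])
```

Read this way, the failing configuration is a 5-day debug exact-restart test on the single-point 1x1_brazil grid with the FatesHydro test modifications.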

Details of bug

Important output or errors that show the problem

cases/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf> tail  /glade/scratch/erik/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf/run/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf.cpl.hi.0001-01-06-00000.nc.base.cprnc.out
  
SUMMARY of cprnc:
 A total number of    214 fields were compared
          of which     13 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
 A total number of      0 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file2.
  diff_test: the two files seem to be DIFFERENT 
  
cases/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf> grep RMS /glade/scratch/erik/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf/run/ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro.GC.clm5d8-fna-n1chintelf.cpl.hi.0001-01-06-00000.nc.base.cprnc.out
 RMS l2x_Sl_tref                      1.3476E-07            NORMALIZED  4.5143E-10
 RMS l2x_Sl_qref                      1.4605E-10            NORMALIZED  7.8464E-09
 RMS l2x_Sl_t                         1.3957E-06            NORMALIZED  4.6585E-09
 RMS l2x_Sl_fv                        1.4396E-08            NORMALIZED  1.5546E-07
 RMS l2x_Sl_ram1                      2.7491E-05            NORMALIZED  1.7714E-07
 RMS l2x_Sl_u10                       9.0813E-08            NORMALIZED  7.9571E-08
 RMS l2x_Fall_taux                    3.6426E-10            NORMALIZED  1.7989E-07
 RMS l2x_Fall_tauy                    3.6426E-10            NORMALIZED  1.7989E-07
 RMS l2x_Fall_lat                     4.1475E-07            NORMALIZED  2.8354E-08
 RMS l2x_Fall_sen                     1.5744E-05            NORMALIZED  1.2406E-06
 RMS l2x_Fall_lwup                    8.5134E-06            NORMALIZED  1.8634E-08
 RMS l2x_Fall_evap                    1.6584E-13            NORMALIZED  2.8354E-08
 RMS l2x_Flrl_rofsur                  4.5949E-15            NORMALIZED  4.9016E-10
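
To rank fields by how far they diverge, the RMS lines from a cprnc output file can be parsed and sorted by the normalized value. This is only a sketch: the regular expression assumes the `RMS <field> <value> NORMALIZED <value>` layout shown above, and the helper names are hypothetical.

```python
import re

# Matches lines such as:
#  RMS l2x_Sl_tref    1.3476E-07    NORMALIZED  4.5143E-10
RMS_LINE = re.compile(r"^\s*RMS\s+(\S+)\s+(\S+)\s+NORMALIZED\s+(\S+)\s*$")


def parse_rms_lines(text):
    """Return (field, rms, normalized_rms) tuples for every RMS line in text."""
    rows = []
    for line in text.splitlines():
        match = RMS_LINE.match(line)
        if match:
            rows.append(
                (match.group(1), float(match.group(2)), float(match.group(3)))
            )
    return rows


def rank_by_normalized(rows):
    """Sort rows so the largest normalized RMS difference comes first."""
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Three of the lines quoted above, used as sample input.
sample = """\
 RMS l2x_Sl_tref                      1.3476E-07            NORMALIZED  4.5143E-10
 RMS l2x_Fall_sen                     1.5744E-05            NORMALIZED  1.2406E-06
 RMS l2x_Flrl_rofsur                  4.5949E-15            NORMALIZED  4.9016E-10
"""
worst = rank_by_normalized(parse_rms_lines(sample))[0]
print(worst[0], worst[2])
```

Across the 13 fields listed above, the largest normalized difference is l2x_Fall_sen at 1.2406E-06.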
@rgknox
Collaborator

rgknox commented Mar 24, 2019

@ekluzek, this is a known issue; we do not get exact restarts on fates_next_api with NGEET/fates master either.

@ekluzek
Collaborator Author

ekluzek commented Mar 24, 2019

Glad it's a known issue. Note that some restart tests pass:

PASS ERS_D_Ld3.f19_g16.I2000Clm50FatesCruGs.cheyenne_gnu.clm-FatesColdDef COMPARE_base_rest
PASS ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruGs.cheyenne_intel.clm-Fates COMPARE_base_rest
PASS ERS_D_Ld3.f19_g16.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesColdDef COMPARE_base_rest
PASS ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesPPhys COMPARE_base_rest
PASS ERS_D_Ld5.f19_g16.I2000Clm50BgcCruGs.cheyenne_intel.clm-default COMPARE_base_rest
PASS ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesColdDef COMPARE_base_rest
PASS ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesNoFire COMPARE_base_rest
PASS ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesST3 COMPARE_base_rest
PASS ERS_Ld60.f45_f45_mg37.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesLogging COMPARE_base_rest

@ekluzek
Collaborator Author

ekluzek commented Mar 25, 2019

Note: a related issue on NGEET/fates is NGEET/fates#315.

@ekluzek
Collaborator Author

ekluzek commented Mar 25, 2019

Note that this test also fails on hobart_nag:

ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.hobart_nag.clm-FatesHydro

0 : scount()= 1
box_rearrange::compute_counts:: myrank= 0 : recv indices from 0 count= 1
0 : scount()= 17
box_rearrange::compute_counts:: myrank= 0 : recv indices from 0 count= 17
Runtime Error: *** Arithmetic exception: Floating invalid operation - aborting
/fs/cgd/data0/erik/ctsm_fates/src/fates/biogeophys/FatesPlantHydraulicsMod.F90, line 3666: Error occurred in FATESPLANTHYDRAULICSMOD:DFLCGSDPSI_FROM_PSI
/fs/cgd/data0/erik/ctsm_fates/src/fates/biogeophys/FatesPlantHydraulicsMod.F90, line 3329: Called by FATESPLANTHYDRAULICSMOD:HYDRAULICS_1DSOLVE
/fs/cgd/data0/erik/ctsm_fates/src/fates/biogeophys/FatesPlantHydraulicsMod.F90, line 2816: Called by FATESPLANTHYDRAULICSMOD:HYDRAULICS_BC
/fs/cgd/data0/erik/ctsm_fates/src/fates/biogeophys/FatesPlantHydraulicsMod.F90, line 179: Called by FATESPLANTHYDRAULICSMOD:HYDRAULICS_DRIVE
/fs/cgd/data0/erik/ctsm_fates/src/utils/clmfates_interfaceMod.F90, line 2411: Called by CLMFATESINTERFACEMOD:WRAP_HYDRAULICS_DRIVE
/fs/cgd/data0/erik/ctsm_fates/src/biogeophys/CanopyFluxesMod.F90, line 1270: Called by CANOPYFLUXESMOD:CANOPYFLUXES
/fs/cgd/data0/erik/ctsm_fates/src/main/clm_driver.F90, line 543: Called by CLM_DRIVER:CLM_DRV
/fs/cgd/data0/erik/ctsm_fates/src/cpl/lnd_comp_mct.F90, line 451: Called by LND_COMP_MCT:LND_RUN_MCT
/fs/cgd/data0/erik/ctsm_fates/cime/src/drivers/mct/main/component_mod.F90, line 724: Called by COMPONENT_MOD:COMPONENT_RUN
/fs/cgd/data0/erik/ctsm_fates/cime/src/drivers/mct/main/cime_comp_mod.F90, line 2447: Called by CIME_COMP_MOD:CIME_RUN
/fs/cgd/data0/erik/ctsm_fates/cime/src/drivers/mct/main/cime_driver.F90, line 133: Called by CIME_DRIVER
[h018.cgd.ucar.edu:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)

@ekluzek
Collaborator Author

ekluzek commented Mar 25, 2019

The line that the nag traceback points to (FatesPlantHydraulicsMod.F90, line 3666) is this:

    dflcgsdpsi = -1._r8 * (1._r8 + (lwp/p50_gs(FT))**avuln_gs(FT))**(-2._r8) * &
                          avuln_gs(FT)/p50_gs(FT)*(lwp/p50_gs(FT))**(avuln_gs(FT)-1._r8)
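
To make the failure mode easier to reason about, here is a Python transcription of that expression. It has the form of the derivative of 1/(1 + (lwp/p50)**avuln) with respect to lwp; whether FATES defines its conductance curve exactly that way is an assumption, and `flc_gs`, `invalid_power` and the sample values are hypothetical. One common way for such an expression to raise a floating invalid operation is a NaN input or a negative base raised to a non-integer power, but this sketch does not establish that either was the cause here.

```python
import math


def flc_gs(lwp, p50, avuln):
    """Assumed curve whose derivative has the form of the quoted line."""
    return 1.0 / (1.0 + (lwp / p50) ** avuln)


def dflcgsdpsi(lwp, p50, avuln):
    """Python transcription of the quoted Fortran expression."""
    ratio = lwp / p50
    return (-1.0 * (1.0 + ratio ** avuln) ** (-2.0)
            * avuln / p50 * ratio ** (avuln - 1.0))


def invalid_power(lwp, p50, avuln):
    """True when (lwp/p50)**avuln has no real value: a NaN input, or a
    negative base raised to a non-integer exponent."""
    ratio = lwp / p50
    if math.isnan(ratio) or math.isnan(avuln):
        return True
    return ratio < 0.0 and not float(avuln).is_integer()


# Hypothetical values: both potentials negative, so the ratio is positive.
lwp, p50, avuln = -1.5, -2.0, 2.5
h = 1.0e-6
finite_diff = (flc_gs(lwp + h, p50, avuln) - flc_gs(lwp - h, p50, avuln)) / (2.0 * h)
print(dflcgsdpsi(lwp, p50, avuln), finite_diff)

print(invalid_power(lwp, p50, avuln))           # ratio > 0: power is well defined
print(invalid_power(1.5, p50, avuln))           # ratio < 0 with exponent 2.5
print(invalid_power(float("nan"), p50, avuln))  # NaN input
```

A check of this kind on lwp, p50_gs(FT) and avuln_gs(FT) just before line 3666 would show whether the inputs are already invalid when the routine is called, for example because a value was not restored from the restart file.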

@ekluzek ekluzek added FATES API update Changes to the FATES version that also REQUIRE an API change in CTSM bug something is working incorrectly labels May 28, 2019
@slevis-lmwg
Contributor

slevis-lmwg commented Jun 8, 2020

The restart problem documented in this issue appears to be resolved, according to my FATES testing in PR #991. I confirmed this by submitting the test from both the PR's branch (lightning_v2_ctsm) and fates_next_api.

However, the new test that I added to FATES testing in PR #991 (e542dfe) has uncovered a restart problem that remained hidden before because the tests were short. I will piggyback on this issue to document the new problem rather than open a new issue with the same title.

I have confirmed that the restart problem exists outside of PR #991 by submitting this 12-month test from fates_next_api:
ERS_D_Lm12.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-Fates
File ERS_D_Lm12.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-Fates.20200607_200330_6g49jh.clm2.h0.0002-01-01-00000.nc.base.cprnc.out
gives the following summary:
A total number of 89 fields were compared
of which 32 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
A total number of 2 fields could not be analyzed
A total number of 0 time-varying fields on file 1 were not found on file 2.
A total number of 0 time-constant fields on file 1 were not found on file 2.
A total number of 0 time-varying fields on file 2 were not found on file 1.
A total number of 8 time-constant fields on file 2 were not found on file 1.
diff_test: the two files seem to be DIFFERENT

The largest normalized RMS difference is this:

NPLANT_SCPF (lndgrid,fates_levscpf,time) t_index = 1 1
12 156 ( 0, 145, 1) ( 0, 3, 1) ( 0, 145, 1) ( 0, 145, 1)
156 3.371194687500000E+05 0.000000000000000E+00 4.3E+04 3.371194687500000E+05 1.6E-03 3.371194687500000E+05
156 3.800990937500000E+05 0.000000000000000E+00 3.800990937500000E+05 3.800990937500000E+05
156 ( 0, 145, 1) ( 0, 3, 1)
avg abs field values: 2.418482421875000E+03 rms diff: 3.4E+03 avg rel diff(npos): 1.6E-03
2.685921875000000E+03 avg decimal digits(ndif): 3.5 worst: 0.9
RMS NPLANT_SCPF 3.4419E+03 NORMALIZED 1.3486E+00
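
As a sanity check on how to read these numbers, the NORMALIZED value is consistent with the RMS difference divided by the mean of the two files' average absolute field values. That definition is inferred from the output above, not taken from the cprnc documentation, so treat it as an assumption.

```python
# Values copied from the cprnc output above.
rms_diff = 3.4419e03
avg_abs_first = 2.418482421875e03   # first "avg abs field values" entry
avg_abs_second = 2.685921875e03     # second entry, on the following line

# Assumption: normalization by the mean of the two average absolute values.
normalized = rms_diff / (0.5 * (avg_abs_first + avg_abs_second))
print(f"{normalized:.4E}")  # prints 1.3486E+00
```

A normalized RMS above 1 means the difference is as large as the field itself, so this is far beyond roundoff, unlike the 1E-6 to 1E-10 differences in the original 5-day test.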

That's all I've got.

slevis-lmwg added a commit to slevis-lmwg/ctsm that referenced this issue Jun 8, 2020
@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Jun 17, 2020
@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Jun 22, 2020
glemieux pushed a commit to glemieux/ctsm that referenced this issue Aug 4, 2020
@glemieux glemieux linked a pull request Nov 16, 2023 that will close this issue
@glemieux
Collaborator

This appears to have been fixed by #2199.

Projects
Status: Done