ch4 conservation error in ne30 1850 case... #135

Closed
ekluzek opened this issue Dec 16, 2017 · 7 comments
Labels
bug something is working incorrectly

Comments

@ekluzek
Collaborator

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-03-09 14:25:12 -0700
Bugzilla Id: 2295
Bugzilla CC: andre, dlawren, fischer, oleson, rfisher, sacks,

The following test fails with cesm1_5_alpha06c

ERS_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio

227: CH4 Conservation Error in CH4Mod during diffusion, nstep, c, errch4 (mol /m^2.timestep)         266      201020                       NaN
227: Latdeg,Londeg=   53.286844012453230        57.510967220423055
227: ENDRUN: ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in /glade/scratch/erik/cesm1_5_alpha06c/components/clm/src/biogeochem/ch4Mod.F90 at line 3582
227: ERROR: Unknown error submitted to shr_sys_abort.
227:#0  0x2B05BDE5EB57
227:#1  0x159E681 in __shr_sys_mod_MOD_shr_sys_backtrace
227:#2  0x159E9DA in __shr_sys_mod_MOD_shr_sys_abort
227:#3  0xDDB8AA in __abortutils_MOD_endrun_vanilla
227:#4  0x10925CA in __ch4mod_MOD_ch4_tran at ch4Mod.F90:0
227:#5  0x10A30BF in __ch4mod_MOD_ch4
227:#6  0xDE160C in __clm_driver_MOD_clm_drv._omp_fn.3 at clm_driver.F90:0
227:#7  0x2B05BE3F396E
227:#8  0xDE26DB in __clm_driver_MOD_clm_drv
227:#9  0xDD770B in __lnd_comp_mct_MOD_lnd_run_mct
227:#10  0x41E585 in __component_mod_MOD_component_run
227:#11  0x40C91A in __cesm_comp_mod_MOD_cesm_run
@ekluzek ekluzek added this to the clm5 milestone Dec 16, 2017
@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-03-10 00:41:01 -0700

I tried replicating some cases for standalone CLM, and they worked...

PASS SMS_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default

PASS SMS_Ld7.ne30_g16.IMCRUCLM50BGC.yellowstone_gnu.clm-default

I made the second one as close as possible to the B1850 case, as follows:

./xmlchange CLM_BLDNML_OPTS="-bgc bgc -crop"
./xmlchange CLM_NML_USE_CASE=1850_control,DATM_PRESAERO=clim_1850,MOSART_BLDNML_OPTS="-simyr 1850",DATM_CLMNCEP_YR_START=1901,DATM_CLMNCEP_YR_END=1920
./xmlchange CCSM_CO2_PPMV=284.7
and added the following to user_nl_clm:
finidat = ''

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-03-10 12:04:08 -0700

Looking at the code where it fails, the problem is that errch4 is NaN, so the else branch is taken and the run aborts. I need to find out why errch4 becomes NaN. I'm running a DEBUG=TRUE test case to see what that shows.
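
As a rough illustration of why a NaN error term lands in the else branch: any ordered comparison involving NaN evaluates to false, so a "the error is small enough" test fails even though no large error was actually computed. The sketch below is in Python rather than Fortran, and the function names, the tolerance value, and the shape of the check are assumptions made for illustration only; they are not taken from ch4Mod.F90.

```python
import math


def conservation_branch(errch4, tol=1.0e-8):
    """Return which branch a simple 'error is small enough' check takes.

    Hypothetical stand-in for a conservation check; the tolerance is
    an illustrative value, not the one used in the model.
    """
    if abs(errch4) < tol:
        return "ok"
    else:
        # A NaN also ends up here, because abs(NaN) < tol is False.
        return "abort"


def conservation_branch_nan_aware(errch4, tol=1.0e-8):
    """Same check, but report a NaN explicitly before comparing."""
    if math.isnan(errch4):
        return "abort: errch4 is NaN"
    if abs(errch4) < tol:
        return "ok"
    return "abort"


if __name__ == "__main__":
    for value in (1.0e-12, 1.0, float("nan")):
        print(value, conservation_branch(value), conservation_branch_nan_aware(value))
```

Testing for NaN separately does not fix anything, but it would make the log say that the balance term itself is undefined rather than suggesting a genuine conservation imbalance.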

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-03-14 12:14:52 -0600

All of my DEBUG tests passed.

DONE ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio : (test finished, successful coupler log)
--- Test Functionality ---:
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.cam.h0.nc : test compare cam.h0 (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.cice.h.nc : test compare cice.h (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.clm2.h0.nc : test compare clm2.h0 (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.pop.h.nc : test compare pop.h (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.cpl.hi.nc : test compare cpl.hi (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio : test functionality summary (ERS_test)
PASS ERS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.memleak
--- Test time is 1286 seconds ---
DONE ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default : (test finished, successful coupler log)
--- Test Functionality ---:
PASS ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default.clm2.h0.nc : test compare clm2.h0 (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default.clm2.h1.nc : test compare clm2.h1 (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default.cpl.hi.nc : test compare cpl.hi (.base and .rest files)
PASS ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default : test functionality summary (ERS_test)
PASS ERS_D_Ld7.ne30_g16.I1850CRUCLM50BGCCROP.yellowstone_gnu.clm-default.memleak
--- Test time is 528 seconds ---
DONE SMS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio : (test finished, successful coupler log)
--- Test Functionality: ---
PASS SMS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio : successful coupler log
PASS SMS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio : test functionality summary
PASS SMS_D_Ld7.ne30_g16.B1850.yellowstone_gnu.allactive-defaultio.memleak
--- Test time is 751 seconds ---

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Ben Andre < andre > - 2016-03-14 12:17:22 -0600

gnu doesn't trap floating point exceptions and abort. You need to run an intel debug.
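
To illustrate the difference between trapping and not trapping: without a trap, an invalid operation quietly yields NaN and the value propagates until some later check trips over it, far from where it was created. With a trap, execution stops at the operation that produced it. The following Python sketch only emulates that behaviour with an explicit check after each step; the function names and the arithmetic are hypothetical and unrelated to the model code.

```python
import math


def checked(value, label):
    """Emulate a trap: raise at the first step whose result is NaN."""
    if math.isnan(value):
        raise FloatingPointError(f"NaN produced at: {label}")
    return value


def quiet_pipeline(a, b):
    """No trapping: a NaN created in the first step flows to the end."""
    diff = a - b            # inf - inf gives NaN without raising
    scaled = diff * 2.0     # NaN propagates
    return scaled + 1.0     # still NaN


def trapping_pipeline(a, b):
    """Same arithmetic, but each intermediate result is checked."""
    diff = checked(a - b, "diff = a - b")
    scaled = checked(diff * 2.0, "scaled = diff * 2.0")
    return checked(scaled + 1.0, "result = scaled + 1.0")


if __name__ == "__main__":
    inf = float("inf")
    print(quiet_pipeline(inf, inf))
    try:
        trapping_pipeline(inf, inf)
    except FloatingPointError as exc:
        print(exc)
```

In the trapping variant the error message names the first offending step, which is the information a debug build with floating-point trapping is meant to provide.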

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-03-14 12:49:38 -0600

(In reply to Ben Andre from comment #4)

gnu doesn't trap floating point exceptions and abort. You need to run an
intel debug.

OK thanks Ben. I am trying that now.

But one problem with the gnu debug tests is that they run to completion rather than dying in the same way. That means errch4 isn't NaN in the DEBUG case, but IS NaN in the non-DEBUG case. Since the behavior differs between the two, there is likely a numerical error that triggers the abort in one case and not the other because of a tiny roundoff-level difference. That likely means this will be really hard to track down...
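
A small sketch of how roundoff-level differences can arise between builds: floating-point addition is not associative, so if an optimized build evaluates a sum in a different order than a debug build, the results can differ in the last bits. The Python example below uses made-up values and hypothetical function names; it shows only the general effect, not what happens in this particular run.

```python
def sum_left_to_right(values):
    """Accumulate in the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate the same values in reverse order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]
    forward = sum_left_to_right(values)
    backward = sum_right_to_left(values)
    # Same inputs, different evaluation order, slightly different result.
    print(repr(forward), repr(backward), forward == backward)
```

A difference this small is normally harmless, but if it decides which side of a threshold a value falls on, or whether a denominator is exactly zero, the two builds can diverge into entirely different behavior.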

@billsacks billsacks added bug something is working incorrectly and removed severity: critical labels Nov 26, 2018
@ekluzek ekluzek removed this from the clm5 milestone Jul 7, 2019
@ekluzek
Collaborator Author

ekluzek commented Jul 7, 2019

There aren't any recent tests of this case in the newest development versions of the code, so I don't know whether it's still a problem. CLM does run a few ne30_g16 test cases, and I don't see problems there.

@billsacks
Member

Closing this since we haven't seen the issue in a long time. We can reopen if we start seeing it again.
