
CH4 Conservation Error in CH4Mod during diffusion #260

Closed
blcc opened this issue Apr 29, 2021 · 11 comments

@blcc (Contributor) commented Apr 29, 2021

Hi, I encountered an error when testing the current NorESM code on Betzy, with both the master and noresm2 branches.

After git clone and checkout_externals (Externals.cfg and Externals_continuous_development.cfg are identical):

cime/scripts/create_newcase --case ~/work/noresm2_cases/noresm_test005 --compset NHIST --res f19_tn14 --mach betzy --project nn9039k 
cd ~/work/noresm2_cases/noresm_test005 && ./case.setup && ./case.build && ./case.submit

The job stopped during initialization; here is the cesm.log:

[skip]
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
(seq_domain_areafactinit) : min/max mdl2drv   0.999999999983463       1.00000000001725    areafact_o_OCN
(seq_domain_areafactinit) : min/max drv2mdl   0.999999999982750       1.00000000001654    areafact_o_OCN
(seq_domain_areafactinit) : min/max mdl2drv   0.999999999983463       1.00000000001725    areafact_i_ICE
(seq_domain_areafactinit) : min/max drv2mdl   0.999999999982750       1.00000000001654    areafact_i_ICE
 CH4 Conservation Error in CH4Mod during diffusion, nstep, c, errch4 (mol /m^2.timestep)           0      107431                     NaN
 Latdeg,Londeg=   48.3157894736841        40.0000000000000
 ENDRUN:
 ERROR:
  ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in ch4Mod.F90 at line 3948
Image              PC                Routine            Line        Source             
cesm.exe           00000000029224C6  Unknown               Unknown  Unknown
cesm.exe           00000000025A6B80  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           0000000001B81AFF  abortutils_mp_end          50  abortutils.F90
cesm.exe           00000000021693E7  ch4mod_mp_ch4_tra        3947  ch4Mod.F90
cesm.exe           000000000215BE02  ch4mod_mp_ch4_           2045  ch4Mod.F90
cesm.exe           0000000001B8D091  clm_driver_mp_clm         960  clm_driver.F90
cesm.exe           0000000001B7689A  lnd_comp_mct_mp_l         456  lnd_comp_mct.F90
cesm.exe           00000000004376A0  component_mod_mp_         728  component_mod.F90
cesm.exe           000000000041B85B  cime_comp_mod_mp_        2724  cime_comp_mod.F90
cesm.exe           00000000004372E7  MAIN__                    125  cime_driver.F90
cesm.exe           0000000000419512  Unknown               Unknown  Unknown
libc-2.17.so       00002AE442507545  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000419429  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 143 in communicator MPI_COMM_WORLD
with errorcode 1001.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 179268.0 ON b4139 CANCELLED AT 2021-04-29T10:47:01 ***

I also tried the Externals.cfg from an older, working NorESM checkout, but the error still exists.

Is this a bug, or am I using the wrong command or branch?

Thanks,
Ping-Gin

@adagj (Contributor) commented Apr 29, 2021

@blcc
Hi, can you attach the README from this specific run, or the release/tag you used? You should use master or noresm-release2.0.4 when running on Betzy. For details, see
https://github.com/NorESMhub/NorESM/tree/noresm2
and
https://noresm-docs.readthedocs.io/en/noresm2/configurations/platforms.html

Remember that you need to re-run checkout_externals (so that Externals.cfg is re-applied) when changing branch or tag.
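
For example, roughly (a sketch assuming a standard NorESM checkout; the tag name is only illustrative, and the path of the checkout_externals script may differ):

cd NorESM
git checkout release-noresm2.0.4        # or master
./manage_externals/checkout_externals   # re-fetches the components listed in Externals.cfg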

Best regards,
Ada

@blcc (Contributor, Author) commented Apr 29, 2021

Thank you Ada. The README.case is attached.
README.case.txt

I tried master but got the same error. I will try release2.0.4.
Best regards,
Ping-Gin

@adagj (Contributor) commented Apr 29, 2021

Thanks, then it is probably not related to the Betzy settings.
One other issue we have had on Betzy, which might be useful to check, is files getting corrupted when they are copied to Betzy, which causes NaNs in the files. Are you using restart files? If so, you can check whether the checksums of the restart files you use on Betzy match those of the files stored on NIRD (e.g. using sha256sum $FILENAME).
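
For example, roughly (the paths and file name below are placeholders, not the actual files):

sha256sum /path/on/betzy/restart_file.nc   # checksum of the copy on Betzy
sha256sum /path/on/nird/restart_file.nc    # checksum of the original on NIRD
# the two checksums should be identical if the copy is intact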

If that doesn't help: @DirkOlivie, @monsieuralok, maybe you can help?

@blcc (Contributor, Author) commented Apr 29, 2021

I tried release-noresm2.0.4, but got the same error.
Now I suspect the input data on Betzy is corrupted.
I'll check it later. Thanks.

@DirkOlivie (Contributor) commented

Hi Ping-Gin,

if the error is still there, could you also paste the last lines of the lnd.log file (the error is in the land component) in this issue?

In the land model, a correction is currently about to be applied; see
NorESMhub/CTSM#11
A pull request has been created, and the fix will probably soon be available in the code; see
NorESMhub/CTSM#12

This is possibly related to your problem, but I am not sure.

Best regards,
Dirk

@blcc (Contributor, Author) commented May 3, 2021

Thanks @DirkOlivie, I did some tests with the merged code but still got the same error.
However, the problem disappeared when I changed NTASKS_LND from the default 192 to 128.
I guess CTSM somehow has a bug when NTASKS is 192, which makes some points of t_soisno and some other variables NaN, and the run finally crashes in ch4Mod.F90.
The easiest way to avoid this problem is to change the default PE setting in cime_config/config_pes.xml, and to add a warning about it in the documentation or code; a per-case workaround is sketched below.
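
For reference, the per-case workaround would roughly be (a sketch, run from the case directory; after changing the PE layout the case has to be set up again and rebuilt):

./xmlchange NTASKS_LND=128
./case.setup --reset
./case.build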

Ping-Gin

@adagj (Contributor) commented May 3, 2021

@blcc thanks for the clarification, we will add it to the documentation.
When building, did you use the --pecount option (https://noresm-docs.readthedocs.io/en/noresm2/configurations/platforms.html#hpc-platforms, section 4.1.2.1)?
@monsieuralok can you make sure that the --pecount option uses 128 NTASKS for the land component?
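
For example, roughly (the case name is illustrative; the other options are taken from the original report):

cime/scripts/create_newcase --case ~/work/noresm2_cases/noresm_test_pecount \
    --compset NHIST --res f19_tn14 --mach betzy --project nn9039k --pecount M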

@blcc (Contributor, Author) commented May 3, 2021

Thanks @adagj, I did not use the --pecount option. It seems the M set (8 nodes) was applied automatically.

@monsieuralok commented

@adagj @DirkOlivie I guess we should open this issue with CESM, as it was reported earlier by others: ESCOMP/CTSM#135. It might be that this is solved in a newer version, but it is difficult to check a newer version in the same framework.

@DirkOlivie (Contributor) commented

I have experienced the same error, "ERROR: CH4 Conservation Error in CH4Mod during diffusionERROR in ch4Mod.F90", in an NHIST (1850-2014) simulation.

It happened after 20 years, at the moment of automatic resubmission (1870-01-01).

Resubmitting manually gave the same error message.
Resubmitting after recopying the 1870-01-01 restart files from the archive directory to the run directory gave the same problem.
Resubmitting after recopying earlier restart files (1860-01-01) worked, and the simulation also later passed 1870-01-01 without any problems.
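
For reference, the recopy step was roughly the following (a sketch; the paths are placeholders and depend on the case configuration):

# copy the 1860-01-01 restart set (including the rpointer files) back into the run directory
cp /path/to/archive/rest/1860-01-01-00000/* /path/to/run/
# then resubmit the case
./case.submit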

@blcc (Contributor, Author) commented Jul 1, 2022

I'll close this issue since no one has had the same problem for a long time.
We can reopen it if it happens again.
