Balance check error with the new B1850BPRPL45BGC compset #102

Closed
ekluzek opened this issue Dec 16, 2017 · 12 comments
Labels
bug something is working incorrectly

Comments

@ekluzek
Collaborator

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-15 15:54:33 -0600
Bugzilla Id: 2027
Bugzilla Depends: 2026,
Bugzilla CC: dlawren, jshollen, klindsay, muszala, mvertens, rfisher, sacks,

We added a new compset B1850BPRPL45BGC that blows up shortly after initialization with a balance check error.

300: memory_write: model date = 10101 82800 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
300: memory_write: model date = 10101 84600 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
300: memory_write: model date = 10102 0 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
102: WARNING: snow balance error
100: WARNING: snow balance error
100:forrtl: severe (40): recursive I/O operation, unit 6, file unknown
100:Image PC Routine Line Source
100:libirc.so 00002AE45FB70A1E Unknown Unknown Unknown
100:libirc.so 00002AE45FB6F4B6 Unknown Unknown Unknown
100:cesm.exe 000000000237A6F2 Unknown Unknown Unknown
100:cesm.exe 00000000022FB33C Unknown Unknown Unknown
100:cesm.exe 000000000235542C Unknown Unknown Unknown
100:cesm.exe 0000000001CBCE47 shr_sys_mod_mp_pr 498 shr_sys_mod.F90
100:cesm.exe 0000000001CBD590 shr_sys_mod_mp_sh 280 shr_sys_mod.F90
100:cesm.exe 0000000001101A2B decompmod_mp_get_ 209 decompMod.F90
100:cesm.exe 00000000013E02A1 getglobalvaluesmo 44 GetGlobalValuesMod.F90
100:cesm.exe 00000000012467B6 balancecheckmod_m 439 BalanceCheckMod.F90
100:cesm.exe 0000000000FC79BA clm_driver_mp_clm 544 clm_driver.F90
100:libiomp5.so 00002AE46003D4F3 Unknown Unknown Unknown
102:forrtl: severe (40): recursive I/O operation, unit 6, file unknown
102:Image PC Routine Line Source
102:libirc.so 00002B0CA5E86A1E Unknown Unknown Unknown
102:libirc.so 00002B0CA5E854B6 Unknown Unknown Unknown
102:cesm.exe 000000000237A6F2 Unknown Unknown Unknown
102:cesm.exe 00000000022FB33C Unknown Unknown Unknown
102:cesm.exe 000000000235542C Unknown Unknown Unknown
102:cesm.exe 0000000001CBCE47 shr_sys_mod_mp_pr 498 shr_sys_mod.F90
102:cesm.exe 0000000001CBD590 shr_sys_mod_mp_sh 280 shr_sys_mod.F90
102:cesm.exe 0000000001101A2B decompmod_mp_get_ 209 decompMod.F90
102:cesm.exe 00000000013E02A1 getglobalvaluesmo 44 GetGlobalValuesMod.F90
102:cesm.exe 00000000012467B6 balancecheckmod_m 439 BalanceCheckMod.F90
102:cesm.exe 0000000000FC79BA clm_driver_mp_clm 544 clm_driver.F90
102:libiomp5.so 00002B0CA63534F3 Unknown Unknown Unknown
100:INFO: 0031-306 pm_atexit: pm_exit_value is 40.
102:INFO: 0031-306 pm_atexit: pm_exit_value is 40.

This is with cesm1_3_alpha12b externals and

scripts https://svn-ccsm-models.cgd.ucar.edu/scripts/trunk_tags/scripts4_140814a
scripts/ccsm_utils/Machines https://svn-ccsm-models.cgd.ucar.edu/Machines/trunk_tags/Machines_140811

out of the box for a SMS.f09_g16.B1850BPRPL45BGC.yellowstone_intel case.

@ekluzek ekluzek added this to the clm5 milestone Dec 16, 2017
@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-15 16:42:00 -0600

The abort seems to be on this line in BalanceCheck...

   write(iulog,*)'nstep= ',nstep, &
        ' local indexc= ',indexc, &
        ' global indexc= ',GetGlobalIndex(decomp_index=indexc, clmlevel=namec)

But, then GetGlobalIndex calls get_proc_bounds, and ends up trying to
execute the line...

#ifdef _OPENMP
   if ( OMP_GET_NUM_THREADS() > 1 )then
      call shr_sys_abort( trim(subname)//' ERROR: Calling from inside a threaded region')
   end if
#endif

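So shr_sys_abort then tries to write out its error message while the original write statement is still being evaluated. That is a recursive I/O operation on the same unit, which is why the Fortran runtime aborts with the "forrtl: severe (40)" error shown above.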

So one change is to store the result of GetGlobalIndex in an integer temporary and write that out. The other issue is that this is being called from within a threaded region, and it shouldn't be.
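The temporary-variable change can be sketched as follows. This is a minimal standalone illustration, not the actual BalanceCheckMod code: `noisy_global_index` is a hypothetical stand-in for a function that may itself write to the same unit (as GetGlobalIndex ends up doing via shr_sys_abort), and the index arithmetic is made up.

```fortran
program recursive_io_sketch
  implicit none
  integer :: indexc, gindex

  indexc = 42

  ! Problematic pattern (left commented out): the function is evaluated
  ! while the WRITE to unit 6 is already in progress. If the function also
  ! writes to unit 6, that is recursive I/O on the same unit.
  ! write(6,*) ' global indexc= ', noisy_global_index(indexc)

  ! Safer pattern: evaluate the function first, then write the stored result.
  gindex = noisy_global_index(indexc)
  write(6,*) ' local indexc= ', indexc, ' global indexc= ', gindex

contains

  ! Hypothetical stand-in for a lookup that may emit its own messages.
  function noisy_global_index(local_index) result(global_index)
    integer, intent(in) :: local_index
    integer :: global_index

    write(6,*) 'noisy_global_index called'   ! I/O inside the function
    global_index = local_index + 1000        ! made-up mapping
  end function noisy_global_index

end program recursive_io_sketch
```

This only avoids the recursive I/O crash. The function would still abort when called from a threaded region, so the second issue remains.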

So another way to get around this would be to change the PE layout to NOT be threaded. Right now it's sending 2 threads to every component. I'm trying a case with only 1 thread per component, and we'll see what it does.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-15 17:25:06 -0600

Using a non-threaded layout (300x1) the compset is now running. The snow balance warning at the beginning is...

275: WARNING: snow balance error
275: nstep= 0 local indexc= 233779 global indexc= 167423
275: ctype= 75 ltype= 9 errh2osno= 9.359429059187598E-003
278: WARNING: snow balance error
278: nstep= 0 local indexc= 236445 global indexc= 150425
278: ctype= 75 ltype= 9 errh2osno= 9.917991872539743E-003
1: Opened file

Note that I also made the change from bug 2026, as I thought that might be a problem here as well.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-15 19:31:06 -0600

Yep, the non-threaded layout passes...

PASS SMS_D_P300x1.f09_g16.B1850BPRPL45BGC.yellowstone_intel
PASS SMS_D_P300x1.f09_g16.B1850BPRPL45BGC.yellowstone_intel.memleak

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-16 14:25:26 -0600

It turns out the aux_clm_short tests on goldbach for intel, nag and pgi show this same problem with the tests...

PET_P16x2_D.f10_f10.I1850CLM45BGC.goldbach_nag.clm-ciso

(for each compiler in turn).

For the nag compiler, the build of MCT fails, because of an issue with the gcc compiler...

configure:2398: checking whether the C compiler works
configure:2420: gcc -g -Wl,--as-needed,--allow-shlib-undefined -DLINUX -DMCT_INTERFACE -DHAVE_MPI -DTHREADED_OMP -DFORTRANUNDERSCORE -DNO_CRAY_POINTERS -DNO_SHR_VMATH -DNO_C_SIZEOF -DLINUX -DCPRNAG -DHAVE_SLASHPROC -I.. -I. -I/scratch/cluster/erik/sharedlibroot.140814-235435/nag/openmpi/debug/threads/include -I/scratch/cluster/erik/sharedlibroot.140814-235435/nag/openmpi/debug/threads/MCT/noesmf/a1l1r1i1o1g1w1/csm_share -I/usr/local/netcdf-gcc-nag/include -I/usr/local/openmpi-gcc-nag/include -I/scratch/cluster/erik/sharedlibroot.140814-235435/nag/openmpi/debug/threads/include -I/fs/cgd/data0/erik/clm_beta12ext/models/csm_share/shr -L/home/santos/lib/fake_omp -lfake_omp -Wl,-Wl,,--rpath=/home/santos/lib/fake_omp conftest.c -L/usr/local/netcdf_c-4.3.0_f-4.4-beta1-gcc-g++-4.4.7-3-nag-5.3.1-907/lib -lnetcdff -L/usr/local/hdf5/lib -lnetcdf -lnetcdf -L/usr/local/nag-5.3.1-907/lib/NAG_Fortran -lf53 >&5
/usr/bin/ld: unrecognized option '-Wl'
/usr/bin/ld: use the --help option for usage information
collect2: ld returned 1 exit status

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-22 10:36:20 -0600

OK, I thought this was coming from a non-threaded region -- but it's NOT. Balance check is inside a threaded loop in clm_driver. So I think this is correctly identifying a problem.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-08-22 17:53:40 -0600

OK, I was able to make some changes to the threaded version to get it to work. The problem is that GetGlobalIndex works out the bounds itself (assuming processor bounds rather than clump bounds) instead of having the bounds passed in. If you pass bounds into GetGlobalIndex, it works. That ended up being a lot harder than I wanted it to be, but I was able to do it, and now it works. The code diff is about a thousand lines, so I won't show all of it, but here's an example...

-  function initial_template_col_soil(c_new) result(c_template)
+  function initial_template_col_soil(bounds, c_new) result(c_template)
     !
     ! !DESCRIPTION:
     ! Find column to use as a template for a vegetated column that has newly become active.
@@ -148,8 +148,9 @@
     use clm_varcon, only : ispval
     !
     ! !ARGUMENTS:
-    integer              :: c_template ! function result
-    integer , intent(in) :: c_new        ! column index that needs initialization
+    type(bounds_type) , intent(in) :: bounds     ! bounds
+    integer                        :: c_template ! function result
+    integer           , intent(in) :: c_new      ! column index that needs initialization
     !
     ! !LOCAL VARIABLES:
     
@@ -159,7 +160,7 @@
     if (col%wtgcell(c_new) > 0._r8) then
        write(iulog,*) subname// ' ERROR: Expectation is that the only vegetated columns that&
             & can newly become active are ones with 0 weight on the grid cell'
-       call endrun(decomp_index=c_new, clmlevel=namec, msg=errMsg(__FILE__, __LINE__))
+       call endrun(bounds, decomp_index=c_new, clmlevel=namec, msg=errMsg(__FILE__, __LINE__))
     end if
 
     c_template = ispval

Bill might be concerned that the above could break unit tests, but I will look at them and make sure they still work.
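The idea of passing bounds in, rather than having the lookup discover them, can be sketched roughly as below. Everything here is hypothetical and only meant to show the shape of the interface: `bounds_type` is a stripped-down stand-in with just two fields, and `global_index_sketch` and the `gindex` table are illustrative names, not the real GetGlobalIndex implementation.

```fortran
module bounds_sketch
  implicit none
  private
  public :: bounds_type, global_index_sketch

  ! Hypothetical, stripped-down stand-in for the model's bounds type.
  type :: bounds_type
     integer :: begc   ! first column index covered by these bounds
     integer :: endc   ! last column index covered by these bounds
  end type bounds_type

contains

  ! The caller supplies the bounds (clump or proc), so this routine never
  ! has to ask for processor bounds from inside a threaded region.
  function global_index_sketch(bounds, decomp_index, gindex) result(global_index)
    type(bounds_type), intent(in) :: bounds
    integer, intent(in) :: decomp_index
    integer, intent(in) :: gindex(bounds%begc:)   ! local-to-global table
    integer :: global_index

    if (decomp_index < bounds%begc .or. decomp_index > bounds%endc) then
       global_index = -1                          ! outside the supplied bounds
    else
       global_index = gindex(decomp_index)
    end if
  end function global_index_sketch

end module bounds_sketch

program bounds_demo
  use bounds_sketch
  implicit none
  type(bounds_type) :: clump
  integer :: table(11:15)
  integer :: i

  clump = bounds_type(begc=11, endc=15)
  table = [(100 + i, i = 11, 15)]                 ! made-up global indices

  write(*,*) 'global index of 13: ', global_index_sketch(clump, 13, table)
  write(*,*) 'out-of-range lookup: ', global_index_sketch(clump, 99, table)
end program bounds_demo
```

The cost of this design, as the diff above suggests, is that every caller up the chain (including endrun) has to carry a bounds argument.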

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2014-11-18 13:19:17 -0700

This issue shows up in cesm1_3_alpha14d with the following tests that fail...

FAIL SMS_D.f09_g16.BPIPDC5L45BGC.edison_intel

FAIL ERI_PT.f09_g16.B1850C5L45BGC.edison_intel

RUN ERS.f09_g16.BPIPDC5L45BGC.edison_intel.GC.cesm1_3_alpha14d

RUN ERS_PT.f09_g16.BRCP85C5L45BGC.edison_intel.GC.cesm1_3_alpha14d

and the same tests fail on yellowstone_intel.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Bill Sacks < sacks > - 2014-11-20 14:40:36 -0700

I think the solution is to add an optional argument to get_proc_bounds, like 'force', and if that's provided, then it gets the proc bounds even if you're in a threaded region.
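A rough sketch of what such an optional argument could look like is below. It is an assumption-laden illustration, not the actual decompMod interface: the routine name, the `ok` status flag (used here instead of calling shr_sys_abort), and the returned bounds of 1 and 100 are all made up. The `!$` lines are OpenMP conditional-compilation lines, so the sketch also builds without OpenMP.

```fortran
module proc_bounds_sketch
  !$ use omp_lib, only : omp_get_num_threads
  implicit none
  private
  public :: get_proc_bounds_sketch

contains

  subroutine get_proc_bounds_sketch(begc, endc, ok, force)
    integer, intent(out) :: begc, endc
    logical, intent(out) :: ok               ! hypothetical status flag
    logical, intent(in), optional :: force   ! bypass the threaded-region check
    logical :: l_force, in_threaded

    ! Optional arguments may only be referenced when present, so copy the
    ! value into a local with a default.
    l_force = .false.
    if (present(force)) l_force = force

    in_threaded = .false.
    !$ in_threaded = (omp_get_num_threads() > 1)

    if (in_threaded .and. .not. l_force) then
       ok = .false.                          ! the real code aborts here
       begc = -1
       endc = -1
       return
    end if

    ok = .true.
    begc = 1                                 ! made-up processor bounds
    endc = 100
  end subroutine get_proc_bounds_sketch

end module proc_bounds_sketch

program force_demo
  use proc_bounds_sketch
  implicit none
  integer :: begc, endc
  logical :: ok

  call get_proc_bounds_sketch(begc, endc, ok)
  write(*,*) 'serial call ok? ', ok

  !$omp parallel private(begc, endc, ok) num_threads(2)
  call get_proc_bounds_sketch(begc, endc, ok, force=.true.)
  !$omp critical
  write(*,*) 'forced call inside parallel region ok? ', ok
  !$omp end critical
  !$omp end parallel
end program force_demo
```

Existing callers that omit `force` keep the current behavior, which is the main appeal of making the argument optional.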

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Bill Sacks < sacks > - 2014-11-20 14:41:02 -0700

Mariana put a temporary workaround in her branch, which is to become r097.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2015-04-03 15:32:07 -0600

The new test: ERP_P15x2_D_Ld5.f10_f10.I1850CLM45BGC.goldbach_nag.clm-ciso shows this same problem.

@ekluzek
Collaborator Author

ekluzek commented Dec 16, 2017

Erik Kluzek < erik > - 2016-01-07 16:14:35 -0700

This still seems to be an issue on hobart_nag as of clm4_5_6_r158, with the following notes in the expected fails file.

FAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_intel.clm-ciso

CFAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso

FAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_pgi.clm-ciso

However, we removed all the threaded ciso tests from hobart_nag.

But, we do have a threaded test on yellowstone that is working (as of clm4_5_7_r164)...

ERP_P15x2_D_Ld5.f10_f10.I1850CLM45BGC.yellowstone_pgi.clm-ciso

billsacks added a commit to billsacks/ctsm that referenced this issue May 8, 2018
04273058 Merge pull request ESCOMP#103 from billsacks/no_logging
9bb46aa5 Make no-logging be the default
9af6b021 Merge pull request ESCOMP#102 from billsacks/explain_qmark
7f973ae3 Run through make style
d077a57d Add message describing meaning of '?'
60fc03b7 Merge pull request ESCOMP#101 from ESMCI/catch_svn_error
28073ec4 add exception class
4fb7e47f catch errors from svn status --xml
bfa48312 Merge pull request ESCOMP#98 from billsacks/quieter
7d12650b make style
afb4f115 Make more git and svn commands quieter

git-subtree-dir: manage_externals
git-subtree-split: 04273058c297127927f0fc85eed1cdc33e1a3af3
@ekluzek ekluzek removed this from the clm5 milestone Aug 11, 2019
@ekluzek ekluzek added bug something is working incorrectly next this should get some attention in the next week or two. Normally each Thursday SE meeting. and removed next this should get some attention in the next week or two. Normally each Thursday SE meeting. labels Aug 11, 2019
@ekluzek
Collaborator Author

ekluzek commented Aug 12, 2019

The following test now passes in ctsm1.0.dev055...

PET_D_Ld10_P48x2.f10_f10_musgs.IHistClm50BgcCrop.hobart_nag.clm-ciso_decStart

so I'm marking this closed as it looks like it's not a problem anymore.

@ekluzek ekluzek closed this as completed Aug 12, 2019
slevis-lmwg pushed a commit to slevis-lmwg/ctsm that referenced this issue Dec 22, 2022
fde04e4 Merge pull request ESCOMP#138 from billsacks/add_python38_tests
37e4c4a Do not update dictionary in-place in loop
7e8474b Remove testing on mac os
7f41c56 Fix pylint issue
3065b0d Add travis-ci tests with python3.7 and python3.8
34fbf55 Add support for git sparse checkout
6c6ef9f Fix pylint errors
6a659ad Added test for sparse checkout and updated documentation
1443243 Support for git sparsecheckout via read-tree.
a48558d Merge pull request ESCOMP#119 from gold2718/submodules
f72ffe7 Do not try git submodule update if no .gitmodules file (git bug)
804e0af Fix a pylint error
45aef95 Addressed review concerns
7da5031 New capability to use git submodule information to checkout externals
1926530 Merge pull request ESCOMP#118 from mnlevy1981/svn_switch
b1b028d Updates after testing
9ea73e6 Add --svn-ignore-ancestry argument
fc5acda Merge pull request ESCOMP#114 from billsacks/fix_large_output_hang
aa2eb71 Try getting travis-ci working on MacOS
96842b4 Fix pylint errors
813fe3c pylint: disable useless-object-inheritance
c49d878 Rework execute_subprocess timeout handling
8fc0e5f Cleanup from 'make style'
b0b23a6 Merge pull request ESCOMP#110 from gold2718/help_fix
3cbcd16 Fixed and clarified help documentation
025e6cb Merge pull request ESCOMP#107 from jedwards4b/ignore_empty_git_dir
489842b if you encounter an empty directory clone into it
0c5a2f6 Merge pull request ESCOMP#106 from billsacks/remove_logfile_message
7799e99 Remove message about checking the log file for more details
0427305 Merge pull request ESCOMP#103 from billsacks/no_logging
9bb46aa Make no-logging be the default
9af6b02 Merge pull request ESCOMP#102 from billsacks/explain_qmark
7f973ae Run through make style
d077a57 Add message describing meaning of '?'
60fc03b Merge pull request ESCOMP#101 from ESMCI/catch_svn_error
28073ec add exception class
4fb7e47 catch errors from svn status --xml
bfa4831 Merge pull request ESCOMP#98 from billsacks/quieter
7d12650 make style
afb4f11 Make more git and svn commands quieter
a465b4f add --quiet argument to improve performance
b2f3ae8 Merge pull request ESCOMP#83 from jedwards4b/jedwards/components_arg
3f4c88f fix comment
c1b5b09 remove unneeded logic
4fdf180 one more test
f78d60f another test
bf52ac6 add a test
91d4851 fix pylint issue
987df5a only use components if populated
98a810d add a components arg to checkout only select components
6923119 Merge pull request ESCOMP#90 from ESMCI/issue-86-detached-sync-status
b11ad61 Merge branch 'master' into issue-86-detached-sync-status
3b624cf Merge pull request ESCOMP#93 from billsacks/work_on_coverage
2562830 Run a single coverage command rather than two separate commands
d1de5f8 Return to starting directory after each test
144f7d9 Merge pull request ESCOMP#92 from billsacks/point_to_esmci
58b8d3e Point to location of repository
0b46d81 Point to correct location for build/coverage status
a385070 fix pylint problems
dcf17b6 make style cleanup
92d342c Rewrite _current_ref to use plumbing rather than parsing porcelain
ca0a5d3 Rework some git repository functions, and major rework of unit tests
719383e Remove commented-out pdb.set_trace() call
376c780 Bugfix: detect and report 'detached from' correctly
21813e9 Add system test demonstrating failure to detect out of sync status.
1a7c59d Merge documentation update into master.
f1e9e99 Merge schema support for git hashes into master.
247fee1 Document return values of checkout.py: main
195c1d0 Implement explicit use of a hash for git repositories.
12dd743 Refactor: schema validation output
fdbc720 Bugfix: incorrect order of operations validing user input
d6423c6 Bugfix: timeout limit for subprocesses
7998f60 Update readme and help output
00b6fb2 Bugfix: add explicit schema version checking
0527869 Update readme
1ae8c84 Merge bugfix branch for stale subexternals into master.
b0c16d7 Bugfix: stale sub-externals after checkout.
30a4e44 Finish implementing system test for mixed-use externals
ac7ff96 Update mixed-use test repo.
bfda7b9 Bugfix: regexp for determining git tracking branches

git-subtree-dir: manage_externals
git-subtree-split: fde04e4d9a758b3aa277aa5fa44a59f5153f2958
samsrabin pushed a commit to samsrabin/CTSM that referenced this issue Apr 19, 2024
fixed log output encountered in ctsm
### Description of changes
Minor fix to output vector stream info

### Specific notes

Contributors other than yourself, if any:

CMEPS Issues Fixed (include github issue #):

Are there dependencies on other component PRs
 - [ ] CIME (list)
 - [ ] CMEPS (list)

Are changes expected to change answers?
 - [x] bit for bit
 - [ ] different at roundoff level
 - [ ] more substantial

Any User Interface Changes (namelist or namelist defaults changes)?
 - [ ] Yes
 - [x] No

Testing performed:
- [ ] (required) aux_cdeps
   - machines and compilers:
   - details (e.g. failed tests):
- [ ] (optional) CESM prealpha test
   - machines and compilers
   - details (e.g. failed tests):
 - found in running the aux_clm test suite with the nuopc cap

Hashes used for testing:
feature/stream_refactor in https://github.com/mvertens/CTSM