Balance check error with the new B1850BPRPL45BGC compset #102
Comments
Erik Kluzek < erik > - 2014-08-15 16:42:00 -0600 The abort seems to be on this line in BalanceCheck...
But then GetGlobalIndex calls get_proc_bounds, which ends up in its #ifdef _OPENMP branch, so you are trying to do a shr_sys_abort that writes out an error message from inside what is itself a write statement. So the OS blows up in your face. So one change is to assign the result of GetGlobalIndex to an integer temporary and write that out. The other issue is that this is being done from within a threaded region, and it shouldn't be. So another way to get around this would be to change the PE layout to NOT be threaded. Right now it's sending 2 threads to every component. I'm trying a case with only 1 thread per component, and we'll see what it does.
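The integer-temporary change described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the CLM code: `lookup_index` is a hypothetical stand-in for a lookup such as GetGlobalIndex that may itself write an error message, and unit 6 is used only because it is the unit named in the traceback.

```fortran
program write_temp_demo
  implicit none
  integer :: gindex   ! temporary that holds the function result

  ! Risky pattern (left commented out): the function is evaluated while
  ! the write to unit 6 is in progress, so any write it performs on that
  ! unit is a recursive I/O operation.
  ! write(6,*) 'global index = ', lookup_index(-1)

  ! Safer pattern: evaluate the function first, then write the stored value.
  gindex = lookup_index(-1)
  write(6,*) 'global index = ', gindex

contains

  ! Hypothetical stand-in for a lookup that reports errors by writing.
  integer function lookup_index(local_index)
    integer, intent(in) :: local_index
    if (local_index < 1) then
      write(6,*) 'lookup_index: invalid local index ', local_index
      lookup_index = -1
    else
      lookup_index = local_index + 1000
    end if
  end function lookup_index

end program write_temp_demo
```

With the temporary, the function's own write completes before the outer write starts, so the two I/O operations on unit 6 never overlap.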
Erik Kluzek < erik > - 2014-08-15 17:25:06 -0600 Using a non-threaded layout (300x1), the compset is now running. The snow balance warning at the beginning is...
275: WARNING: snow balance error
Note that I also made the change from bug 2026, as I thought that might be a problem here as well.
Erik Kluzek < erik > - 2014-08-15 19:31:06 -0600 Yep, non-threaded layout PASS'es...
PASS SMS_D_P300x1.f09_g16.B1850BPRPL45BGC.yellowstone_intel
Erik Kluzek < erik > - 2014-08-16 14:25:26 -0600 It turns out the aux_clm_short tests on goldbach for intel, nag and pgi show this same problem with the tests... PET_P16x2_D.f10_f10.I1850CLM45BGC.goldbach_nag.clm-ciso (for each compiler in turn). For the nag compiler, the build of MCT fails, because of an issue with the gcc compiler...
Erik Kluzek < erik > - 2014-08-22 10:36:20 -0600 OK, I thought this was coming from a non-threaded region -- but it's NOT. Balance check is inside a threaded loop in clm_driver. So I think this is correctly identifying a problem.
Erik Kluzek < erik > - 2014-08-22 17:53:40 -0600 OK, I was able to make some changes to the threaded version to get it to work. The problem is that GetGlobalIndex figures out the bounds itself (and assumes processor bounds rather than clump bounds) instead of having the bounds passed in. If you pass bounds into GetGlobalIndex, it works. That ended up being a lot harder than I wanted it to be, but I was able to do it, and now it works. The code diff is about a thousand lines, so I won't show all of it, but here's an example...
- function initial_template_col_soil(c_new) result(c_template)
+ function initial_template_col_soil(bounds, c_new) result(c_template)
!
! !DESCRIPTION:
! Find column to use as a template for a vegetated column that has newly become active.
@@ -148,8 +148,9 @@
use clm_varcon, only : ispval
!
! !ARGUMENTS:
- integer :: c_template ! function result
- integer , intent(in) :: c_new ! column index that needs initialization
+ type(bounds_type) , intent(in) :: bounds ! bounds
+ integer :: c_template ! function result
+ integer , intent(in) :: c_new ! column index that needs initialization
!
! !LOCAL VARIABLES:
@@ -159,7 +160,7 @@
if (col%wtgcell(c_new) > 0._r8) then
write(iulog,*) subname// ' ERROR: Expectation is that the only vegetated columns that&
& can newly become active are ones with 0 weight on the grid cell'
- call endrun(decomp_index=c_new, clmlevel=namec, msg=errMsg(__FILE__, __LINE__))
+ call endrun(bounds, decomp_index=c_new, clmlevel=namec, msg=errMsg(__FILE__, __LINE__))
end if
c_template = ispval
Bill might be concerned that the above could break the unit tests, but I will look at them and make sure they still work.
Erik Kluzek < erik > - 2014-11-18 13:19:17 -0700 This issue shows up in cesm1_3_alpha14d with the following tests that fail...
FAIL SMS_D.f09_g16.BPIPDC5L45BGC.edison_intel
FAIL ERI_PT.f09_g16.B1850C5L45BGC.edison_intel
RUN ERS.f09_g16.BPIPDC5L45BGC.edison_intel.GC.cesm1_3_alpha14d
RUN ERS_PT.f09_g16.BRCP85C5L45BGC.edison_intel.GC.cesm1_3_alpha14d
and the same tests fail on yellowstone_intel.
Bill Sacks < sacks > - 2014-11-20 14:40:36 -0700 I think the solution is to add an optional argument to get_proc_bounds, like 'force', and if that's provided, then it gets the proc bounds even if you're in a threaded region.
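A rough sketch of how such an optional 'force' argument might behave is given below. Everything here is an assumption made for illustration: `get_proc_bounds_sketch`, its argument list, and the placeholder bounds are hypothetical, the real get_proc_bounds has a different interface, and the threaded-region check is replaced by a plain logical input so the example runs without OpenMP. It also returns a status flag where the real routine aborts.

```fortran
module bounds_sketch
  implicit none
  private
  public :: get_proc_bounds_sketch

contains

  ! Hypothetical, simplified stand-in for get_proc_bounds.
  subroutine get_proc_bounds_sketch(in_threaded_region, begc, endc, ok, force)
    logical, intent(in)           :: in_threaded_region ! stands in for an OpenMP check
    integer, intent(out)          :: begc, endc         ! placeholder column bounds
    logical, intent(out)          :: ok                 ! .false. where the real code would abort
    logical, intent(in), optional :: force              ! override the threaded-region check
    logical :: l_force

    ! An absent optional argument must not be referenced, so copy it
    ! into a local variable with a default value.
    l_force = .false.
    if (present(force)) l_force = force

    if (in_threaded_region .and. .not. l_force) then
      begc = 0
      endc = -1
      ok = .false.
      return
    end if

    begc = 1      ! placeholder values, not real decomposition data
    endc = 100
    ok = .true.
  end subroutine get_proc_bounds_sketch

end module bounds_sketch

program force_demo
  use bounds_sketch, only : get_proc_bounds_sketch
  implicit none
  integer :: begc, endc
  logical :: ok

  ! Default behavior: refused inside a threaded region.
  call get_proc_bounds_sketch(.true., begc, endc, ok)
  write(*,*) 'without force: ok = ', ok

  ! With force: processor bounds are returned anyway.
  call get_proc_bounds_sketch(.true., begc, endc, ok, force=.true.)
  write(*,*) 'with force:    ok = ', ok, ' bounds = ', begc, endc
end program force_demo
```

Keeping the argument optional means existing callers are unaffected, and only error-reporting paths such as the one in this issue would need to pass it.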
Bill Sacks < sacks > - 2014-11-20 14:41:02 -0700 Mariana put a temporary workaround in her branch that is to become r097.
Erik Kluzek < erik > - 2015-04-03 15:32:07 -0600 The new test ERP_P15x2_D_Ld5.f10_f10.I1850CLM45BGC.goldbach_nag.clm-ciso shows this same problem.
Erik Kluzek < erik > - 2016-01-07 16:14:35 -0700 Still seems to be an issue on hobart_nag as of clm4_5_6_r158, with the following notes in the expected fails file.
FAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_intel.clm-ciso
CFAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_nag.clm-ciso
FAIL PET_P12x2_D.f10_f10.I1850CLM45BGC.hobart_pgi.clm-ciso
However, we removed all the threaded ciso tests from hobart_nag. But we do have a threaded test on yellowstone that is working (as of clm4_5_7_r164)...
ERP_P15x2_D_Ld5.f10_f10.I1850CLM45BGC.yellowstone_pgi.clm-ciso
The following test now passes in ctsm1.0.dev055...
PET_D_Ld10_P48x2.f10_f10_musgs.IHistClm50BgcCrop.hobart_nag.clm-ciso_decStart
so I'm marking this closed, as it looks like it's not a problem anymore.
Erik Kluzek < erik > - 2014-08-15 15:54:33 -0600
Bugzilla Id: 2027
Bugzilla Depends: 2026,
Bugzilla CC: dlawren, jshollen, klindsay, muszala, mvertens, rfisher, sacks,
We added a new compset B1850BPRPL45BGC that blows up shortly after initialization with a balance check error.
300: memory_write: model date = 10101 82800 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
300: memory_write: model date = 10101 84600 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
300: memory_write: model date = 10102 0 memory = 272.46 MB (highwater) -0.00 MB (usage) (pe= 300 comps= OCN)
102: WARNING: snow balance error
100: WARNING: snow balance error
100:forrtl: severe (40): recursive I/O operation, unit 6, file unknown
100:Image PC Routine Line Source
100:libirc.so 00002AE45FB70A1E Unknown Unknown Unknown
100:libirc.so 00002AE45FB6F4B6 Unknown Unknown Unknown
100:cesm.exe 000000000237A6F2 Unknown Unknown Unknown
100:cesm.exe 00000000022FB33C Unknown Unknown Unknown
100:cesm.exe 000000000235542C Unknown Unknown Unknown
100:cesm.exe 0000000001CBCE47 shr_sys_mod_mp_pr 498 shr_sys_mod.F90
100:cesm.exe 0000000001CBD590 shr_sys_mod_mp_sh 280 shr_sys_mod.F90
100:cesm.exe 0000000001101A2B decompmod_mp_get_ 209 decompMod.F90
100:cesm.exe 00000000013E02A1 getglobalvaluesmo 44 GetGlobalValuesMod.F90
100:cesm.exe 00000000012467B6 balancecheckmod_m 439 BalanceCheckMod.F90
100:cesm.exe 0000000000FC79BA clm_driver_mp_clm 544 clm_driver.F90
100:libiomp5.so 00002AE46003D4F3 Unknown Unknown Unknown
102:forrtl: severe (40): recursive I/O operation, unit 6, file unknown
102:Image PC Routine Line Source
102:libirc.so 00002B0CA5E86A1E Unknown Unknown Unknown
102:libirc.so 00002B0CA5E854B6 Unknown Unknown Unknown
102:cesm.exe 000000000237A6F2 Unknown Unknown Unknown
102:cesm.exe 00000000022FB33C Unknown Unknown Unknown
102:cesm.exe 000000000235542C Unknown Unknown Unknown
102:cesm.exe 0000000001CBCE47 shr_sys_mod_mp_pr 498 shr_sys_mod.F90
102:cesm.exe 0000000001CBD590 shr_sys_mod_mp_sh 280 shr_sys_mod.F90
102:cesm.exe 0000000001101A2B decompmod_mp_get_ 209 decompMod.F90
102:cesm.exe 00000000013E02A1 getglobalvaluesmo 44 GetGlobalValuesMod.F90
102:cesm.exe 00000000012467B6 balancecheckmod_m 439 BalanceCheckMod.F90
102:cesm.exe 0000000000FC79BA clm_driver_mp_clm 544 clm_driver.F90
102:libiomp5.so 00002B0CA63534F3 Unknown Unknown Unknown
100:INFO: 0031-306 pm_atexit: pm_exit_value is 40.
102:INFO: 0031-306 pm_atexit: pm_exit_value is 40.
This is with cesm1_3_alpha12b externals and
scripts https://svn-ccsm-models.cgd.ucar.edu/scripts/trunk_tags/scripts4_140814a
scripts/ccsm_utils/Machines https://svn-ccsm-models.cgd.ucar.edu/Machines/trunk_tags/Machines_140811
out of the box for a SMS.f09_g16.B1850BPRPL45BGC.yellowstone_intel case.