fatal RRM bug in atmosphere dycore #684

Closed
mt5555 opened this issue Feb 8, 2016 · 7 comments · Fixed by #686
@mt5555
Contributor

mt5555 commented Feb 8, 2016

RRM is currently not working due to a bug in the HOMME dycore. The bug appears to date back to a regression in r4602 of the HOMME NCAR subversion repository. ACME inherited this code when we created the ACME HOMME branch and merged it into ACME in fall 2015.

@mt5555 mt5555 self-assigned this Feb 8, 2016
@mt5555
Contributor Author

mt5555 commented Feb 8, 2016

Current status, using the ACME version of standalone HOMME (dry, dycore-only simulations):

  1. shallow water 1 layer: working correctly with RRM meshes
  2. shallow water with 4 layers: segfault
  3. hydrostatic equations: segfault

This suggests that most of the RRM infrastructure is working correctly, but there must be an issue related to pack/unpack of data with the "k" layer index.

@tangq
Contributor

tangq commented Feb 8, 2016

@mt5555, I did some RRM simulations in fall 2015, so I hope some of this information is helpful.
The last version I used is v0.3-138-g61acbde and it worked fine. This version does NOT have the SGH fix.
@wlin7 may have some simulations with a newer version?

@mt5555
Contributor Author

mt5555 commented Feb 8, 2016

The SGH fix was just an update to the SGH field in the topo boundary condition file, so it did not involve any code changes.

I'm pretty sure this regression happened when we brought in the ACME branch of HOMME:

17731cc Merge branch 'jgfouca/homme/bring_in_as_subtree_attempt_2' into master (PR #362)

Before that, we had an old version of HOMME inherited from CESM beta10.

I will work on debugging it. I think the best way is to compare shallow water with 1 layer and with 4 layers. In the 4 layer case, all 4 layers should be identical, so finding out where they depart should lead to the buggy routine.
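
The layer-comparison idea can be sketched as follows. This is only an illustration of the strategy, not HOMME code: the function names, the stage list, and the nested-list field layout (`field[k][i]` for layer `k`, point `i`) are all assumptions made for the example.

```python
def first_divergent_layer(field, tol=0.0):
    """Return (k, i, reference, value) for the first point where layer k
    differs from layer 0 by more than tol, or None if all layers agree.

    field[k][i] is the value at layer k, point i (hypothetical layout).
    """
    reference = field[0]
    for k in range(1, len(field)):
        for i, (ref, val) in enumerate(zip(reference, field[k])):
            if abs(val - ref) > tol:
                return k, i, ref, val
    return None


def first_stage_breaking_symmetry(stages, field):
    """Apply (name, function) stages in order and return the name of the
    first stage after which the layers stop being identical, together with
    the mismatch location. Returns None if symmetry is never broken."""
    for name, stage in stages:
        field = stage(field)
        hit = first_divergent_layer(field)
        if hit is not None:
            return name, hit
    return None


# Two made-up stages: one treats every layer the same, one does not.
def scale(field):
    return [[2.0 * v for v in layer] for layer in field]


def bad_exchange(field):
    out = [list(layer) for layer in field]
    out[3][0] += 1.0  # deliberately corrupt one value in the last layer
    return out


if __name__ == "__main__":
    layers = [[1.0, 2.0, 3.0] for _ in range(4)]  # 4 identical layers
    print(first_stage_breaking_symmetry(
        [("scale", scale), ("bad_exchange", bad_exchange)], layers))
```

In a real run the check would be placed after each dycore routine in the 4 layer shallow water case; the first routine after which the layers differ is the one to inspect.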

@tangq
Contributor

tangq commented Feb 8, 2016

That's great. We know where to look for the bug. Although debugging this problem is beyond my specialty, let me know if you think I can help in solving it.

@ghost

ghost commented Feb 9, 2016

In my case I see it stops here in the atm.log:

creating cube topology...
Set up grid vertex from mesh...

Whereas, in the cesm.log I get this:

12: [PE_12]: cpumask set to 4 cpus on nid00012, cpumask = 000000000000000000000011000000000000000000000011
12: forrtl: severe (408): fort: (2): Subscript #1 of the array NBRS_PTR has value 10 which is greater than the upper bound of 9
12:
12: Image PC Routine Line Source
12: cesm.exe 0000000006D39330 Unknown Unknown Unknown
12: cesm.exe 000000000315D2BC cube_mod_mp_cubes 2307 cube_mod.F90
12: cesm.exe 0000000000B47E5E mesh_mod_mp_meshc 1255 mesh_mod.F90
12: cesm.exe 000000000405A1FB prim_driver_mod_m 259 prim_driver_mod.F90
12: cesm.exe 000000000322A849 dyn_comp_mp_dyn_i 177 dyn_comp.F90
12: cesm.exe 0000000000A3F711 inital_mp_cam_ini 31 inital.F90
12: cesm.exe 00000000007B25DB cam_comp_mp_cam_i 159 cam_comp.F90
12: cesm.exe 000000000076FA77 atm_comp_mct_mp_a 262 atm_comp_mct.F90
12: cesm.exe 0000000000449236 component_mod_mp_ 229 component_mod.F90
12: cesm.exe 0000000000412754 cesm_comp_mod_mp_ 1145 cesm_comp_mod.F90
12: cesm.exe 0000000000439ACA MAIN__ 102 cesm_driver.F90
12: cesm.exe 000000000040090E Unknown Unknown Unknown
12: cesm.exe 0000000006E12421 Unknown Unknown Unknown
12: cesm.exe 00000000004007F5 Unknown Unknown Unknown

@mt5555
Contributor Author

mt5555 commented Feb 9, 2016

That's clearly a bug. But it appears to be harmless because "ii" and np0 are never used. It seems all the code doing out-of-bounds array lookups should be deleted:

ii=Edge%tail_face

!map to correct location - for now all on same nbr side have same wgt, so take the first o$
ii = Edge%tail%nbrs_ptr(ii)

np0 = Edge%tail%nbrs_wgt(ii)
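
A small sketch of why an unused lookup can still abort a run when array bounds are checked at run time, as the cesm.log message above suggests. The `OneBased` helper, the function names, and the array contents are assumptions for illustration; only the index value 10 and the upper bound 9 come from the log.

```python
class OneBased:
    """List wrapper with Fortran-style 1-based, bounds-checked indexing
    (hypothetical helper for this example)."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        upper = len(self._values)
        if not 1 <= index <= upper:
            raise IndexError(f"subscript {index} outside bounds 1:{upper}")
        return self._values[index - 1]


def dead_lookup(tail_face, nbrs_ptr, nbrs_wgt):
    """Mirror the three statements quoted above. The results are never
    used, but the indexing itself is still executed and checked."""
    ii = tail_face
    ii = nbrs_ptr[ii]   # fails here when tail_face exceeds the upper bound
    np0 = nbrs_wgt[ii]  # np0 is intentionally unused, as in the original
    return None


def lookup_is_safe(tail_face, nbrs_ptr, nbrs_wgt):
    """Return True if the dead lookup completes without a bounds error."""
    try:
        dead_lookup(tail_face, nbrs_ptr, nbrs_wgt)
        return True
    except IndexError:
        return False


if __name__ == "__main__":
    ptr = OneBased(range(1, 10))  # 9 entries, made-up contents
    wgt = OneBased([1] * 9)       # 9 entries, made-up contents
    print(lookup_is_safe(3, ptr, wgt))
    print(lookup_is_safe(10, ptr, wgt))
```

Without bounds checking the same statements would read past the end of the array and discard the result, which is consistent with the lookups being harmless in normal builds yet fatal in a checked one.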

mt5555 added a commit that referenced this issue Feb 10, 2016
fixes #684

indexing bug when a corner node was coupled to more than 1 element (for RRM grids)
@mt5555
Copy link
Contributor Author

mt5555 commented Feb 10, 2016

#686

jgfouca added a commit that referenced this issue Feb 23, 2016
bugfix in pack and unpack when using RRM

Fixed an indexing bug when a corner node was coupled to more than 1
element (for RRM grids), and then added tests to prevent these types
of bugs in the future.

new tests:
RRM grids
template generation code
SE-SL test using baroCamMoist executable
Held-Suarez test case
BFB with different number of threads

HOMME regression test expected failures:
swtc1 PASS
swtc2 (new RRM test)
swtc5 PASS
swtc6 rebaseline needed (changed map, added new subtest)
baro2b PASS
baro2c rebaseline needed (switched from leapfrog to RK)
baroCamMoist PASS
baroCamMoist-SL (new test)
templates (new test)

fixes #684
fixes #488

* origin/mt5555/homme/rrmfix:
  moved openMP test from expensive baro2b to cheap baroCamMoist
  splitting baroCamMoist into two separate tests
  adding forgotten namelist file for new test
  removing forgotten debug print statements
  a few more tweaks to baroCAM test case (it was too slow)
  adding SE-SL tracer test to HOMME regression suite
  bug fix in cprnc utility when number of times don't match
  allow optional cprnc RMS tolerance to pass some tests
  update README
  adding capability to call cprnc on two subtest results
  updated baro2c test to use RK instead of LEAPFROG
  bugfix: HOMME's make check was overwriting baseline generation scripts
  adding test of the mesh/scrip file generation capability
  adding ultra-low-res RRM test case using swtc6
  Adding RRM test case to HOMME regression suite
  comment out edge rotation initialization code
  bugfix in pack and unpack when using RRM
ghost pushed a commit that referenced this issue Feb 23, 2016
fixes #684

indexing bug when a corner node was coupled to more than 1 element (for RRM grids)
jgfouca added a commit that referenced this issue Feb 26, 2016
This is the second merge for this PR.

bugfix in pack and unpack when using RRM

Fixed an indexing bug when a corner node was coupled to more than 1
element (for RRM grids), and then added tests to prevent these types
of bugs in the future.

new tests:
RRM grids
template generation code
SE-SL test using baroCamMoist executable
Held-Suarez test case
BFB with different number of threads

HOMME regression test expected failures:
swtc1 PASS
swtc2 (new RRM test)
swtc5 PASS
swtc6 rebaseline needed (changed map, added new subtest)
baro2b PASS
baro2c rebaseline needed (switched from leapfrog to RK)
baroCamMoist PASS
baroCamMoist-SL (new test)
templates (new test)

fixes #684
fixes #488

* origin/mt5555/homme/rrmfix:
  bug fix - array copy bounds error
jgfouca added a commit that referenced this issue Feb 27, 2016
bugfix in pack and unpack when using RRM

Fixed an indexing bug when a corner node was coupled to more than 1
element (for RRM grids), and then added tests to prevent these types
of bugs in the future.

new tests:
RRM grids
template generation code
SE-SL test using baroCamMoist executable
Held-Suarez test case
BFB with different number of threads

HOMME regression test expected failures:
swtc1 PASS
swtc2 (new RRM test)
swtc5 PASS
swtc6 rebaseline needed (changed map, added new subtest)
baro2b PASS
baro2c rebaseline needed (switched from leapfrog to RK)
baroCamMoist PASS
baroCamMoist-SL (new test)
templates (new test)

fixes #684
fixes #488

[non-BFB]

* origin/mt5555/homme/rrmfix:
  Remove "phys_area" from output in HOMME template test
  bug fix - array copy bounds error
  moved openMP test from expensive baro2b to cheap baroCamMoist
  splitting baroCamMoist into two separate tests
  adding forgotten namelist file for new test
  removing forgotten debug print statements
  a few more tweaks to baroCAM test case (it was too slow)
  adding SE-SL tracer test to HOMME regression suite
  bug fix in cprnc utility when number of times don't match
  allow optional cprnc RMS tolerance to pass some tests
  update README
  adding capability to call cprnc on two subtest results
  updated baro2c test to use RK instead of LEAPFROG
  bugfix: HOMME's make check was overwriting baseline generation scripts
  adding test of the mesh/scrip file generation capability
  adding ultra-low-res RRM test case using swtc6
  Adding RRM test case to HOMME regression suite
  comment out edge rotation initialization code
  bugfix in pack and unpack when using RRM
jgfouca added a commit that referenced this issue Feb 27, 2018