Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2-Moment Moist is non-zero-diff with MAPL develop #762

Open
mathomp4 opened this issue Mar 19, 2021 · 50 comments
Open

2-Moment Moist is non-zero-diff with MAPL develop #762

mathomp4 opened this issue Mar 19, 2021 · 50 comments
Assignees
Labels
🪲 Bug Something isn't working ⌛ Long Term Long term issues

Comments

@mathomp4
Copy link
Member

This might be a "we expect that Matt" thing, but my tests last night showed that if you run MOM6 (or MOM5) with MAPL develop it is non-zero-diff to MAPL v2.6.3 (the version in GEOSgcm main). This is a real-life, state-is-different change:

$ nccmp -dqfSB stock-gcm-2021Mar19-1day-c90-MOM6/scratch/fvcore_internal_checkpoint mapldev-gcm-2021Mar19-1day-c90-MOM6/scratch/fvcore_internal_checkpoint
Variable Group   Count         Sum      AbsSum        Min     Max     Range        Mean      StdDev
PE       /     3497028 1.23538e+06 1.19895e+07   -196.425 249.111   445.536    0.353267     8.81481
PKZ      /     3494559     110.963     1086.91 -0.0181781 0.02742 0.0455981 3.17531e-05 0.000776577
PT       /     3499200     1031.71     79307.2    -13.587 13.1718   26.7588 0.000294841   0.0794892
U        /     3499200    -26866.4      819693   -25.0005 35.2817   60.2822 -0.00767788    0.457403
V        /     3499200     4055.96      810394   -32.9243  32.431   65.3552  0.00115911    0.448839

I am going to build GEOS with MAPL v2.6.4 and see if I can at least figure out where the change came in. (CC @sanAkel @yvikhlya )

@mathomp4 mathomp4 added the 🪲 Bug Something isn't working label Mar 19, 2021
@sanAkel
Copy link

sanAkel commented Mar 19, 2021

@mathomp4 Not sure what's different/changed?

MAPL develop it is non-zero-diff to MAPL v2.6.3
?

@yvikhlya
Copy link

Is it because of MAPL update or because of all the staff we merged into develop in GEOSgcm_GridComp?

@mathomp4
Copy link
Member Author

My test is the difference between using MAPL v2.6.3 and MAPL develop. That is all I changed in the nightly tests I'm comparing. The AGCM showed no difference at all, but MOM5/6 do.

@mathomp4
Copy link
Member Author

mathomp4 commented Mar 19, 2021

As I said above, I am building GEOS with MAPL v2.6.4 now. That can at least help us bracket where the difference occurs.

ETA: I should have those results in...30 minutes? I'm building with Release and then it's a 9-node C90 run to be consistent.

@mathomp4
Copy link
Member Author

Update. The difference appears to be between v2.6.4 and develop as v2.6.3 and v2.6.4 are zero-diff. Thus, it must be due to (one of?) these changes:

https://github.com/GEOS-ESM/MAPL/compare/v2.6.4..develop

which is...surprising. There is not much there. I'm going to start grabbing commits and see what I can see.

@tclune
Copy link
Collaborator

tclune commented Mar 19, 2021

In the words of Scooby-doo, "ruh-roh". Yeah - not a FP calculation in sight in those MAPL changes. I guess the next step is to start from 2.6.4 and cherry pick the other things on develop to see which breaks it. This of course assumes its not some weird combination.

But at best this must be some sort of memory corruption from what I can see.

@mathomp4
Copy link
Member Author

First update. These are fine:

  • base/MAPL_CubedSphereGridFactory.F90
  • CHANGELOG.md

@mathomp4
Copy link
Member Author

This of course assumes it's not some weird combination.

Yeah. This is the one that hopefully doesn't happen!

@sanAkel
Copy link

sanAkel commented Mar 19, 2021

First update. These are fine:

  • base/MAPL_CubedSphereGridFactory.F90
  • CHANGELOG.md

@yvikhlya can you list what you worked something out with @bena-nasa on using MAPL_CubedSphereGridFactory.F90 I recall there was some thing:

  • for HISTORY.rc which I am can not be seen in comparing fvcore_internal_checkpoint! so ✖️ that out!
  • there was some sea grid thing ...??

@mathomp4
Copy link
Member Author

I have an idea that I am testing now.

@mathomp4
Copy link
Member Author

First update. These are fine:

  • base/MAPL_CubedSphereGridFactory.F90
  • CHANGELOG.md

@yvikhlya can you list what you worked something out with @bena-nasa on using MAPL_CubedSphereGridFactory.F90 I recall there was some thing:

  • for HISTORY.rc which I am can not be seen in comparing fvcore_internal_checkpoint! so ✖️ that out!
  • there was some sea grid thing ...??

@sanAkel No. the CubeSphere change is fine. That one didn't cause the difference. But as I said above, I think I have an idea

@mathomp4
Copy link
Member Author

Huh. well, my idea didn't pan out. Hmm. I mean, I found a bug and I hoped that was the issue. But nope. Back to a careful survey!

@yvikhlya
Copy link

I never worked out anything in MAPL_CubedSphereGridFactory.F90

@mathomp4
Copy link
Member Author

Update. Looks like if you take all the files in develop except for profiler/BaseProfiler.F90 it is still zero-diff. Baffling. I'll start bisecting that file now...

@mathomp4
Copy link
Member Author

Testing seems to show that the change that matters was going from:

      if (this%stack%empty()) then
         block
           use MPI
           integer :: rank, ierror
           call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
           if (rank == 0) then
             _ASSERT(.false., "Timer "//name//' should not start when empty: ')
           end if
         end block
         return
      else
         node => this%stack%back()
         if (.not. node%has_child(name)) then
           m = this%make_meter()
           call node%add_child(name, m) !this%make_meter())
         end if
      endif

to:

      if (this%stack%empty()) this%status = INCORRECTLY_NESTED_METERS
      _ASSERT_RC(.not. this%stack%empty(),"Timer <"//name// "> should not start when empty.",INCORRECTLY_NESTED_METERS)


      node => this%stack%back()
      if (.not. node%has_child(name)) then
         m = this%make_meter()
         call node%add_child(name, m) !this%make_meter())
      end if

can cause the non-zero-diffness. 🤷🏼‍♂️

@sanAkel
Copy link

sanAkel commented Mar 19, 2021

@mathomp4, @tclune what does node%has_child(name) check? some introduction, please! Thanks!

@tclune
Copy link
Collaborator

tclune commented Mar 19, 2021

The timers form a tree of nodes. has_child(name) returns true/false depending on whether node has a child timer of the given name.

Note that the change above is just changing the structure of the logic. around the block with the has_child()

There really is no evidence that there is any bug anywhere around this implementation, and the implementation it replaced is provably buggy, though a very different sort of bug. The previous version would not have printed an error if a timer mismatch was detected on a non-root process. (An unlikely problem, but ...)

@sanAkel
Copy link

sanAkel commented Mar 19, 2021

@tclune Appreciate that explanation.

The previous version would not have printed an error if a timer mismatch was detected on a non-root process. (An unlikely problem, but ...)

In that case, and if this change is the sole reason for a different answer, I am willing to take the change, by categorizing it as "a bug fix"!

(Adding) @mathomp4 seems have shown that indeed ⤴️ is the only reason for non-zero-diff #762 (comment)

@tclune
Copy link
Collaborator

tclune commented Mar 19, 2021

It's not that simple though. (A) the bug it fixes was not happening in practice, and (B) the change should not affect results anyway.

I agree that we currently have no evidence that the new results are better or worse than the old, but the situation is simply not acceptable. I would prefer to revert the change (after maybe trying more variants) than assuming the change is ok. At least if we get the old answer, it's not the SI team (or my!) fault.

@sanAkel
Copy link

sanAkel commented Mar 19, 2021

It's not that simple though. (A) the bug it fixes was not happening in practice, and (B) the change should not affect results anyway.

I wasn't aware of ⤴️ (B), 🙇 @tclune

I agree that we currently have no evidence that the new results are better or worse than the old, but the situation is simply not acceptable. I would prefer to revert the change (after maybe trying more variants) than assuming the change is ok. At least if we get the old answer, it's not the SI team (or my!) fault.

of course not!

I had been planning on a new run with MOM6 (need to check somethings first). Is it worth:

  1. revert for now?
  2. try it in an experiment?
    So @GEOS-ESM/ocean-team and @GEOS-ESM/gmao-si-team 🤝 ?

@mathomp4
Copy link
Member Author

mathomp4 commented Mar 19, 2021

Update of extra tests running MOM6, 1 day, c90:

Compiler Build Type Zero-diff v2.6.3 v develop
Intel Release
Intel Debug ✔️
GNU Release ✔️
GNU Debug

Uhhhh... yeah. 🤷🏼‍♂️

@tclune
Copy link
Collaborator

tclune commented Mar 19, 2021

Ha! We have a clear direction to proceed!

@mathomp4
Copy link
Member Author

mathomp4 commented Mar 19, 2021

Note: I've probably never compared MAPL main v develop with GNU. For all I know that's normal. ETA: with MOM6. The AMIPs are still zero-diff between GNU Release and GNU Debug

All I really can say is that Intel Release main v develop should be zero-diff!

@tclune
Copy link
Collaborator

tclune commented Mar 19, 2021

@sanAkel We don't really have any advice at the moment. Quite a distressing situation.

@mathomp4
Copy link
Member Author

mathomp4 commented Mar 19, 2021

Indeed. This might provide impetus for a C24 MOM6 case if @sanAkel and @atrayano can whip one up. If nothing else, I could then run MOM6 in all 4 cases. At the moment, I only do Intel Release because C90 is not a cheap test (more so with Debug!)

@yvikhlya
Copy link

yvikhlya commented Mar 22, 2021

Also, though I'll probably need @sanAkel to try and interpret, but I edited MOM_override in each 1step and turned on all the verbose things:

VERBOSITY = 9
DEBUG = True
REPORT_UNUSED_PARAMS = True

I think I remember @sanAkel asking me to do this once, so I figured it couldn't hurt. The first difference I see seems to be under:

2685   │ NOTE from PE     0: callTree:       ---> step_MOM(), MOM.F90
@@ -2693,10 +2693,10 @@
 u-point: c=         0 sw=         0 se=         0 nw=         0 ne=         0 u Before steps  [uv]h
 v-point: mean=   0.0000000000000000E+00 min=   0.0000000000000000E+00 max=   0.0000000000000000E+00 v Before steps  [uv]h
 v-point: c=         0 sw=         0 se=         0 nw=         0 ne=         0 v Before steps  [uv]h
-u-point: mean=   9.6204014744316222E-04 min=  -9.7409898042678833E-01 max=   9.3869018554687500E-01 u Before steps forces%tau[xy]
-u-point: c=   1218438 W=   1218438 u Before steps forces%tau[xy]
-v-point: mean=  -5.8948045659188699E-04 min=  -6.5891179442405701E-01 max=   7.7305356915967050E-01 v Before steps forces%tau[xy]
-v-point: c=   1229423 S=   1219829 v Before steps forces%tau[xy]
+u-point: mean=   9.6497551562971977E-04 min=  -9.7442522644996643E-01 max=   9.3885707855224609E-01 u Before steps forces%tau[xy]
+u-point: c=   1219459 W=   1219459 u Before steps forces%tau[xy]
+v-point: mean=  -5.9255589285543144E-04 min=  -6.5889173746109009E-01 max=   7.7295936413891897E-01 v Before steps forces%tau[xy]
+v-point: c=   1229480 S=   1219860 v Before steps forces%tau[xy]
 h-point: mean=   0.0000000000000000E+00 min=   0.0000000000000000E+00 max=   0.0000000000000000E+00 Before steps forces%p_surf
 h-point: c=         0 Before steps forces%p_surf
 h-point: mean=   8.5618059390879411E-04 min=   0.0000000000000000E+00 max=   2.1526155318520304E-03 Before steps forces%ustar

so, Before steps forces%tau[xy] looks like the first time MOM does "something" in its run step? The mom source is a bit confusing to me 😄

This could be due to non zero difference in atmosphere stress forcing. This would have no effect in AMIP mode, since AMIP runs with data ocean.

@sanAkel
Copy link

sanAkel commented Mar 22, 2021

@mathomp4 It may seem unrelated but perhaps it does?
GEOS-ESM/GEOSgcm_GridComp#409 It is not something I understand well, GST, WW bug fix was tested with coupled right? I don' t recall testing its impact and neither do I see any note in the conversations. Of course, the only change here seems MAPL and change in the order of children as @tclune explained #762 (comment) And your testing shows it manifests in different configurations!

@mathomp4
Copy link
Member Author

Well all of my tests have that fix in it. So I am comparing like to like. I only changed the MAPL branch. That is it.

@mathomp4
Copy link
Member Author

As an extra data point, I build GEOS with FV_PRECISION=R8 and that does not fix this issue. So, it does not seem to be related to the r4/r8 FMS business. Maybe.

@mathomp4
Copy link
Member Author

Another update. @tclune had a thought to try both MAPL main and develop but just compile MOM6 with Debug while keeping everything else in Release. However, doing this, still is non-zero-diff between MAPL main and develop.

@mathomp4
Copy link
Member Author

Since this is MOM related, I suppose my next test will be to turn off heap arrays in...MAPL? Everywhere? We know MOM was sensitive to heap-arrays in FMS, maybe this is now showing its head in MAPL

@mathomp4
Copy link
Member Author

Well, not sure what's happening. I've tried:

  • -no-heap-arrays everywhere: FAIL
  • Only MOM6 Debug: FAIL
  • Only FMS Debug: FAIL
  • Only MAPL.profiler Debug: FAIL
  • Only MAPL.profiler and MAPL.shared Debug: FAIL

@sanAkel
Copy link

sanAkel commented Mar 24, 2021

Well, not sure what's happening. I've tried:

  • -no-heap-arrays everywhere: FAIL
  • Only MOM6 Debug: FAIL
  • Only FMS Debug: FAIL
  • Only MAPL.profiler Debug: FAIL
  • Only MAPL.profiler and MAPL.shared Debug: FAIL

@mathomp4 what does
FAIL =

  • crash?
  • non-zero diff?
  • ...?

@mathomp4
Copy link
Member Author

FAIL means that MAPL main does not equal MAPL develop

@sanAkel
Copy link

sanAkel commented Mar 24, 2021

FAIL means that MAPL main does not equal MAPL develop

Does that mean, -no-heap-arrays is not related to Debug build for MOM6, FMS, MAPL, etc? basically #762 (comment) is a dead end? 😞

@mathomp4
Copy link
Member Author

FAIL means that MAPL main does not equal MAPL develop

Does that mean, -no-heap-arrays is not related to Debug build for MOM6, FMS, MAPL, etc? basically #762 (comment) is a dead end? 😞

Yeah. All my tests so far have shown no change (save for all Debug)

@mathomp4 mathomp4 changed the title Coupled Model is non-zero-diff with MAPL develop 2-Moment Moist is non-zero-diff with MAPL develop Apr 7, 2021
@mathomp4
Copy link
Member Author

mathomp4 commented Apr 7, 2021

After many many tests, it looks like it is the 2-moment microphysics which causes this. The good news is we can see this without needing to run MOM which allows for tinier problems. Might be something to bring in @dbarahon at some point to see if maybe 2-moment has a possible memory issue somewhere? 🤷🏼‍♂️

@mathomp4
Copy link
Member Author

Quoth @tclune: Bill is not surprised. Closing.

@sanAkel
Copy link

sanAkel commented Jul 27, 2022

@dbarahon I just re-opened this issue for your attention. Thanks!

@dbarahon
Copy link

dbarahon commented Jul 27, 2022

Thought it was resolved. Is this still the issue? 2-Moment Moist is non-zero-diff with MAPL develop
#762

@sanAkel
Copy link

sanAkel commented Jul 27, 2022

After many many tests, it looks like it is the 2-moment microphysics which causes this. The good news is we can see this without needing to run MOM which allows for tinier problems. Might be something to bring in @dbarahon at some point to see if maybe 2-moment has a possible memory issue somewhere? 🤷🏼‍♂️

@dbarahon please see ⬆️ . Essentially the 2-moment option is used only within the scope of coupled model and that's how we keep tripping on this and other issue(s) that I pointed you two. But this issue clearly tells us that the problem with 2-moment option is not-regressing even in un-coupled (AMIP) mode.

Thought it was resolved. Is this still the issue? #762
#762

No. This was closed because the problem was certainly not due to MAPL, but choosing 2-moment option.

@mathomp4 mathomp4 pinned this issue Aug 8, 2022
@mathomp4 mathomp4 unpinned this issue Aug 8, 2022
@stale
Copy link

stale bot commented Sep 25, 2022

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days, it will be closed. You can add the "long term" tag to prevent the Stale bot from closing this issue.

@stale stale bot added the ❄️ Stale This issue has been marked stale label Sep 25, 2022
@mathomp4
Copy link
Member Author

Not stale...though still unknown.

@sanAkel sanAkel added the ⌛ Long Term Long term issues label Sep 26, 2022
@stale stale bot removed the ❄️ Stale This issue has been marked stale label Sep 26, 2022
@sanAkel
Copy link

sanAkel commented Sep 26, 2022

@mathomp4 indeed status is unknown. Added long term so it does not get auto-closed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🪲 Bug Something isn't working ⌛ Long Term Long term issues
Projects
None yet
Development

No branches or pull requests

7 participants