Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hard coded 3 hourly frequency in gocart2g #146

Closed
bena-nasa opened this issue May 2, 2022 · 10 comments
Closed

Hard coded 3 hourly frequency in gocart2g #146

bena-nasa opened this issue May 2, 2022 · 10 comments

Comments

@bena-nasa
Copy link
Contributor

@amdasilva @pcolarco @adarmenov @weiyuan-jiang @tclune @mathomp4

There is this issue reported in UFS that appears to involve gocart2g:
ufs-community/ufs-weather-model#1190

now Matt noticed that a 4 hour vs 2 + 2 start/stop regression was failing. We then remembered there is a hard coded 3 hourly alarm in gocart2g! In these spots:
https://github.com/GEOS-ESM/GOCART/blob/v2.0.6/ESMF/GOCART2G_GridComp/NI2G_GridComp/NI2G_GridCompMod.F90#L393
https://github.com/GEOS-ESM/GOCART/blob/v2.0.6/ESMF/GOCART2G_GridComp/SU2G_GridComp/SU2G_GridCompMod.F90#L480

If I change this to 2 hours, the 4 hour vs 2 + 2 regression passes. So it does seem possible this could be the problem for UFS. At the very least it seems like either whatever the point of this alarm is it either needs to be divisible by the job segment or perhaps we are not saving something in the internal state that needs to be checkpointed.

@pcolarco
Copy link
Contributor

pcolarco commented May 2, 2022

@bena-nasa I won't have the time this week to give a thoughtful look at this, sorry. But if we add H2O2 to the internal state in GOCART2G (this is I think the field which requires the alarm cycling) that will solve this problem?

@bena-nasa
Copy link
Contributor Author

@pcolarco thanks for the fast response. I agree, based my quick look at the code, maybe adding h2o2 to the internal state could solve it. I'll just try adding that here real quick and see if it fixes the issue

@bena-nasa
Copy link
Contributor Author

bena-nasa commented May 2, 2022

Oh boy, there are multiple things going on here.
For starters, that alarm is created in both components with a reference time of the current clock time when you start the model. It looks like someone was trying to do the right thing but then didn't (see the ringtime set to 0z of the current day but then never actually used) So that means if you start at 21z it will ring at 0z, 03z, etc..., but if you start at 21z, then stop at 23z, then restart, it will ring at 2z, 5z, etc. It needs to be relative to the same time each day otherwise this just doesn't make sense. I will need to fix this first. I'm not clear after my first code readthrough if that is enough or if I will also need to add something to the checkpoint.

@amdasilva
Copy link
Collaborator

amdasilva commented May 2, 2022

There was a reason why the 3 hour alarm was hardwired, as not to give the user the illusion that they could specify any other value. An easier solution may involve changing the way we handle these oxidants. The way this oxidant is "recycled" always apperead contrived in my opinion. So, stop trying to find a way to address this in code. There is no deep mandate to keep this algorithm. Let us discuss this in our aerosol group meeting.

@tclune
Copy link
Contributor

tclune commented May 3, 2022

At the bare minimum - if the 3 hour cannot be changed and it is not safe to checkpoint at other intervals, then invalid start/checkpoint times must be detected and provide a clear failure message. Getting plausible but incorrect results is (almost) the worst possible failure.

@bena-nasa
Copy link
Contributor Author

bena-nasa commented May 3, 2022

@tclune @amdasilva I tried for quite a while yesterday to fix this. I think there are two issues.

  1. The alarm created needs the reference time to be set to a fix time each day not the model start time as it currently is.
  2. This 3 hourly field needs to be checkpointed.

Looking at the code I thought this would be a very trivial modification. Unfortunately we have another case of ESMF alarms not functioning properly. I tried setting the initial ringtime to 0z of that day (indeed if you look at the code some did set a new time to be 0z of the current day, then it is not used a few lines later when the alarm is created). In the 4 hour run the alarm ran at 21z but did not ring at 0z. In the 2+2 run the alarm ran at 21z, then again at 23z when the model was restarted. ESMF is well aware of the alarm issues and has been rewriting the alarm algorithms to a point where I can test them.

@bena-nasa
Copy link
Contributor Author

@tclune @amdasilva @rmontuoro
To summarize the issues:

  1. This needs to be the ringtime that is set 0z of the current day rather than the current time of the clock: https://github.com/GEOS-ESM/GOCART/blob/v2.0.7/ESMF/GOCART2G_GridComp/NI2G_GridComp/NI2G_GridCompMod.F90#L403
  2. This internal variable that is updated by the import periodically needs to be checkpointed if there is any possibility of the model being run NOT at the alarm frequency:
    https://github.com/GEOS-ESM/GOCART/blob/v2.0.7/ESMF/GOCART2G_GridComp/NI2G_GridComp/NI2G_GridCompMod.F90#L794
    If you look self%xhno3 is passed to subroutines a little later where it is modified. So this needs to checkpointed in the ESMF internal state. To do this a internal new variable must be added to the statespec. Then at the end of the run method self%xhno3 should be copied to this the pointer that points to the internal state variable.

The issue however is that I tried exactly this and found that the ESMF_Alarm was not behaving properly as detailed in the previous comment.
My recommendation would be to implement a "simple alarm" for now we test the new ESMF_Alarm implementation and figure this. We could easily implement an alarm like class in the component internal state that rings at a periodic time.

@bena-nasa
Copy link
Contributor Author

bena-nasa commented Jun 17, 2022

@amdasilva @pcolarco @tclune @weiyuan-jiang @rmontuoro
Update to this issue. As originally reported here, there is a hard coded alarm in the nitrate component, that if you do not run your segments so they start and stop when the alarm is ringing, you can not get start/stop regression.

As I said, I thought there were 2 issues, 1 the alarm was just being created in a bad way, and needed to be relative to the same time each day, not whenever you started the model! And the self%xhno3 needed to be saved to the internal state. I had tried both of these but the issues with the esmf alarm not behaving right after I "fixed" it to be relative to 0z of the current day when created meant this was doomed.

I've been testing a new ESFM alarm branch where they have completely rewritten the alarm implementation to fix the issues with it, and indeed, now the alarm is behaving properly. (An alarm set to ring at 0z and every n hours after that does actually do that). I then tested the regression with the new ESMF branch + fixes to what I thought were the issues with the NI GridComp that I've detailed in several post in this issue

But, despite this, it still did not pass a 2 hour + 2 hour regression with a 3 hourly alarm in the code.

I even tried, running say a 2 step run starting at 21z vs 2 1 step runs! This still fails, the strange thing is that all the restarts are zero diff except the SU restart. If you let regression run longer than the alarm frequency everything goes non-zero diff.

I noticed that su has the same bad logic for an alarm and has special first time the gridcomp executes, so I can't how this would regress either! I will fix that and report if that fixes the regression problems. Hopefully there is no more specious logic like this in the other grid comps :(

@bena-nasa
Copy link
Contributor Author

bena-nasa commented Jun 17, 2022

@amdasilva @pcolarco @tclune @weiyuan-jiang @rmontuoro
Ok, I figured out how to fix this, mystery solved. The bottom line is

  1. The code in Sulfate and Nitrate will never pass a regression test unless the execution is exaction a multiple of the alarm segment. They are just coded incorrectly.
    I've committed the proper way this these routines should have been coded on this branch:
    https://github.com/GEOS-ESM/GOCART/tree/feature/bmauer/allow_arbitary_start_and_stop

  2. Even with this fix, the ESMF alarms in the current esmf release just is not functioning properly. So either the alarm must be bypassed for now with some other alarm type mechanism until the updated ESMF alarm implementation is released hopefully in ESMF 8.4

@bena-nasa
Copy link
Contributor Author

bena-nasa commented Apr 4, 2023

@amdasilva @tclune would you like me to update the branch I referenced above, merge develop, and invent a little bypass to the ESMF alarms so we can just fix this so that you can do start/stop regression for as long as this 3 hourly logic exists? We really need to be able to do start/stop regression tests at something less than 3 hours, that is just hinderance to debugging and I have the solution with just a little more work as detailed above.

There have been a couple people I've been called on to help trying to do start/stop regression who wasted a bunch of time before talking to myself or Matt because they were not aware of this "limitation" when doing such tests. If the problem were just fixed (and it's not hard to fix...) it would save others confusion and time in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants