calendar error in ESMF #1858

Closed
wwieder opened this issue Sep 21, 2022 · 15 comments
Labels
bug something is working incorrectly priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations

Comments

@wwieder
Contributor

wwieder commented Sep 21, 2022

Brief summary of bug

It looks like changes to cdeps, cmeps, or other externals have introduced an error in how we were spinning up NEON cases.
We're getting the following error in the PET0.ESMF_LogFile

PET0 ESMCI_Calendar.C:1059 ESMCI::Calendar::convertToTime() Input argument out of range  - ; Gregorian: for February 100, dd=29 > 28 days in the month.

General bug information

CTSM version you are using: [output of git describe]

ctsm5.1.dev108

Does this bug cause significantly incorrect results in the model's science? [Yes / No]

Yes, model can't run for spinup over 100 years.

Configurations affected: [Fill this in if known.]

Currently working in just NEON cases, but this should be true in any datasets with Gregorian calendars?

Details of bug

see example here /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run
using calendar 'gregorian'
similar results when we use no-leap:
/glade/scratch/negins/ctsm_calendar_error/tools/site_and_regional/BART.ad/run

Important details of your setup / configuration so we can reproduce the bug

This is an issue for trying to spinup NEON simulations, but may present errors with other long spinups with input data that has leap years

Important output or errors that show the problem

Previously spinups were done using a no-leap calendar, but this does not seem to work anymore. @jedwards4b do you have any suggestions here?

@wwieder wwieder added tag: bug - critical priority: high High priority to fix/merge soon, e.g., because it is a problem in important configurations bug something is working incorrectly labels Sep 21, 2022
@billsacks
Member

Without knowing the details of what you're doing here, it seems to me like this problem at year 100 is not an issue with ESMF, but rather is an issue with trying to cycle through real-world data for spinup. I'm imagining something like: you have data for some number of years, some of which are leap years, and you are trying to cycle through them in a spinup run by using the GREGORIAN calendar. But the problem is that it's not quite right to assume that every 4th year is a leap year: with the GREGORIAN calendar, years 100, 200, 300, 500, 600, 700, 900, etc. are NOT leap years (years 400, 800, etc. ARE leap years). That is presumably the source of the crash you're getting. If you used a JULIAN calendar instead of GREGORIAN you wouldn't have this issue, but I don't think CESM is set up to allow use of a JULIAN calendar. (See also http://earthsystemmodeling.org/docs/nightly/develop/ESMF_refdoc/node6.html#sec:Calendar).
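The century rule described above can be checked with a short sketch. The helper names `is_gregorian_leap` and `is_julian_leap` are made up for illustration and are not part of ESMF, CDEPS, or CESM.

```python
def is_gregorian_leap(year: int) -> bool:
    """Return True if `year` is a leap year under the Gregorian rule."""
    # Divisible by 4, except century years, unless also divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_julian_leap(year: int) -> bool:
    """Return True if `year` is a leap year under the Julian rule."""
    # Every 4th year is a leap year, with no century exception.
    return year % 4 == 0


if __name__ == "__main__":
    for year in (96, 100, 104, 400):
        print(year, is_gregorian_leap(year), is_julian_leap(year))
```

Of the years printed, only year 100 differs between the two rules, which matches the `February 100, dd=29 > 28` message in the log.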

Your best bet may be to try to get this working again with a NOLEAP calendar for spinup.

@wwieder
Contributor Author

wwieder commented Sep 21, 2022

Thanks for this clarification, Bill. Do you have any suggestions? This case does use a NO_LEAP calendar, but it hits the same error listed above.
env_build.xml: <entry id="CALENDAR" value="NO_LEAP">
/glade/scratch/negins/ctsm_calendar_error/tools/site_and_regional/BART.ad

@billsacks
Member

Oh, okay. I hadn't understood that the error was the same for a noleap case. Is it possible that there are two different calendars – one for the model and a separate calendar for the data – and that cdeps is determining the data's calendar based on some metadata on the forcing files, so that even though you're doing a noleap run, the data's calendar is still set to gregorian?

@wwieder
Contributor Author

wwieder commented Sep 21, 2022

Maybe we can discuss this briefly at the SE meeting tomorrow, as I don't really understand the feasibility of this suggestion. Can we revert to an older cdeps tag to get this working in the short term, or are there cdeps changes we can make so that a no-leap calendar is operational for the spinup?

@jedwards4b
Contributor

@wwieder I will look into this and either provide a new cdeps with a fix or recommend an older one that will work.

@mvertens

@wwieder - can you please point me to a case where the no-leap showed the same problem.

@mvertens

@wwieder - sorry - I just saw the case with no leap. I'll look there.

@mvertens

To clarify - the forcing data has a gregorian calendar whereas the crash occurs when the model has either a NO_LEAP or a GREGORIAN calendar. You can see that the forcing data is gregorian from the following:

ncdump -h /glade/scratch/negins/ctsm_calendar_error/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2018-01.nc
......
double time(time) ;
time:units = "days since 2018-01-01 00:00:00" ;
time:long_name = "time" ;
time:calendar = "gregorian" ;

So it looks like the model is dying in 2020 trying to read in Feb. 29 data - which does not exist on the input dataset.
The latest time sample in /glade/scratch/negins/ctsm_calendar_error/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc is 28.9791666666667.
So I think the problem is that you are telling cdeps that your data is on a gregorian calendar but you don't have data for Feb.29 on a leap year. I don't see how this could have ever worked in the past.
One way to fix this is to set the calendar attribute to no_leap in /glade/scratch/negins/ctsm_calendar_error/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc and see if you can get past this point.

Does my explanation make sense?

@wwieder
Contributor Author

wwieder commented Sep 22, 2022

Hi @mvertens, thanks for looking into this. I'm a little confused because the 2020-02 data has a time dimension of 1392. With 30-minute data that's 29 days. Moreover, doesn't time start at 0, so the last time step would fall on Feb 29 even though its value is 28.97... (e.g. the last time step for January is 30.9791666666667, which seems to run fine)?
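The arithmetic above can be sketched as follows. It assumes that each monthly file's time axis starts at offset 0 and that samples are evenly spaced at 30 minutes. The January sample count (31 × 48) is derived here rather than read from the file, and the function names are illustrative only.

```python
SAMPLES_PER_DAY = 48  # 30-minute data: 24 hours * 2 samples per hour


def days_covered(n_samples: int) -> float:
    """Number of days spanned by n_samples evenly spaced 30-minute samples."""
    return n_samples / SAMPLES_PER_DAY


def last_offset_days(n_samples: int) -> float:
    """Time offset of the last sample, in days, when the first sample is at 0."""
    return (n_samples - 1) / SAMPLES_PER_DAY


if __name__ == "__main__":
    print(days_covered(1392))           # February 2020 file: time dimension 1392
    print(last_offset_days(1392))       # last offset falls inside day 29
    print(last_offset_days(31 * 48))    # January, assuming 31 full days of data
```

Under these assumptions, a last offset just below 29 means the final sample sits inside Feb 29, so the file does cover the leap day.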

We can try setting the calendar attribute to no_leap, but it seems like this is not accurate for the data being provided. Finally, @negin513 ran this site with the same input data with an older CTSM tag and didn't have any issues. We can get NEON to reprocess all their input data, but before doing that want to make sure this isn't an issue on our end, as running with tower data like this is a common application of CTSM (beyond NEON).

@mvertens

@wwieder - you are right. I did not read the time dimension correctly. Sorry about that. What was the older CTSM tag? It would be good to compare the CDEPS differences between the two tags.

@wwieder
Contributor Author

wwieder commented Sep 22, 2022 via email

@mvertens

@wwieder - so looking at the PET0 error again -
Gregorian: for February 100, dd=29 > 28 days in the month.
However, when I look at the atm.log file for the case where the model calendar is also gregorian - I get numerous correct reads:

(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc 1392
atm : model date 240229 82800
(shr_strdata_readstrm) close : /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc
(shr_strdata_readstrm) opening : /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc
(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc 1
atm : model date 240229 84600
(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc 2
atm : model date 240301 0

.....

(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc 1392
atm : model date 280229 82800
(shr_strdata_readstrm) close : /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-02.nc
(shr_strdata_readstrm) opening : /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc
(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc 1
atm : model date 280229 84600
(shr_strdata_readstrm) reading file ub: /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/BART.ad/run/inputdata/atm/cdeps/v2/BART/BART_atm_2020-03.nc 2
atm : model date 280301 0

And the model seems to complete - but we get a PET0 log file with an error.

Not sure why this is happening.
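As a rough illustration of why the reads at model dates 240229 and 280229 succeed while year 100 fails, the sketch below assumes the logged model date is an integer coded as year*10000 + month*100 + day. That coding is an assumption based on the log lines above. The function names are hypothetical, and Python's `calendar` module stands in for the ESMF Gregorian calendar.

```python
import calendar


def split_ymd(ymd: int) -> tuple:
    """Split an integer date coded as year*10000 + month*100 + day."""
    year, rest = divmod(ymd, 10000)
    month, day = divmod(rest, 100)
    return year, month, day


def is_valid_gregorian(ymd: int) -> bool:
    """Return True if the coded date exists in the (proleptic) Gregorian calendar."""
    year, month, day = split_ymd(ymd)
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the first day, number of days in the month).
    return 1 <= day <= calendar.monthrange(year, month)[1]


if __name__ == "__main__":
    for ymd in (240229, 280229, 1000229):
        print(ymd, split_ymd(ymd), is_valid_gregorian(ymd))
```

Years 24 and 28 are Gregorian leap years, so Feb 29 exists for them. Year 100 is not, which is consistent with the PET0 message.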

@negin513
Contributor

I am tracking down which CDEPS version is causing this issue. But here is an example of a similar run that has completed with ctsm5.1.dev098.

/glade/scratch/negins/neon_v2/tools/site_and_regional/KONA.ad

I compared the inputdata (2020-02) from this to the current runs and it seems like they are identical:

 cprnc -m /glade/scratch/negins/neon_ctsm_v2_final/tools/site_and_regional/KONA.ad/run/inputdata/atm/cdeps/v2/KONA/KONA_atm_2020-02.nc /glade/scratch/negins/neon_v2/tools/site_and_regional/KONA.ad/run/inputdata/atm/cdeps/v2/KONA/KONA_atm_2020-02.nc |tail -n 10
 A total number of     20 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
               and      0 had different data types
 A total number of      0 fields could not be analyzed
 A total number of      0 fields on file 1 were not found on file 2.
 A total number of      0 fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL 

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Sep 22, 2022
@jedwards4b
Contributor

I would like to propose something like this as a solution: ESCOMP/CDEPS#191
If ESMF returns a calendar error when date is 0229, then just shift the date to 0301 and try again.
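The idea of retrying with 0301 can be sketched as below. This is not the code in ESCOMP/CDEPS#191. It is a minimal Python stand-in in which `datetime.date` plays the role of the ESMF time constructor and its `ValueError` plays the role of the ESMF calendar error. The function name `date_or_march_first` is hypothetical.

```python
from datetime import date


def date_or_march_first(year: int, month: int, day: int) -> date:
    """Build a date, shifting Feb 29 to Mar 1 when the year has no leap day."""
    try:
        return date(year, month, day)
    except ValueError:
        # Only the Feb 29 case is shifted; any other invalid date still fails.
        if (month, day) == (2, 29):
            return date(year, 3, 1)
        raise


if __name__ == "__main__":
    print(date_or_march_first(24, 2, 29))   # leap year: date is kept
    print(date_or_march_first(100, 2, 29))  # non-leap century year: shifted
```

Restricting the fallback to Feb 29 keeps other out-of-range dates visible as errors instead of silently adjusting them.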

@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Sep 22, 2022
@jedwards4b
Contributor

The error was introduced in cdeps v0.12.42.

This was referenced Oct 1, 2022