
What ocean and sea ice initialization files should we be using? #1580

Closed
worleyph opened this issue Jun 9, 2017 · 56 comments

@worleyph
Contributor

worleyph commented Jun 9, 2017

This question first came up in issue #1547 (posed by @ndkeen), but I wanted to give it its own home. It is a question for the science and coupled groups, but it impacts the performance group (among others).

The current ocean (and sea ice) initialization files for the water cycle compsets are not recognized as pnetcdf files within PIO, and the netcdf file reads used instead take a long time, making initialization very expensive. The ocean group provided new files that do not have this problem, but those files generate warnings (for both G and water cycle compsets) of the form

 ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.

when used with current master.

Is there a timeline for updating the water cycle compset definitions to use "true" pnetcdf files? And do the warnings about the ocean initial data impact the performance of the model? I would hate to be optimizing code performance with data that are not representative of production runs.
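For context: in a CIME case the default PIO format can be checked and overridden from the case directory. This is only an illustrative, hedged sketch; the variable name and xmlchange syntax are assumptions about the CIME version in use.

    # Hedged sketch: inspect and override the case-wide PIO format.
    ./xmlquery PIO_TYPENAME              # which PIO format the case will use
    ./xmlchange PIO_TYPENAME=pnetcdf     # request the parallel-netcdf path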

@rljacob
Member

rljacob commented Jun 9, 2017

@jayeshkrishna is PIO somehow mis-identifying the file? I didn't think "pnetcdf" was a file format. There are cdf-2, cdf-5, etc.

@worleyph
Contributor Author

worleyph commented Jun 9, 2017

Repeating relevant comments from issue #1547:

The PIO function open_nf (in file ionf_mod.F90) contains logic to try to open a file using a given PIO type. If this fails, the PIO type is changed. For oRRS18to6v3.170111.nc, the PIO file type starts as '5' (== PIO_iotype_pnetcdf). However, the error return from

ierr = nfmpi_open(File%iosystem%IO_comm,fname,amode,File%iosystem%info,File%fh)

is -51, and the PIO type is then changed to '8' (== pio_iotype_netcdf4p).

...
The error (-51) is

PARAMETER (NF_ENOTNC = -51) ! Not a netcdf file

(but the file can then be opened and read with the pio_iotype_netcdf4p option).

@ndkeen
Contributor

ndkeen commented Jun 10, 2017

Thanks Pat. Just adding a note:
@vanroekel told me at the all-hands meeting that he would give me a streams file (hopefully both an ocean and a sea ice one) that he thought was consistent. What I thought I heard Luke say was that if we were going to use files such as these (in streams.ocean and streams.cice):

/global/cscratch1/sd/lvroekel/mpaso.rst.0039-01-01_00000.nc
/global/cscratch1/sd/lvroekel/mpascice.rst.0039-01-01_00000.nc

which seem to allow faster init times (at least for OCN), then another filename entry in one of those files must be changed for consistency, and potentially to avoid errors.

@worleyph
Contributor Author

Thanks for the update. The coupled group should also weigh in, and hopefully change the compset definitions. Faster initialization (to this degree) is also important for production runs.

@PeterCaldwell
Contributor

I 100% agree that we should be using pnetcdf files rather than netcdf, because the performance impact of this choice is huge. I also agree that we need to make sure the new files we have are producing credible answers. Thanks for bringing this issue up, @worleyph and @ndkeen! I don't know anything about which files are correct vs. wrong. @jonbob, @vanroekel, @mark-petersen - could you talk amongst yourselves and assign someone to sort this out (as a "critical path" task)?

@toddringler

I might be wrong here... but I think the ocean/ice team found that the choice of pnetcdf vs. netcdf depended on file size. @jonbob or @mark-petersen will be able to provide details.

@jonbob
Contributor

jonbob commented Jun 12, 2017

@toddringler is correct -- we've found that having large arrays in the IO requires us to use "pnetcdf,cdf5" for several of the ocn/ice streams. On some really small files, however, pnetcdf can be slower -- so we've moved to assigning pio types to individual files/streams. @PeterCaldwell, we have an issue to track this (#1451) that's been assigned to me -- so I'll also take it on as a "critical path" task.
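For anyone checking which streams already carry an explicit PIO type, a quick hedged way is to grep the MPAS streams files; the run-directory paths below are assumptions.

    # List the per-stream io_type assignments (paths are assumptions).
    grep -n 'io_type' run/streams.ocean run/streams.cice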

@worleyph
Contributor Author

@jonbob and @toddringler, just to be clear, the problem is that the default ocean 18to6 initialization file for the G and A_WCYCL compsets cannot be read as a pnetcdf file (even using pnetcdf,cdf5), as described in the first comment. An alternative file provided by @mark-petersen does work, but it also generates many warnings when the model starts. As such, neither is appropriate for use in production runs, and something else is needed. Based on the above comment by @ndkeen, the alternative file can be made to work with some other changes in the streams.ocean file. Once this is confirmed, the coupled group should consider changing the definition of the relevant compsets.

@worleyph
Contributor Author

Correction - I think that @vanroekel provided the alternative ocean (and sea ice) files. Sorry for the misattribution, Luke.

@vanroekel
Contributor

A quick note of clarity here. There is no such thing as a pnetcdf file; the issue is how PIO interfaces with the netcdf file. The main question is whether the file is cdf-2 or cdf-5. What was happening in @worleyph's and @ndkeen's runs was due to the ocean init file being cdf-2, but with some variables that violate the cdf-2 variable size constraint. This was not an issue when using parallel-netcdf/1.5.0, which had a soft bug that did not check variable sizes. With newer versions of parallel-netcdf, the pnetcdf variable size check returned an error on the original files, so the netcdf interface was used to read the init files instead. The newer files I have sent do not have this issue because they are cdf-5, which is not subject to the variable size constraints.
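As a hedged illustration (the restart name is one mentioned in this thread; the conversion file names are placeholders, and a netCDF build recent enough to know the cdf5 kind is assumed): the on-disk variant can be checked with ncdump, and a cdf-2 file can be rewritten as cdf-5 with nccopy.

    # "64-bit offset" means cdf-2; cdf-5 is reported as "cdf5" or "64-bit data",
    # depending on the netCDF version.
    ncdump -k mpaso.rst.0039-01-01_00000.nc
    # Rewrite a cdf-2 file as cdf-5 (placeholder file names).
    nccopy -k cdf5 input_cdf2.nc output_cdf5.nc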

@ndkeen I have placed new ocean and sea ice restart files you can use for your tests on KNL. I also have included a file with the necessary streams.cice changes for you. They are at

/global/cscratch1/sd/lvroekel/restart_files_for_noel

@worleyph these files are also on Titan if you want to try them, at

/lustre/atlas1/cli115/scratch/vanroek/highres_restarts

@worleyph
Contributor Author

worleyph commented Jun 15, 2017

@vanroekel , thank you for clarifying and cleaning up some (my) sloppiness in the discussion. Again, I believe that it is important for the coupled group to be aware of this issue and to make sure that they are using ocean and sea ice initialization and restart files that are cdf5 files for high resolution runs. It is also important for the default initialization files for the G and A_WCYCL compsets for high resolution meshes to be cdf5 files. I do not know who should be responsible for this latter task, but it probably is not @ndkeen or myself.

@worleyph
Contributor Author

@vanroekel , with regard to (thanks for providing this)

 /lustre/atlas1/cli115/scratch/vanroek/highres_restarts

there is also a pointer to seaice.RRS18to6v3.restartFrom_titan0228.170321.nc in

 <stream name="landIceMasks"
         type="input"
         io_type="pnetcdf"
         input_interval="initial_only"
     filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oRRS18to6v3/seaice.RRS18to6v3.restartFrom_titan0228.170321.nc">

Should this also be replaced by mpascice.rst.0053-01-01_00000_noxtime.nc ?

@ndkeen
Contributor

ndkeen commented Jun 16, 2017

I also had a question about the editing of the streams file.

I made a copy of the files and created the .rst53 streams file below:

cori12% pwd
/global/cscratch1/sd/ndk/mpas-restart-files
cori12% ls -l
total 35983660
drwxrwxr-x  2 ndk ndk        4096 Jun 15 20:26 ./
drwxr-xr-x 63 ndk ndk       20480 Jun 15 20:20 ../
-rw-r--r--  1 ndk ndk  8216369200 Jun 15 20:17 mpascice.rst.0053-01-01_00000_noxtime.nc
-rw-r--r--  1 ndk ndk 28630788832 Jun 15 20:19 mpaso.rst.0053-01-01_00000_noxtime.nc
-rw-r--r--  1 ndk ndk       11172 Jun 15 20:18 streams.cice.rst39
-rw-r--r--  1 ndk ndk       11542 Jun 15 20:26 streams.cice.rst53
-rw-r--r--  1 ndk ndk         718 Jun 15 20:19 streams.cice_additions
-rw-r--r--  1 ndk ndk       20122 Jun 15 20:18 streams.ocean.rst39
-rw-r--r--  1 ndk ndk       20166 Jun 15 20:21 streams.ocean.rst53

Can you verify I've made the correct edits? For the ocean, I simply used the mpaso rst53 file in two places. For the cice, I added the two entries you suggested, but a) one of them was already there and b) there was another that looks like it should also be changed to use the mpascice rst53 file as well.
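A quick way to double-check such edits (a hedged sketch using the paths from the listing above) is to grep the edited copies for the new restart filenames:

    # Show every line in the edited streams copies that references the 0053 restarts.
    cd /global/cscratch1/sd/ndk/mpas-restart-files
    grep -n 'rst.0053' streams.ocean.rst53 streams.cice.rst53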

@worleyph
Contributor Author

@vanroekel, FYI - I am still getting the warnings ("Warning: abs(sum(h)-bottomDepth)>2m. ..."). I wasn't sure whether the new restart files and changes to the streams files were supposed to eliminate these or not.

@jayeshkrishna
Contributor

@vanroekel: Are mpaso.rst.0039-01-01_00000.nc and mpascice.rst.0039-01-01_00000.nc available on Titan?

@jonbob
Contributor

jonbob commented Jun 19, 2017

@worleyph - The new files won't get rid of the warnings, but we should change the logging so it doesn't look like a fatal error.

@worleyph
Contributor Author

Luke gave me 0053 (not 0039) most recently. See the above comment from Luke as to the location of these files and example streams files on Titan.

@jayeshkrishna
Contributor

jayeshkrishna commented Jun 19, 2017

Does a run with the 0053 files complete successfully? (I am currently only looking at I/O performance, so warnings are OK for me as long as runs complete as expected.)

@worleyph
Contributor Author

@jonbob, do the warnings indicate that the simulation will not be realistic (performance-wise)? Mark Petersen was going to check on this at one point, but I never heard back what his conclusion was.

@worleyph
Contributor Author

@jayeshkrishna, yes the jobs run successfully using 0053 (and 0039). You'll have to ask Luke why he had us update to 0053.

@jonbob
Contributor

jonbob commented Jun 19, 2017

@mark-petersen will know more, but I don't believe it means the simulation is not realistic -- it just has some inconsistencies in its initial conditions. I see those warnings in virtually every run lately. I'll open an MPAS issue so we don't forget about tracking this down.

@vanroekel
Contributor

@jonbob and @worleyph I have seen these warnings as well, and I have seen no visible issues with the simulation. I don't recall seeing them frequently. The last time I recall seeing this was when I updated to CIME53, which was a bit of a harder restart than usual. I had those warnings for one submission and then no issues since. @worleyph could you try a restart after your short test? I.e., do a 5-day simulation with warnings and then another 5 days from a restart, and see if the warnings remain?

Regarding year 53 vs. 39: there is no performance reason for this. I thought I'd pull the latest restart, as the ocean will have a more spun-up circulation. But I wouldn't expect any performance change.

@worleyph
Contributor Author

  @worleyph could you try a restart after your short test? 

I'm still traveling - home next week. Can't guarantee to do this until then. Perhaps someone else wants to volunteer.

My concern is based on @mark-petersen 's comment

 @worleyph that error message means the initial sea surface height is unphysical. It may indicate that variables are read in incorrectly, or initialized to zero.

and Mark was going to verify that the results from my run were reasonable.

@jonbob
Contributor

jonbob commented Jun 20, 2017

@ndkeen - I checked your directory and there is no ocean restart file? But I think you followed the correct steps. Can you redo the original run (meaning set CONTINUE_RUN back to false) and let's make sure all the restart files get output correctly?

@ndkeen
Contributor

ndkeen commented Jun 20, 2017

OK, will do. Not sure what happened, as I set REST_OPTION/REST_N to be the same as STOP_OPTION/STOP_N:

    <entry id="REST_OPTION" value="ndays">
    <entry id="REST_N" value="5">

But I admit I'm not sure I verified the restart files were there, since I had tested this previously.
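For reference, a hedged sketch of setting and verifying those entries from the case directory (the xmlchange syntax varies a bit between CIME versions):

    # Write restarts every 5 days, matching STOP_OPTION/STOP_N, then confirm.
    ./xmlchange REST_OPTION=ndays
    ./xmlchange REST_N=5
    ./xmlquery REST_OPTION
    ./xmlquery REST_N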

@jonbob
Contributor

jonbob commented Jun 20, 2017

I think you did everything correctly. I'm trying to see if something got messed up, but the ocn and acme logs both closed successfully. So nothing is jumping out -- other than running it again and making sure the restart file shows up. And I don't check for them either, so who knows? IO at high-res is just problematic...

@ndkeen
Contributor

ndkeen commented Jun 20, 2017

As Cori scratch will start enforcing purging soon, I've been trying to back up to tape. And before doing that, I have been very busy trying to remove large files from directories, so it's possible I accidentally deleted the restart files. I can try in another directory that does have the restart files, but I also restarted a fresh experiment just now.

It would also be nice if the code could let us know a little sooner that the restart files are missing.

@jonbob
Contributor

jonbob commented Jun 20, 2017

That would make some sense -- everything looked set up to correctly output the files. And no sign in any log of any problems. So yes, please try again and let me know

@jonbob
Contributor

jonbob commented Jun 20, 2017

And I'll see if it's possible for restart files to get included in check_input_data. I'm not sure, because the codes have to parse the rpointer files to fully extract the filenames. But it's always painful to hit missing-file issues after getting through the queues.
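As an aside, here is a hedged sketch of the kind of pre-submit check being discussed: pull the restart filenames out of the rpointer files and verify they exist in the run directory. The rpointer names and the run-directory path are assumptions.

    # Hypothetical check (rpointer names and paths are assumptions): make sure
    # every restart file named in an rpointer file is present before resubmitting.
    cd run
    for ptr in rpointer.ocn rpointer.ice; do
        [ -f "$ptr" ] || { echo "missing $ptr"; continue; }
        while read -r f; do
            [ -e "$f" ] || echo "MISSING restart listed in $ptr: $f"
        done < "$ptr"
    done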

@ndkeen
Contributor

ndkeen commented Jun 21, 2017

I ran it again and here are the netcdf files I see:

-rw-rw-r-- 1 ndk ndk  8216369200 Jun 21 06:46 run/mpascice.rst.0001-01-06_00000.nc
-rw-rw-r-- 1 ndk ndk 28630788832 Jun 21 06:46 run/mpaso.rst.0001-01-06_00000.nc
-rw-rw-r-- 1 ndk ndk  3767851844 Jun 21 06:47 run/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n065t02.m64.st53P-restartb.cpl.r.0001-01-06-00000.nc

So it does appear that the larger one somehow got deleted, certainly by accident while I was cleaning up.

Trying to restart from this.

xmlchange CONTINUE_RUN=TRUE
case.submit

@jonbob
Contributor

jonbob commented Jun 21, 2017

That looks correct, though you should also have some restart files for the data models. We'll cross our fingers.

@ndkeen
Contributor

ndkeen commented Jun 21, 2017

OK, it worked. Read in restarts, ran another 5 days, and exited. And wrote another set of restarts.

g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n065t02.m64.st53P-restartb

Sorry about the scare.

@worleyph
Contributor Author

Did the ocean warning messages go away for the continuation run?

@ndkeen
Contributor

ndkeen commented Jun 21, 2017

No, they are still being written.

@worleyph
Contributor Author

So, @vanroekel, @ndkeen's restart test is a failure? Anything else we should try? Should we modify the code to output more details on the failure, e.g., what the values being read in that trigger the warning look like?

@ndkeen
Contributor

ndkeen commented Jun 22, 2017

I did NOT remove them from the first run, but looking at the timestamps, most of them were written during the 2nd job. There are 32 log.ocean.*err files with a timestamp of "5:30" and 572 files with a stamp of "14:00". So I think they are still being written, but maybe not by exactly the same ranks.

@jonbob
Contributor

jonbob commented Jun 22, 2017

@worleyph - I don't think that means @ndkeen's test is a failure at all. I filed an issue in mpas to make these warnings appear as warnings instead of errors. We looked more at what spawns the errors/warnings, and it's minor differences between the sum of level thicknesses and the sea surface height. So as soon as we have some extra cycles, we'll:

  • make sure these are issued as warnings instead of errors

  • modify the limits of the differences that constitute a real issue

Do you want me to point you at the corresponding mpas issue?

@ndkeen
Contributor

ndkeen commented Jun 22, 2017

Note: I missed an important "NOT" in my last comment. I did not remove the log files between the runs.

@worleyph
Contributor Author

We looked more at what spawns the errors/warnings, and it's minor differences between the sum of level thicknesses and the sea surface height.

Thanks.

Do you want me to point you at the corresponding mpas issue?

No thanks.

I still want someone to update the compset definitions so that we can use @vanroekel's files as the defaults (and in production runs, @golaz?), so that we don't run into the slow ocean initialization problem again. Unless, that is, @jayeshkrishna can work out why the fallback PIO method is so slow (and make it faster), which would make this less important.

@jonbob
Contributor

jonbob commented Jun 22, 2017

OK, that's a simple ACME thing. I'll try to do that today.

@vanroekel
Contributor

@jonbob I have a year 56 restart for an IC that you could use for ACME if you'd like.

/lustre/atlas1/cli115/world-shared/vanroekel/mpaso.rst.0056-01-01_0000.nc and mpascice.rst.0056-01-01.nc

@worleyph
Contributor Author

@jonbob, won't be BFB :-). Will anyone else care about this?

@jonbob
Contributor

jonbob commented Jun 22, 2017

no - it's not being tested yet. Or ever...

@jonbob
Contributor

jonbob commented Jun 22, 2017

@vanroekel - thanks. Are those files already on the svn server? If so, I'll update to just point at them...

@vanroekel
Contributor

vanroekel commented Jun 22, 2017 via email

@jonbob
Contributor

jonbob commented Jun 22, 2017

@vanroekel - but xtime has been yanked out?

@vanroekel
Contributor

vanroekel commented Jun 22, 2017 via email

@jonbob
Contributor

jonbob commented Jun 22, 2017

perfect...

@jonbob
Contributor

jonbob commented Jun 22, 2017

@worleyph and @ndkeen - I've pushed @vanroekel's updated restart files to the svn data server, but renamed them to be consistent with our previous convention. And I'm pulling together a PR to point at them by default, at least for the A_WCYCLXXXXS compsets that use the _ICG grids. However, it won't help with the corresponding G-cases. Luke and I were discussing the possibility of creating something like GMPAS-XXXS compsets that would also start from spun-up initial condition sets. Do you think that would be useful?

@jonbob
Contributor

jonbob commented Jun 26, 2017

OK, the PR with the new IC files has been completed. That should solve the issue for the A_WCYCL cases, but not the G-cases.

@worleyph
Contributor Author

@jonbob and/or @vanroekel , I assume that this issue can be closed now? Can you confirm and do so?

Thanks.

@jonbob
Contributor

jonbob commented Oct 2, 2017

@worleyph - I do believe the original reason for this issue has been resolved. Of course, new initial condition files are now required because of all the recent changes to ocn/ice -- but that's a different matter and does not need to be tracked here.

@jonbob jonbob closed this as completed Oct 2, 2017