Restart problem with MPASSI prescribed ice mode #3936

Closed
wlin7 opened this issue Nov 9, 2020 · 16 comments

@wlin7
Contributor

wlin7 commented Nov 9, 2020

MPASSI prescribed ice mode is being tested for F2010 with compset F2010SC5-CMIP6-MPASSI and grid ne30_r05_oECv3. There are several problems related to restart.

  1. It does not work with the netcdf typename. With PIO_ICE_TYPENAME=netcdf, attempting to create/write a restart file produces the error message
    FATAL ERROR: NetCDF: Operation not allowed in define mode (/qfs/people/linw288/E3SM/integration/E3SM.testing/externals/scorpio/src/clib/pio_file.c: 349)

I am not sure which mpassi restart file write is causing the problem.

The time stamp format for mpassi.rst.am.timeSeriesStatsMonthly (and mpassi.rst.am.timeSeriesStatsDaily) is not consistent with model time. For example, for a test run starting from 1980-01-01, the time stamp for the above two mpassi.rst files is 0001-01-01.

  2. The run is OK when using PIO_ICE_TYPENAME=pnetcdf, but it still produces error messages like
    ERROR: Opening file (20201105.alpha5_v1p-1.F2010-mpassi.ne30pg2_r05_oECv3.compy.mpassi.rst.am.timeSeriesStatsMonthly.0001-01-01_00000.nc) with iotype 1 (PIO_IOTYPE_PNETCDF) failed. The low level I/O library call failed. NetCDF: Unknown file format (error num=-51), (/qfs/people/linw288/E3SM/integration/E3SM.testing/externals/scorpio/src/clib/pioc_support.c:2956)

  3. When restarting a run using restart files generated from '2' above, the results are non-BFB.
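For anyone who wants to check the restart time stamps in a rundir quickly, here is a minimal sketch. It assumes restart file names end in `.YYYY-MM-DD_SSSSS.nc`, as in the file name quoted in the error message above. The helper names `restart_stamp` and `stamp_matches` are hypothetical and are not part of E3SM or MPAS.

```python
import re

# Trailing ".YYYY-MM-DD_SSSSS.nc" stamp, the pattern seen in the restart
# file name quoted in this issue (an assumption about naming in general).
STAMP = re.compile(r"\.(\d{4})-(\d{2})-(\d{2})_(\d{5})\.nc$")


def restart_stamp(filename):
    """Return ((year, month, day), seconds_of_day) from a file name, or None."""
    m = STAMP.search(filename)
    if m is None:
        return None
    year, month, day, secs = (int(g) for g in m.groups())
    return (year, month, day), secs


def stamp_matches(filename, expected_ymd):
    """True if the date stamp in the file name equals the expected (y, m, d)."""
    parsed = restart_stamp(filename)
    return parsed is not None and parsed[0] == expected_ymd


name = (
    "20201105.alpha5_v1p-1.F2010-mpassi.ne30pg2_r05_oECv3.compy."
    "mpassi.rst.am.timeSeriesStatsMonthly.0001-01-01_00000.nc"
)
print(restart_stamp(name))                 # ((1, 1, 1), 0)
print(stamp_matches(name, (1980, 1, 1)))   # False
```

Dates are kept as plain tuples rather than `datetime.date` objects so the check does not depend on which calendar the model uses.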

@wlin7 wlin7 assigned wlin7, akturner and jonbob and unassigned wlin7 Nov 9, 2020
@rljacob
Member

rljacob commented Nov 9, 2020

What machine are you trying this on?

@jonbob
Contributor

jonbob commented Nov 9, 2020

Thanks for tracking this down, @wlin7 - let me try this as well

@wlin7
Contributor Author

wlin7 commented Nov 10, 2020

Thanks @rljacob. This reminds me of some additional info. The tests were done on compy and cori-knl, with the same behavior on both. There is one difference: cori-knl gave a backtrace when reporting "FATAL ERROR: NetCDF: Operation not allowed in define mode".


   0: e3sm.exe           00000000035A02E1  mpas_io_streams_m        4510  mpas_io_streams.f90
   0: e3sm.exe           0000000003389979  mpas_stream_manag        3324  mpas_stream_manager.f90
   0: e3sm.exe           00000000033891AA  mpas_stream_manag        2768  mpas_stream_manager.f90
   0: e3sm.exe           0000000002FE56E8  ice_comp_mct_mp_i        1127  ice_comp_mct.f90
   0: e3sm.exe           0000000000423239  component_mod_mp_         737  component_mod.F90
   0: e3sm.exe           0000000000403F3F  cime_comp_mod_mp_        2688  cime_comp_mod.F90
   0: e3sm.exe           0000000000422E2D  MAIN__                    153  cime_driver.F90

@wlin7
Contributor Author

wlin7 commented Nov 10, 2020

> Thanks for tracking this down, @wlin7 - let me try this as well

Thanks for looking into this, @jonbob. I completed a 6-year simulation that does not involve a restart run. The results look reasonable. The longer AMIP simulation will require several continuation runs, so I am putting it on hold for now.

@jonbob
Contributor

jonbob commented Nov 10, 2020

It works fine with RUN_STARTDATE 0001-01-01, so I'm guessing it has something to do with the seaice model not picking up the change correctly

@wlin7
Contributor Author

wlin7 commented Nov 10, 2020

@jonbob, did you mean that your test with RUN_STARTDATE 0001-01-01 does not show some of the issues? My test with the F2010 compset started from 0001-01-01 but still had the problems.

@jonbob
Contributor

jonbob commented Nov 10, 2020

My test with RUN_STARTDATE 0001-01-01 ran fine -- I didn't check and see if the results were BFB. My test with RUN_STARTDATE 2010-01-01 also ran fine, but I'm seeing if it will restart now. Can you point me to your case so I can compare?

@jonbob
Contributor

jonbob commented Nov 10, 2020

Update -- the restart from 2010-01-01 also completed. I'll try an ERS test

@tangq
Contributor

tangq commented Nov 10, 2020

I just saw a BFB restart issue in the coupled RRM case. It may or may not be related to this restart problem with the MPASSI prescribed ice mode.

The coupled RRM restart runs are non-BFB, likely because of coupling between different components. See the "non-BFB" section on this page for details.

@wlin7
Contributor Author

wlin7 commented Nov 10, 2020

Thanks for the update, @jonbob . My rundir on cori is
/global/cscratch1/sd/wlin/E3SM_simulations/20201108.alpha5_v1p-1.F2010-mpassi.ne30pg2_r05_oECv3.cori-knl/run

Multiple runs were done using that case, so it may not be straightforward to see the problem from the current state of what is in there. But you can see the error in e3sm.log.36009612.201108-083603 when PIO_ICE_TYPENAME=netcdf.

Are your tests using PIO_TYPENAME=pnetcdf for all? The model runs OK with pnetcdf. I used a template script from @xuezhengllnl for running the F2010 case on compy. The script explicitly sets PIO_TYPENAME="netcdf". That is how these problems were initially exposed. Can you also try setting PIO_ICE_TYPENAME="netcdf" to see if you can reproduce the problem?

Also, in your test with RUN_STARTDATE=2010-01-01, what are the time stamps for mpassi.rst.am.timeSeriesStatsMonthly (and mpassi.rst.am.timeSeriesStatsDaily)?

BTW, did you set anything in user_nl_mpassi? That may impact how ice fields are saved. My run has it empty.

@jonbob
Contributor

jonbob commented Nov 10, 2020

@wlin7 - my ERS tests passed, both ERS.ne30pg2_r05_oECv3.F2010SC5-CMIP6-MPASSI.compy_intel and ERS.ne30_oECv3.F2010SC5-CMIP6-MPASSI.compy_intel. I'll check about using netcdf instead of pnetcdf, but is there a compelling reason you want to do this?

I'm also leaving user_nl_mpassi empty, just running straight out of the box.

@jonbob
Contributor

jonbob commented Nov 10, 2020

@wlin7 - I am seeing the issue you're having, but only when we force seaice to use netcdf. Let me see if I can track down the problem. In the meantime, is there a reason not to use pnetcdf?

@wlin7
Contributor Author

wlin7 commented Nov 10, 2020

@jonbob, I would be fine with using pnetcdf, but it would be helpful if it could also run with the netcdf typename. I don't know if netcdf would be recommended over pnetcdf under certain circumstances. There could be personal computers that do not have pnetcdf but are powerful enough for running SCM.

When running with pnetcdf, did you see error type #2 in e3sm.log? I wonder in my case, if it is because the file was created during the first (and failed) attempt of running with netcdf, which was then not recognizable when running with pnetcdf. I will clean up the rundir and do a fresh run with pnetcdf.
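One way to test the stale-file hypothesis is to look at the first bytes of the suspect restart file. The sketch below is only a rough check under stated assumptions: it relies on the standard netCDF magic numbers (`CDF` followed by a version byte of 1, 2, or 5 for the classic-family formats, and the 8-byte HDF5 signature for netCDF-4), and on PnetCDF reading only the classic-family formats. The function name `netcdf_kind` is hypothetical, and nothing in this thread establishes which format the leftover file actually had.

```python
import os
import tempfile

# Version byte that follows the b"CDF" prefix in classic-family files.
CDF_VERSIONS = {
    1: "classic (CDF-1)",
    2: "64-bit offset (CDF-2)",
    5: "64-bit data (CDF-5)",
}
# HDF5 signature at offset 0 (files with a user block are not handled here).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_kind(path):
    """Classify a file by its leading magic bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) >= 4 and head[:3] == b"CDF":
        return CDF_VERSIONS.get(head[3], "unknown")
    if head == HDF5_SIGNATURE:
        return "HDF5 (netCDF-4)"
    return "unknown"


# Demo on synthetic headers only; these are not real restart files.
with tempfile.TemporaryDirectory() as tmp:
    samples = {
        "classic.nc": b"CDF\x01" + b"\x00" * 4,
        "nc4.nc": HDF5_SIGNATURE,
        "truncated.nc": b"",
    }
    for fname, payload in samples.items():
        path = os.path.join(tmp, fname)
        with open(path, "wb") as f:
            f.write(payload)
        print(fname, "->", netcdf_kind(path))
```

A result of "unknown" (for example from an empty or partially written file) or "HDF5 (netCDF-4)" would both be consistent with an "Unknown file format" error from a pnetcdf open, but this check alone does not prove that is what happened here.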

@jonbob
Contributor

jonbob commented Nov 10, 2020

I saw the "define mode" problem, so I'll see if building it with debug on gives any more information

@wlin7
Contributor Author

wlin7 commented Nov 11, 2020

@jonbob, I learned from @xuezhengllnl that there was a time when pnetcdf was not working on compy. That was the reason for the reset to netcdf in the run_e3sm script. pnetcdf is OK now on compy. At least for production machines, there is no reason not to use pnetcdf. That said, thanks for continuing to look into it.

@wlin7
Contributor Author

wlin7 commented Nov 12, 2020

Hi @jonbob, just to update you: with a clean run using pnetcdf, problems #2 and #3 as described above do not appear. And although the time stamps for the mpassi.rst.am.timeSeriesStat files are not consistent with those for other files, this does not affect the simulation or BFB reproducibility.
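For readers less familiar with the term, "BFB" (bit-for-bit) here means the restarted run reproduces the same floating-point bit patterns as the continuous run, not merely values that agree within a tolerance. The sketch below illustrates that distinction on plain Python lists. It is not how E3SM tests compare output, and the names `bits` and `first_bfb_mismatch` are hypothetical.

```python
import struct


def bits(x):
    """Return the IEEE-754 double-precision bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def first_bfb_mismatch(a, b):
    """Index of the first element whose bit pattern differs, or None if BFB."""
    if len(a) != len(b):
        raise ValueError("sequences differ in length")
    for i, (x, y) in enumerate(zip(a, b)):
        if bits(x) != bits(y):
            return i
    return None


# Made-up values: 0.1 + 0.2 and 0.3 agree to roundoff but are not BFB.
baseline = [0.1 + 0.2, 1.0, 2.5]
restarted = [0.3, 1.0, 2.5]

print(first_bfb_mismatch(baseline, baseline))    # None
print(first_bfb_mismatch(baseline, restarted))   # 0
print(abs(baseline[0] - restarted[0]) < 1e-12)   # True
```

Comparing bit patterns rather than using `==` also distinguishes `0.0` from `-0.0`, which matters for a strict bit-for-bit check.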

The issue now is really just #1. Sorry for the misleading information due to the testing sequence I used.

@wlin7 wlin7 closed this as completed Apr 13, 2021