MPAS fail in new A_WCYCL2000 case on cori #698
The above error was for the case that was created as:
Was this also seen only after setting finidat=' '?
finidat was not modified, so the land model successfully read the initial-condition netCDF file.
Huh. Then why could Edison not read it correctly, as in #699?
See the following: #699 (comment)
I just did an update on master to get PR #700, and I still see the same failure as before (the same thing happened on Edison). To be clear: 089: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
In an interactive queue, the model successfully completed multiple steps (atm = 32 steps, land = 31 steps) until the job ran out of wall-clock time. I used 33e40ec for this exercise, and below are my steps:
I also tried what Gautam tried (i.e., using salloc on Cori), and I get the same failure as I do in batch.
OK, just to be sure this wasn't a difference between the way Gautam and I were creating the test and launching the job, I also tried using create_newcase:
./create_newcase -case /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec -compset A_WCYCL2000 -res ne30_oEC -mach corip1 -compiler intel -proj acme
I then tried both salloc (the interactive queue) and a simple c.submit, and I get the same failures as I reported previously. Gautam, perhaps there is an environment difference. Maybe you and I can make sure we are starting with the same things, and if we find an issue, find a way to guard against it in the future.
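For anyone reproducing this, a minimal sketch of the interactive workflow being described (salloc flags are standard SLURM; the queue, node count, time limit, and submit-script name below are assumptions, not the exact settings used):

```sh
# Request an interactive allocation on Cori (queue/size/time are assumptions)
salloc -N 64 -t 02:00:00 -p debug

# From inside the allocation, go to the case and submit/run as usual
cd /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec
./A_WCYCL2000.ne30_oEC.corip1.33e40ec.submit   # hypothetical script name
```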
Gautam had suggested I try adding finidat=' ' in the user CLM namelist. This allows the run to continue (on Edison as well). I ran the job for over an hour on Cori, and it finally failed. Last lines of the coupler log and cesm.log:
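For anyone following along, the workaround amounts to one override line in the CLM user namelist (assuming the user_nl_clm mechanism of this code base; blanking finidat makes CLM cold-start instead of reading the initial-condition file):

```sh
# Append the override to user_nl_clm in the case directory (assumed file name)
echo "finidat = ' '" >> user_nl_clm
```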
@ndkeen, @amametjanov : The CLM initial-condition file on NERSC didn't have correct permissions for others/group. Thus, your jobs were failing to read the finidat file on Cori and Edison, while my jobs were able to read the initial-condition file. I have updated the file permissions; please try submitting jobs again without setting finidat.
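A quick way to confirm and fix this class of failure on a shared file system (generic commands; the path below is a placeholder, not the actual finidat path):

```sh
# Check whether group/other have read permission on the file
ls -l /path/to/clm_finidat_file.nc     # placeholder path

# Grant read permission to group and others
chmod go+r /path/to/clm_finidat_file.nc
```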
Thanks Gautam, good catch. I was debugging it last night and I also found that it was the finidat file, but it was not obvious it was a permission problem. You'd think PIO would be a little better about telling you something that obvious. Since I was already building with DEBUG=TRUE, I ran another one quickly, and it does get past the errors I had been reporting. However, now I see a different error! :( 000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
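For reference, a debug build in the case scripts of this era is toggled through env_build.xml; a sketch, assuming the xmlchange syntax of that version (the case-script names are hypothetical, and a clean rebuild is needed after the change):

```sh
# In the case directory: switch to a debug build, then rebuild from scratch
./xmlchange -file env_build.xml -id DEBUG -val TRUE
./A_WCYCL2000.ne30_oEC.corip1.33e40ec.clean_build   # hypothetical script name
./A_WCYCL2000.ne30_oEC.corip1.33e40ec.build         # hypothetical script name
```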
@akturner is the designated debug person for MPAS-CICE.
Also pinging @toddringler.
Hmmm, interesting. Is any more debug info available? It appears to fail in a call to shortwave_dEdd, but the rest of the call stack isn't listed.
Just recording this here as well. @akturner : I can look around for more info, but there may not be much more.
Not sure I understand those line numbers:
Just to help make sure we are looking at the same code: my bld/ice/source/core_cice/column/ice_shortwave.f90 has this at line 1010:
Exactly. These seem to be the two lines in the call stack. But shortwave_dEdd does not appear to be called from compute_shortwave_trcr.
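One plausible explanation for the mismatched line numbers: the traceback reports positions in the preprocessed lowercase .f90 files under bld/, not the original .F90 sources, so the two can drift apart. A quick way to check (generic commands, using the bld path quoted above):

```sh
# Show what actually sits at the reported line in the preprocessed file
sed -n '1010p' bld/ice/source/core_cice/column/ice_shortwave.f90

# Find where shortwave_dEdd is actually called in the preprocessed source
grep -n 'shortwave_dEdd' bld/ice/source/core_cice/column/ice_shortwave.f90
```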
I had reported above that the out-of-the-box run (optimized) on Cori was 'working' but stopped after running out of walltime. I let it run longer, and it is failing here: 000: pionfput_mod.F90 128 1 64 0001-01-01_00:00:00 This looks very similar to how it failed once on Edison (and for Az).
So the sea ice was not the issue?
Well, I just rebuilt with DEBUG=TRUE and re-ran: same issue. So if you want to ignore that it fails in debug...? I did find more info: there is a core file! Using host libthread_db library "/lib64/libthread_db.so.1".
Are line numbers available from the backtrace?
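The "libthread_db" line above suggests the core was already opened in gdb; a sketch of pulling a backtrace with line numbers out of it (the core-file name is a placeholder; with a DEBUG=TRUE build the frames should carry file:line info):

```sh
# Open the core against the matching executable and print a backtrace
gdb -batch -ex bt ./cesm.exe core.12345   # core-file name is a placeholder
```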
Please see if PR #704 will allow a no-debug run on Cori. If so, close this and open a new issue for the debugging problem.
@ndkeen - the submit script failed for me -- the only thing in the cesm log was "sh: srun: command not found". Did you have to change the machine files to get this to run?
My non-debug run on Cori successfully completed. Non-default settings:
Great! What's this about commenting out the module rm's? Is that something that should always be done for all ACME users on Cori?
Unloading and loading of specific module versions in env_mach_specific is leading to job failures.
I'm trying this again out-of-the-box. So far it's been running for 8 minutes, so I'm at least past the point where others have seen the "srun issue". I'm running in batch mode (as with create_test). I will continue debugging so that I can reproduce what others see. Also, I'm not sure how long this job will take, or if it is safe/advisable to increase the number of nodes/cores.
I ran a short run in debug mode that finished successfully overnight. And I got past the "srun not found" issue by adding the path to it in the run script. I'll try this morning with a larger number of cores and see if it will run for a few months.
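For others hitting "srun: command not found", the workaround described here amounts to making srun visible in the batch environment; a sketch (the srun location below is an assumption; check with `which srun` on a login node first):

```sh
# Find where srun lives on a login node
which srun

# In the run script, prepend that directory to PATH before launching
export PATH=/usr/bin:$PATH      # assumed location; substitute the real one
command -v srun                 # sanity check that srun now resolves
```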
With the latest PE changes and my "fix" to the qx() arg, this test is finally successful for me.
Fix issue with pylint: code_checker wasn't being run from the correct directory, causing some silent failures. Fixed that, plus two pylint issues found after it was fixed.
Test suite:
Test baseline:
Test namelist changes:
Test status: [bit for bit, roundoff, climate changing]
Fixes #698
User interface changes?:
Code review:
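The directory bug described in this PR is the classic "checker silently passes because it never sees the intended files" failure; a generic illustration of the fix, not the actual code_checker layout (paths and file names below are hypothetical):

```sh
# Failure mode: invoking the checker from the wrong working directory,
# so relative paths resolve to nothing (or to the wrong files).
# Fix: change to the directory the checker expects before running pylint.
cd "$REPO_ROOT/cime/scripts" || exit 1   # hypothetical directory
pylint --disable=R,C my_module.py        # hypothetical target module
```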
@ndkeen reports the following:
cesm.exe 000000000AB554F5 Unknown Unknown Unknown
cesm.exe 000000000AB532B7 Unknown Unknown Unknown
cesm.exe 000000000AB03974 Unknown Unknown Unknown
cesm.exe 000000000AB03786 Unknown Unknown Unknown
cesm.exe 000000000AA869C6 Unknown Unknown Unknown
cesm.exe 000000000AA9152E Unknown Unknown Unknown
cesm.exe 000000000A453250 Unknown Unknown Unknown
cesm.exe 000000000AA91450 Unknown Unknown Unknown
cesm.exe 000000000A453250 Unknown Unknown Unknown
cesm.exe 0000000008690280 ocn_gm_mp_ocn_gm_ 183 mpas_ocn_gm.f90
cesm.exe 000000000861D86F ocn_init_routines 635 mpas_ocn_init_routines.f90
cesm.exe 00000000084E7C0C ocn_forward_mode_ 308 mpas_ocn_forward_mode.f90
cesm.exe 0000000008243F95 ocn_core_mp_ocn_c 79 mpas_ocn_core.f90
cesm.exe 00000000079E228C ocn_comp_mct_mp_o 458 ocn_comp_mct.f90
cesm.exe 0000000000443AD4 component_mod_mp_ 229 component_mod.F90
cesm.exe 0000000000410F99 cesm_comp_mod_mp_ 1163 cesm_comp_mod.F90
cesm.exe 00000000004371E8 MAIN__ 102 cesm_driver.F90
cesm.exe 000000000040144E Unknown Unknown Unknown
cesm.exe 000000000AB5C4E1 Unknown Unknown Unknown
cesm.exe 00000000004012B5 Unknown Unknown
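For the Unknown frames at the top of this traceback, addr2line can sometimes recover function/file/line from the raw addresses, provided the executable was linked with debug symbols (a generic sketch; the address is copied from the first frame above, and it prints ?? without -g):

```sh
# Translate a raw traceback address into function, file, and line number
addr2line -f -e cesm.exe 0x000000000AB554F5
```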