MPAS fail in new A_WCYCL2000 case on cori #698

Closed
rljacob opened this issue Feb 11, 2016 · 32 comments

Comments

@rljacob
Member

rljacob commented Feb 11, 2016

@ndkeen reports the following:

cesm.exe 000000000AB554F5 Unknown Unknown Unknown
cesm.exe 000000000AB532B7 Unknown Unknown Unknown
cesm.exe 000000000AB03974 Unknown Unknown Unknown
cesm.exe 000000000AB03786 Unknown Unknown Unknown
cesm.exe 000000000AA869C6 Unknown Unknown Unknown
cesm.exe 000000000AA9152E Unknown Unknown Unknown
cesm.exe 000000000A453250 Unknown Unknown Unknown
cesm.exe 000000000AA91450 Unknown Unknown Unknown
cesm.exe 000000000A453250 Unknown Unknown Unknown
cesm.exe 0000000008690280 ocn_gm_mp_ocn_gm_ 183 mpas_ocn_gm.f90
cesm.exe 000000000861D86F ocn_init_routines 635 mpas_ocn_init_routines.f90
cesm.exe 00000000084E7C0C ocn_forward_mode_ 308 mpas_ocn_forward_mode.f90
cesm.exe 0000000008243F95 ocn_core_mp_ocn_c 79 mpas_ocn_core.f90
cesm.exe 00000000079E228C ocn_comp_mct_mp_o 458 ocn_comp_mct.f90
cesm.exe 0000000000443AD4 component_mod_mp_ 229 component_mod.F90
cesm.exe 0000000000410F99 cesm_comp_mod_mp_ 1163 cesm_comp_mod.F90
cesm.exe 00000000004371E8 MAIN__ 102 cesm_driver.F90
cesm.exe 000000000040144E Unknown Unknown Unknown
cesm.exe 000000000AB5C4E1 Unknown Unknown Unknown
cesm.exe 00000000004012B5 Unknown Unknown

@bishtgautam
Contributor

The above error was for the case that was created as:

./create_newcase -case A_WCYCL2000.ne30_oEC.corip1.2efd89d.debug \
-compset A_WCYCL2000 -res ne30_oEC -mach corip1 -compiler intel -proj acme
cd A_WCYCL2000.ne30_oEC.corip1.2efd89d.debug
./xmlchange -file env_build.xml -id DEBUG -val TRUE
./cesm_setup
./*.build

@rljacob
Member Author

rljacob commented Feb 11, 2016

Was this also seen only after setting finidat=' ' ?

@bishtgautam
Contributor

finidat was not modified. So, the land model successfully read the initial condition netcdf file.

@rljacob
Member Author

rljacob commented Feb 11, 2016

Huh. Then why could edison not read it correctly as in #699 ?

@bishtgautam
Contributor

See the following #699 (comment)

@ndkeen
Contributor

ndkeen commented Feb 11, 2016

I just did an update on master to get PR #700, and I still see the same failure as before. (same thing happened on edison)

To be clear:
Last line of coupler log: (component_init_cc:mct) : Initialize component lnd

089: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
089: NetCDF: Not a valid ID
093: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
093: NetCDF: Not a valid ID
001: Image PC Routine Line Source
001: cesm.exe 0000000003F596EE Unknown Unknown Unknown
001: cesm.exe 0000000002D09941 pio_support_mp_pi 120 pio_support.F90
001: cesm.exe 0000000002D078D2 pio_utils_mp_chec 74 pio_utils.F90
001: cesm.exe 0000000002E314AA ionf_mod_mp_open_ 235 ionf_mod.F90
001: cesm.exe 0000000002CF7586 piolib_mod_mp_pio 2755 piolib_mod.F90
001: cesm.exe 0000000001E5BC5B ncdio_pio_mp_ncd_ 188 ncdio_pio.F90.in
001: cesm.exe 0000000001E8B0CA restfilemod_mp_re 760 restFileMod.F90
001: cesm.exe 0000000001E8CEAE restfilemod_mp_re 437 restFileMod.F90
001: cesm.exe 0000000001DC476C clm_initializemod 910 clm_initializeMod.F90
001: cesm.exe 0000000001DB00BD lnd_comp_mct_mp_l 235 lnd_comp_mct.F90
001: cesm.exe 000000000041C6DD component_mod_mp_ 229 component_mod.F90
001: cesm.exe 000000000040C0DE cesm_comp_mod_mp_ 1151 cesm_comp_mod.F90
001: cesm.exe 0000000000417818 MAIN__ 102 cesm_driver.F90
001: cesm.exe 000000000040134E Unknown Unknown Unknown
001: cesm.exe 00000000040330A1 Unknown Unknown Unknown
001: cesm.exe 00000000004011B5 Unknown Unknown Unknown

@bishtgautam
Contributor

In an interactive queue, the model successfully completed multiple steps (atm = 32 steps, land = 31 steps) till the job ran out of wall clock time. I used 33e40ec for this exercise and below are my steps:

GIT_HASH=`git log -n 1 --format=%h`
COMPSET=A_WCYCL2000
RES=ne30_oEC
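MACH=corip1
COMPILER=intel   # assumed; matches the earlier create_newcase command
PROJ=acme        # assumed; matches the earlier create_newcase command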
CASE=$COMPSET.$RES.$MACH.$GIT_HASH

# Build the case
./create_newcase -case $CASE -compset $COMPSET -res $RES -mach $MACH -compiler $COMPILER -proj $PROJ
cd $CASE
./cesm_setup
./$CASE.build

# Interactive queue
salloc -N 4 -p debug -A acme
cd /global/cscratch1/sd/gbisht/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec/run
srun  --label  -n 128  -c 1 ../bld/cesm.exe

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

I also tried what Gautam tried (i.e., using salloc on cori) and I get the same failure as I do in batch.
Next I will see if it matters which way we are creating the case. I'm using create_test and he is using create_newcase.

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

OK, just to be sure this wasn't a difference between the way Gautam and I were creating the test and launching the job, I also tried using create_newcase:

./create_newcase -case /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec -compset A_WCYCL2000 -res ne30_oEC -mach corip1 -compiler intel -proj acme

I then tried both using salloc (and using interactive Q) as well as simple c.submit.

I get the same failures as I reported previously.

Gautam, perhaps there is an environment difference. Maybe you and I can make sure we are starting with the same things and, if we find an issue, find a way to guard against it in the future.

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

Gautam had suggested I try adding finidat=' ' in user_nl_clm. This allows the run to continue (on edison as well). I ran the job for over an hour on cori and it finally failed.

last lines of coupler log:
Write history file at 10106 0
(seq_io_wopen) create file
SMS.ne30_oEC.A_WCYCL2000.corip1_intel.m00.cpl.hi.0001-01-06-00000.nc
tStamp_write: model date = 10106 0 wall clock = 2016-02-11 22:01:23 avg dt = 733.42 dt = 716.94

in cesm.log:
000: pionfput_mod.F90 128 1 5 64 1
000: 0001-01-06_00:00:00
000: *** glibc detected *** /global/cscratch1/sd/ndk/acme_scratch/SMS.ne30_oEC.A_WCYCL2000.corip1_intel.m00/bld/cesm.exe: double free or corruption (!prev): 0x0000000035a2ec40 ***
000: ======= Backtrace: =========
000: [0x4060474]

@bishtgautam
Contributor

@ndkeen, @amametjanov : The CLM initial condition file on NERSC didn't have correct permissions for others/group. Thus, your jobs were failing to read the finidat file on Cori and Edison, while my jobs were able to read the initial condition file. I have updated the file permissions, please try submitting jobs again without setting finidat = ' ' in the user_nl_clm file.
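A quick way to catch this kind of problem before submitting is to check the mode bits on the shared input file. The following is a minimal sketch, not something taken from the ACME scripts: the function name check_shared_read and the FINIDAT variable in the usage comment are hypothetical, and it only inspects the classic permission string (it does not look at ACLs or at permissions on parent directories).

```shell
#!/usr/bin/env bash
# Sketch: report whether a file is readable by both group and others.
# Prints "ok" when both read bits are set, "restricted" otherwise,
# and "missing" when the path does not exist.
check_shared_read() {
    local file=$1 perms
    [ -e "$file" ] || { echo "missing"; return 1; }
    # The first field of `ls -ld` is the mode string, e.g. -rw-r--r--
    perms=$(ls -ld -- "$file" | awk '{print $1}')
    # Index 4 is the group read bit, index 7 is the others read bit.
    if [ "${perms:4:1}" = "r" ] && [ "${perms:7:1}" = "r" ]; then
        echo "ok"
    else
        echo "restricted"
        return 1
    fi
}

# Hypothetical usage; FINIDAT is a placeholder, not the real file path:
# check_shared_read "$FINIDAT" || chmod g+r,o+r "$FINIDAT"
```

Only the file owner (or an administrator) can run the chmod in the usage comment, so other users would still need to report a "restricted" result to whoever owns the file.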

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

Thanks Gautam. Good catch. I was debugging it last night and I also found that it was the finidat file, but it was not obvious it was a permission problem. You'd think PIO would be a little better about telling you something that obvious.

Since I was already building debug=true, I ran another one quickly and it does get past the errors I had been reporting. However, now I see a different error! :(

000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
000: forrtl: error (65): floating invalid
000: Image PC Routine Line Source
000: cesm.exe 0000000008C78F05 Unknown Unknown Unknown
000: cesm.exe 0000000008C76CC7 Unknown Unknown Unknown
000: cesm.exe 0000000008C268E4 Unknown Unknown Unknown
000: cesm.exe 0000000008C266F6 Unknown Unknown Unknown
000: cesm.exe 0000000008BA9936 Unknown Unknown Unknown
000: cesm.exe 0000000008BB4777 Unknown Unknown Unknown
000: cesm.exe 0000000008587930 Unknown Unknown Unknown
000: cesm.exe 000000000436C60A ice_shortwave_mp_ 1010 ice_shortwave.f90
000: cesm.exe 000000000435619D ice_colpkg_mp_col 2569 ice_colpkg.f90
000: cesm.exe 000000000414DB1C cice_column_mp_co 2264 mpas_cice_column.f90
000: cesm.exe 0000000004192E7C cice_column_mp_ci 498 mpas_cice_column.f90
000: cesm.exe 000000000427428F cice_initialize_m 145 mpas_cice_initialize.f90
000: cesm.exe 000000000404D606 ice_comp_mct_mp_i 744 ice_comp_mct.f90
000: cesm.exe 0000000000458D4E component_mod_mp_ 1044 component_mod.F90
000: cesm.exe 0000000000424B96 cesm_comp_mod_mp_ 2548 cesm_comp_mod.F90
000: cesm.exe 0000000000437117 MAIN__ 107 cesm_driver.F90
000: cesm.exe 000000000040134E Unknown Unknown Unknown
000: cesm.exe 0000000008C7FEF1 Unknown Unknown Unknown
000: cesm.exe 00000000004011B5 Unknown Unknown Unknown
srun: error: nid00056: tasks 0,4-7,29: Aborted

@rljacob rljacob assigned akturner and unassigned douglasjacobsen Feb 12, 2016
@rljacob
Member Author

rljacob commented Feb 12, 2016

@akturner is the designated debug person for MPAS-CICE

@rljacob
Member Author

rljacob commented Feb 12, 2016

Also pinging @toddringler

@akturner
Contributor

Hmmmm, interesting. Is any more debug info available? It appears to fail in a call to shortwave_dEdd, but the rest of the call stack isn't listed.

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

Just recording this here as well.
After rebuilding optimized (DEBUG=FALSE) and running on cori for 30 min, it ran until it timed out, so it is obviously not failing as I reported above. I will re-submit for a longer time in the regular queue.

@akturner : I can look around for more info, but there may not be much more.

@akturner
Contributor

Not sure I understand those line numbers:
000: cesm.exe 000000000435619D ice_colpkg_mp_col 2569 ice_colpkg.f90 seems to refer to a call to compute_shortwave_trcr while
000: cesm.exe 000000000436C60A ice_shortwave_mp_ 1010 ice_shortwave.f90 doesn't seem to be within compute_shortwave_trcr

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

Just to help make sure we are looking at same code,

My bld/ice/source/core_cice/column/ice_shortwave.f90 has a line 1010
that looks like so:

     call shortwave_dEdd(n_aero,        n_zaero,        &

@akturner
Contributor

Exactly. These seem to be the two lines in the call stack:
https://github.com/ACME-Climate/MPAS/blob/403012d96cc0fb646215e0f3bbfc66b64fbf064d/src/core_cice/column/ice_shortwave.F90#L1010: call shortwave_dEdd(n_aero, n_zaero, &
https://github.com/ACME-Climate/MPAS/blob/403012d96cc0fb646215e0f3bbfc66b64fbf064d/src/core_cice/column/ice_colpkg.F90#L2569: call compute_shortwave_trcr(n_algae, nslyr, &

But shortwave_dEdd does not appear to be called from compute_shortwave_trcr

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

I had reported above that the out-of-the-box run (optimized) on cori was 'working' but stopped after running out of walltime. I let it run longer and it is failing here:

000: pionfput_mod.F90 128 1 64 0001-01-01_00:00:00
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: WARNING: Rearr optional argument is a pio2 feature, ignored in pio1
000: *** glibc detected *** /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec/bld/cesm.exe: double free or corruption (!prev): 0x0000000035aca200 ***
000: ======= Backtrace: =========
000: [0x4060574]
000: [0x4065387]
000: [0x3fa2b0b]
000: [0x1919cdd]
000: [0x19195ee]
000: [0x18ed0f5]
000: [0x18ff718]
000: [0x2698f37]
000: [0x41a417]
000: [0x40ad59]
000: [0x41783d]
000: [0x40134e]
000: [0x40331a1]
000: [0x4011b5]
000: ======= Memory map: ========
000: 00400000-04b3d000 r-xp 00000000 68f:70e56 144121261243536941 /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec/bld/cesm.exe
000: 04d3d000-0631f000 rwxp 0473d000 68f:70e56 144121261243536941 /global/cscratch1/sd/ndk/acme_scratch/A_WCYCL2000.ne30_oEC.corip1.33e40ec/bld/cesm.exe

Which looks very similar to how it failed once on edison (and for Az).

@akturner
Contributor

So the sea ice was not the issue?

@ndkeen
Contributor

ndkeen commented Feb 12, 2016

Well, I just rebuilt with DEBUG=TRUE and re-ran. Same issue. So if you want to ignore that it fails in debug... ?

I did find more info. There is a core file!

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/global/cscratch1/sd/ndk/acme_scratch/SMS.ne30_oEC.A_WCYCL2000.corip1_intel.m02'.
Program terminated with signal 6, Aborted.
#0 0x0000000008b8e25b in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
42 ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) where
#0 0x0000000008b8e25b in raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:42
#1 0x0000000008c85e21 in abort () at abort.c:92
#2 0x0000000008bb4d5e in for__signal_handler ()
#3 <signal handler called>
#4 0x000000000436c60a in ice_shortwave_mp_run_dedd_ ()
#5 0x000000000435619d in ice_colpkg_mp_colpkg_step_radiation_ ()
#6 0x000000000414db1c in cice_column_mp_column_radiation_ ()
#7 0x0000000004192e7c in cice_column_mp_cice_init_column_shortwave_ ()
#8 0x000000000427428f in cice_initialize_mp_cice_init_post_clock_advance_ ()
#9 0x000000000404d606 in ice_comp_mct_mp_ice_run_mct_ ()
#10 0x0000000000458d4e in component_mod::component_run (eclock=..., comp=..., infodata=...,
seq_flds_x2c_fluxes='Faxa_rain:Faxa_snow:Faxa_lwdn:Faxa_swndr:Faxa_swvdr:Faxa_swndf:Faxa_swvdf:Faxa_bcphidry:Faxa_bcphodry:Faxa_bcphiwet:Faxa_ocphidry:Faxa_ocphodry:Faxa_ocphiwet:Faxa_dstwet1:Faxa_dstwet2:Faxa_dstwet3:Fax'...,
seq_flds_c2x_fluxes='Faii_swnet:Fioi_swpen:Faii_taux:Fioi_taux:Faii_tauy:Fioi_tauy:Faii_lat:Faii_sen:Faii_lwup:Faii_evap:Fioi_melth:Fioi_meltw:Fioi_salt', ' ' <repeats 3965 times>, comp_prognostic=4294967295, comp_num=4,
timer_barrier='CPL:ICE_RUN_BARRIER', timer_comp_run='CPL:ICE_RUN', run_barriers=.FALSE., ymd=10101, tod=1800,
comp_layout='\000' <repeats 32 times>, .tmp.SEQ_FLDS_X2C_FLUXES.len_V$3592=4096, .tmp.SEQ_FLDS_C2X_FLUXES.len_V$3595=4096,
.tmp.TIMER_BARRIER.len_V$359a=19, .tmp.TIMER_COMP_RUN.len_V$359d=11, .tmp.COMP_LAYOUT.len_V$35a3=32)
at /global/project/projectdirs/acme/ndk/cori-work/wcy/cime/driver_cpl/driver/component_mod.F90:1044
#11 0x0000000000424b96 in cesm_comp_mod::cesm_run ()
at /global/project/projectdirs/acme/ndk/cori-work/wcy/cime/driver_cpl/driver/cesm_comp_mod.F90:2548
#12 0x0000000000437117 in cesm_driver ()
at /global/project/projectdirs/acme/ndk/cori-work/wcy/cime/driver_cpl/driver/cesm_driver.F90:107
#13 0x000000000040134e in main ()
(gdb)

@akturner
Contributor

Are line numbers available from the backtrace?
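If the executable was built with debug symbols, the PC values in the traceback can usually be mapped back to file and line with addr2line from GNU binutils. A minimal sketch, assuming GNU grep and sed are available; the function name traceback_addrs and the paths in the usage comment are hypothetical:

```shell
#!/usr/bin/env bash
# Sketch: read traceback lines on stdin and print each 16-digit PC value
# in the 0x... form that addr2line accepts, one address per line.
traceback_addrs() {
    # -o prints only the match, -w requires it to be a whole word,
    # so the "000:" rank prefix and routine names are ignored.
    grep -owE '[0-9A-Fa-f]{16}' | sed 's/^0*/0x/'
}

# Hypothetical usage from the run directory (paths are placeholders):
# traceback_addrs < cesm.log | xargs addr2line -f -e ../bld/cesm.exe
```

Frames that the traceback already reports as Unknown would most likely still come back as ?? from addr2line, since they have no debug info to resolve against.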

@jonbob
Contributor

jonbob commented Feb 16, 2016

@ndkeen and @akturner - I have some test runs submitted to cori to see if I can make any sense of this. I'll let you know once I find anything.

@rljacob
Member Author

rljacob commented Feb 16, 2016

Please see if PR #704 will allow a no-debug run on Cori. If so, close this and open a new issue for the debugging problem.

@jonbob
Contributor

jonbob commented Feb 16, 2016

@ndkeen - the submit script failed for me -- the only thing in the cesm log was "sh: srun: command not found". Did you have to change the machine files to get this to run?

@amametjanov
Member

My non-debug run on Cori successfully completed with

(seq_mct_drv): ===============       SUCCESSFUL TERMINATION OF CPL7-CESM ===============
(seq_mct_drv): ===============        at YMD,TOD =    10106       0      ===============
(seq_mct_drv): ===============  # simulated days (this run) =     5.000  ===============
(seq_mct_drv): ===============  compute time (hrs)          =     0.184  ===============
(seq_mct_drv): ===============  # simulated years / cmp-day =     1.790  ===============
(seq_mct_drv): ===============  pes min memory highwater  (MB)  482.518  ===============
(seq_mct_drv): ===============  pes max memory highwater  (MB)  733.397  ===============
(seq_mct_drv): ===============  pes min memory last usage (MB)   -0.001  ===============
(seq_mct_drv): ===============  pes max memory last usage (MB)   -0.001  ===============

Non-default settings:

  • ntasks_* = 1024
  • nthrds_* = 1
  • max_tasks_per_node = pes_pe_node = 2
  • MPAS-O patch from the run on Edison that is now in master
  • all module rm's in env_mach_specific commented out
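For reference, the task and thread counts above could be applied with xmlchange roughly as follows. This is a sketch only: the component list and the NTASKS_*/NTHRDS_* variable names in env_mach_pes.xml are assumed from the usual CESM conventions rather than taken from this thread, the function name print_pe_changes is hypothetical, and the function only prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Sketch: print xmlchange commands that set the task and thread counts
# for every component. Component and variable names are assumptions.
print_pe_changes() {
    local ntasks=$1 nthrds=$2 comp
    for comp in ATM LND ICE OCN CPL GLC ROF WAV; do
        echo "./xmlchange -file env_mach_pes.xml -id NTASKS_${comp} -val ${ntasks}"
        echo "./xmlchange -file env_mach_pes.xml -id NTHRDS_${comp} -val ${nthrds}"
    done
}

# Review the output, then run it from the case directory if it looks right:
# print_pe_changes 1024 1
```

Printing first keeps the step reviewable; the layout would need to be set before ./cesm_setup is run for it to take effect.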

@rljacob rljacob removed this from the v1.0 Alpha.3 milestone Feb 16, 2016
@rljacob
Member Author

rljacob commented Feb 16, 2016

Great! What's this about commenting out the module rm's? Is that something that should always be done for all ACME users on cori?

@amametjanov
Member

Unloading and loading of specific module versions in env_mach_specific is leading to job failures with sh: srun: command not found error in cesm.log for @bishtgautam, @jonbob and @amametjanov (all with bash shell). When using csh, jobs run okay. Reopened #618 to track the module issue there.
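Until the module handling is sorted out, a run script can at least fail early with a clearer message than the one that ends up in cesm.log. A minimal sketch, assuming a bash run script; the function name require_launcher is hypothetical and is not part of the ACME machine files:

```shell
#!/usr/bin/env bash
# Sketch: verify that the job launcher is on PATH before trying to use it.
# Prints the resolved location on success; prints a diagnostic to stderr
# and returns non-zero when the launcher cannot be found.
require_launcher() {
    local launcher=${1:-srun}
    if command -v "$launcher" >/dev/null 2>&1; then
        echo "found: $(command -v "$launcher")"
    else
        echo "error: '$launcher' not found on PATH=$PATH" >&2
        return 1
    fi
}

# Hypothetical usage near the top of the run script:
# require_launcher srun || exit 1
```

Printing PATH in the error makes it easier to see whether a module unload removed the directory that normally provides srun.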

@ndkeen
Contributor

ndkeen commented Feb 17, 2016

I'm trying this again out-of-the-box. So far, it's been running for 8 minutes, so I'm at least past the point where others have seen the "srun issue". I'm running in batch mode (as with create_test). I will continue debugging so that I can reproduce what others see. Also, I'm not sure how long this job will take or if it is safe/advisable to increase the number of nodes/cores.

@jonbob
Contributor

jonbob commented Feb 17, 2016

I ran a short run in debug mode that finished successfully overnight. And I got past the "srun not found" issue by adding the path to it in the run script. I'll try this morning with a larger number of cores and see if it will run for a few months.

rljacob added a commit that referenced this issue Feb 17, 2016
Provide default pe layouts for the WCYCL case that work out-of-the-box on
edison, cori and titan.

[BFB] - Bit-For-Bit

CSG-146, CSG-147, CSG-143
Fixes #698, Fixes #699, Fixes #701
@ndkeen
Contributor

ndkeen commented Feb 18, 2016

With the latest PE changes and my "fix" to the qx() arg, this test is finally successful for me.

rljacob pushed a commit that referenced this issue Feb 27, 2017
fix issue with pylint

code_checker wasn't being run from the correct directory, causing some silent failures.
Fixed that and two pylint issues found after it was fixed

Test suite:
Test baseline:
Test namelist changes:
Test status: [bit for bit, roundoff, climate changing]

Fixes #698

User interface changes?:

Code review:
jgfouca pushed a commit that referenced this issue Feb 27, 2018
Provide default pe layouts for the WCYCL case that work out-of-the-box on
edison, cori and titan.

[BFB] - Bit-For-Bit

CSG-146, CSG-147, CSG-143
Fixes #698, Fixes #699, Fixes #701
7 participants