UFSATM failure with Intel with debug flags #58

rsdunlapiv · 2020-01-13T22:26:23Z

We received the following error running UFSATM with Intel 19/MPT on Cheyenne with physics GFSv15p2. We are running in debug mode and a SIGFPE was caught inside the physics (see stack trace below).

CIME test:
SMS_D.C96.GFSv15p2.cheyenne_intel

Modules:

module load ncarenv/1.2 intel/19.0.2 esmf_libs mkl
module use /glade/work/turuncu/PROGS/modulefiles/esmfpkgs/intel/19.0.2
module load esmf-8.0.0-ncdfio-mpt-g mpt/2.19 netcdf/4.7.1 pnetcdf/1.11.1 ncarcompilers/0.5.0

Hash of UFS weather model:
ufs-community/ufs-weather-model@bde62f9

We can provide more information, as needed, on the initial conditions.

Stack trace:

37:MPT: #1  0x00002ad4deaafdb6 in mpi_sgi_system (
37:MPT: #2  MPI_SGI_stacktraceback (
37:MPT:     header=header@entry=0x7ffcd7c52b00 "MPT ERROR: Rank 37(g:37) received signal SIGFPE(8).\n\tProcess ID: 40083, Host: r2i3n29, Program: /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19  "...) at sig.c:340
37:MPT: #3  0x00002ad4deaaffb2 in first_arriver_handler (signo=signo@entry=8, 
37:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2ad4eb6a0080) at sig.c:489
37:MPT: #4  0x00002ad4deab034b in slave_sig_handler (signo=8, siginfo=<optimized out>, 
37:MPT:     extra=<optimized out>) at sig.c:564
37:MPT: #5  <signal handler called>
37:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664
37:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330
37:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
37:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=..., 
37:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13, 
37:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143
37:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (


pjpegion · 2020-01-13T22:32:24Z

@rsdunlapiv I believe that is just the place where an un-physical temperature triggers the array index to go out of bounds (it is associated with the calculation of the saturation specific humidity.)
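To illustrate the failure mode described here, the following Python sketch shows how a lookup index derived from temperature can leave the table when the temperature is unphysical. The table bounds, spacing, and function name are assumptions made for illustration; they are not taken from gfdl_fv_sat_adj.F90.

```python
import math

# Hypothetical table layout, chosen for illustration only.
T_MIN = 113.16      # lowest tabulated temperature in K (assumed)
STEP = 0.1          # table spacing in K (assumed)
N_ENTRIES = 2621    # number of table entries (assumed)


def table_index(temp_k):
    """Return the lookup index for temp_k, or raise if it is unusable."""
    # NaN or infinity cannot be converted to a meaningful index.
    if not math.isfinite(temp_k):
        raise ValueError("non-finite temperature: {!r}".format(temp_k))
    index = int((temp_k - T_MIN) / STEP)
    # An unphysical temperature maps to an index outside the table.
    if not 0 <= index < N_ENTRIES:
        raise IndexError(
            "temperature {} K maps to index {}, outside 0..{}".format(
                temp_k, index, N_ENTRIES - 1
            )
        )
    return index
```

A guard of this kind reports the offending value instead of letting the lookup read past the table, which can make it easier to tell bad input data apart from a defect in the scheme itself.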

climbfuji · 2020-01-13T23:16:49Z

Dusan merged an update to the ufs_public_release earlier today. Part of the update was to address regression test failures in debug mode and to enable those tests for both 15p2 and 16beta as standard regression tests. These tests passed on Cheyenne with Intel and GNU and on Hera with Intel; they are based on the C96 configurations. See ufs-community/ufs-weather-model#25 and https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf. Many questions: resolution, setup (namelist etc.), initial conditions? Can you point me to the run directory on Cheyenne, please?

jedwards4b · 2020-01-14T20:54:20Z

Some of these questions are answered in the cime test name: SMS_D.C96.GFSv15p2.cheyenne_intel
The resolution is C96
The CCPP is v15p2
Machine is cheyenne
Compiler is intel

Initial conditions are 2019-09-09 00
Case directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp
Run directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp/run

jedwards4b · 2020-01-14T20:56:48Z

I am using
commit bde62f9116cc9bdebaae0c6057090fe468eae917
Author: Dom Heinzeller climbfuji@ymail.com
Date: Mon Jan 13 06:46:37 2020 -0700

Which is the latest available.

jedwards4b · 2020-01-14T20:58:18Z

I get a similar error on stampede using v16beta at C96 resolution:
forrtl: error (65): floating invalid
Image PC Routine Line Source
ufs.exe 0000000004DC4E6F Unknown Unknown Unknown
libpthread-2.17.s 00002AD70FCA05D0 Unknown Unknown Unknown
ufs.exe 0000000002C9304B Unknown Unknown Unknown
ufs.exe 0000000002B48CA1 Unknown Unknown Unknown
ufs.exe 0000000002AA9A85 Unknown Unknown Unknown
ufs.exe 0000000002A9A8D3 ccpp_static_api_m 147 ccpp_static_api.F90
ufs.exe 0000000002AA02B9 ccpp_driver_mp_cc 234 CCPP_driver.F90
ufs.exe 000000000063684F atmos_model_mod_m 338 atmos_model.F90
ufs.exe 000000000062A0DE module_fcst_grid_ 707 module_fcst_grid_comp.F90

jedwards4b · 2020-01-14T21:00:34Z

On stampede using v15p2 at C96:
floating divide by zero
Image PC Routine Line Source
ufs.exe 0000000004DBBC8F Unknown Unknown Unknown
libpthread-2.17.s 00002AFB84DDD5D0 Unknown Unknown Unknown
ufs.exe 0000000002CCDE5E Unknown Unknown Unknown
ufs.exe 0000000002C22E39 Unknown Unknown Unknown
ufs.exe 0000000002B74C17 Unknown Unknown Unknown
ufs.exe 0000000002AA3D13 Unknown Unknown Unknown
ufs.exe 0000000002A9A72F ccpp_static_api_m 145 ccpp_static_api.F90
ufs.exe 0000000002A9F502 ccpp_driver_mp_cc 197 CCPP_driver.F90
ufs.exe 0000000000634AB0 atmos_model_mod_m 295 atmos_model.F90
ufs.exe 000000000062A0DE module_fcst_grid_ 707 module_fcst_grid_comp.F90

climbfuji · 2020-01-15T01:33:19Z

I am at AMS this week and don't have much time to look into this. The easiest way forward imo is to compare the run directory (input files, namelist, ...) of your CIME setup to the ufs-weather-model regression test setup (which uses rt.sh to run and which completes successfully) on Cheyenne using the Intel compiler. I can point you to a directory containing a successful run if that helps.
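A first pass at the comparison suggested here could be scripted. The sketch below uses Python's standard `filecmp` module to list files that exist in only one run directory and files that exist in both but differ. The function name and the returned dictionary keys are hypothetical, and the comparison covers the top level of each directory only.

```python
import filecmp


def summarize_diff(dir_a, dir_b):
    """Return names only in dir_a, only in dir_b, and common files that differ.

    Only the top level of each directory is compared. filecmp.dircmp uses a
    shallow comparison: files whose type, size, and modification time all
    match are treated as equal without reading their contents.
    """
    cmp = filecmp.dircmp(dir_a, dir_b)
    return {
        "only_in_a": sorted(cmp.left_only),
        "only_in_b": sorted(cmp.right_only),
        "differing": sorted(cmp.diff_files),
    }
```

Calling `summarize_diff(cime_run_dir, rt_run_dir)` with the two run directories gives a short list of candidates (namelists, input files) to inspect by hand with a regular diff tool.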

jedwards4b · 2020-01-15T01:44:19Z

Please point me to a successful run with debug flags enabled and I will compare.

climbfuji · 2020-01-15T03:10:44Z

Jim, see

/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v16beta_debug_prod/

These are C96 test cases as in your CIME setup, and both run to completion for a 6h forecast when the model is compiled with DEBUG=Y.

jedwards4b · 2020-01-15T12:32:39Z

@pjpegion I am ready to enlist your help. Instructions for running the tests on cheyenne are here:
https://docs.google.com/document/d/13nvpIS_q87ttjjHwB9f8OFXX7YI00DM5O9V-gk04yAY/edit?usp=sharing

ligiabernardet · 2020-01-15T12:43:29Z

@llpcarson @JulieSchramm Let's run this test on Cheyenne and use it to review (and update, if needed) the Weather Model User's Guide on the directory structure and lists of input/output files. Keep in mind that this run uses CIME; the WM UG should be relevant to those using CIME, as well as to those running the model in other ways.

climbfuji · 2020-01-15T13:31:57Z

Did the comparison with the run directories that I gave @jedwards4b lead to any insight? I am not sure it makes sense to have more people try to run and debug this unless we understand why the regression tests in the ufs-weather-model run to completion in DEBUG mode while the CIME runs don't.

jedwards4b · 2020-01-15T13:50:31Z

I found a couple of differences that I didn't understand and tried changing my values to yours - it didn't make any difference. It could just be due to the different initial conditions. Or it could be due to different build flags - but I didn't see any build output in the directory you pointed me to.

I think that it does make sense to have @pjpegion and @ligiabernardet and others become familiar with cime build and testing even if it doesn't lead to any insights regarding the test failure.

climbfuji · 2020-01-15T13:53:36Z

Ok, thanks for the info. I am happy to take a look as well.


pjpegion · 2020-01-15T16:37:55Z

@jedwards4b I'm following your instructions and I ran into two problems so far.
1- had to add --project to the ./create_test line
2- Now I get an error
Case dir: /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
Errors were:
Building test for SMS in directory /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
ERROR: /glade/work/pegion/UFS/ufs-mrweather-app/src/model/FV3/cime/cime_config/buildnml /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try FAILED, see above

and there is more info in /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/TestStatus.log

jedwards4b · 2020-01-15T16:41:24Z

@pjpegion This looks like a python version issue - what python are you using?

jedwards4b · 2020-01-15T16:44:24Z

I think that @uturuncoglu is using 2.7.13 and hasn't tested with python3 yet. I will fix this, but in the meantime could you please try with the default python on cheyenne?

pjpegion · 2020-01-15T16:45:26Z

I see my python is defaulting to /glade/u/home/pegion/miniconda3/bin/python
I will change that and try again.
Thanks.

climbfuji · 2020-01-15T16:50:30Z

This release is only compatible with Python 2.7.x (also because CCPP works only with those versions).
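Since the failure above came from an unexpected interpreter being first on the PATH, a quick check before running the scripts can help. This is a minimal sketch that assumes the Python 2.7.x requirement stated in this comment; the function names are hypothetical and are not part of CIME or CCPP.

```python
import sys


def is_python27(version_info=None):
    """Return True when the given (or running) interpreter is Python 2.7.x."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (2, 7)


def describe_interpreter():
    """Return a one-line description of the interpreter that is running."""
    return "{} (Python {}.{}.{})".format(sys.executable, *sys.version_info[:3])
```

Printing `describe_interpreter()` shows which executable was picked up, for example a conda installation in a home directory rather than the system default.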

arunchawla-NOAA · 2020-01-15T16:51:59Z

Python 2.7 is getting deprecated this year. Does it make sense to limit to an unsupported version of Python? We are moving to Python 3 everywhere.


climbfuji · 2020-01-15T16:54:25Z

For the next anticipated release of the ufs (with SAR etc) later this year we will hopefully be able to support Python 3. I don't see a chance to rewrite the code to work with Python 3, and what is more it will probably take years for Python 2.7 to completely disappear from HPCs and standard OS installations.

jedwards4b · 2020-01-15T16:54:50Z

CIME is fully compatible with and tested with python 3.6 as well as python 2.7. The fv3_interface issue should be easy to fix.

pjpegion · 2020-01-15T17:27:34Z

@jedwards4b The model now builds and the run starts, but the model crashes in initialization. The log file is
/glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/run/ufs.log.238590.chadmin1.ib0.cheyenne.ucar.edu.200115-095701

climbfuji · 2020-01-15T17:36:51Z

Can someone point me to the job submission script for this job, please? Thanks ...

pjpegion · 2020-01-15T17:44:31Z

@climbfuji I am running out of /glade/work/pegion/UFS/ufs-mrweather-app/cime/scripts
command is ./create_test SMS_D_Lh5.C96.GFSv15p2 --workflow ufs-mrweather_wo_post --test-id try --project P93300042

climbfuji · 2020-01-15T17:49:09Z

Thanks, but I don't know how to find the actual job submission script (the file that contains the #PBS configuration entries and the mpiexec_mpt calls) from there. Maybe the CIME folks can help? We should always write/copy this job submission script into the run directory using a filename like job_card, because many developers who are used to rerunning some of the stuff manually will want this. And it is also good for documentation purposes in my opinion.

jedwards4b · 2020-01-15T17:53:19Z

@pjpegion Now we are in the same place - I am trying to understand and fix this failure.

86:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()                                                                         
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664 
86:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()                                                                          
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330 
86:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()                             
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
86:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=...,                                                             
86:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13,                                                          
86:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)                                                                                                  
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143         
86:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (

climbfuji · 2020-01-15T17:56:31Z

If I had to guess I would say initial conditions. It's the first time it is calling the saturation adjustment as part of the dynamics before doing any physics, i.e. right after reading the initial conditions. I am downloading the run dirs for my rt.sh run and your cime run to my laptop to take a closer look at the diffs.

jedwards4b · 2020-01-15T17:57:21Z

@climbfuji The job submission script is in the case directory:
./case.submit

If you want to see what the script will submit you would run
./preview_run

By default we will submit the chgres and then the model - if you only want to submit the model use
./case.submit --job case.test

jedwards4b · 2020-01-15T17:58:20Z

@climbfuji please point me to your build log - I want to confirm that we are using the same flags to build ccpp.

climbfuji · 2020-01-15T18:00:26Z

/glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-public-release-20200114/tests/log_cheyenne.intel/compile_2.log

is the log for the debug tests (compile step)

jedwards4b · 2020-01-16T00:00:30Z

I did find a problem with the build and am working on it, but I don't think that it is related to this run failure and agree that there seems to be a problem with initial conditions.

jedwards4b · 2020-01-16T01:28:47Z

It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
TASKID FILE LINE VALUE INDEX
89: moninedmf.f 412 -2.213609288845146E+021 2
90: moninedmf.f 412 -4.427218577690292E+021 6

jedwards4b · 2020-01-16T15:52:43Z

I was able to run to completion by using the initial conditions in
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/INPUT

This points to a problem in chgres or in the initial condition files themselves. I'm not sure where to go from here. @uturuncoglu @climbfuji

GeorgeGayno-NOAA · 2020-01-16T17:30:02Z

> It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
> TASKID FILE LINE VALUE INDEX
> 89: moninedmf.f 412 -2.213609288845146E+021 2
> 90: moninedmf.f 412 -4.427218577690292E+021 6

Dusan had the same error a couple months ago. It was traced to ice concentrations greater than 1.0 (such as 1.0000000000004) in the initial surface file from chgres. A fix was added. Can you merge the latest chgres from 'develop' to your branch?
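The kind of problem described here, a fraction that exceeds 1.0 by rounding error, can be screened for before the model starts. The sketch below is an illustration only and is not the fix that was added to chgres; the function name and the tolerance are assumptions.

```python
import math


def clamp_fraction(values, tol=1.0e-6):
    """Clamp fractions to [0, 1], rejecting values that are clearly wrong.

    Values within tol of the valid range are treated as rounding noise and
    clamped; anything further out, or non-finite, raises ValueError.
    """
    cleaned = []
    for i, v in enumerate(values):
        if not math.isfinite(v) or v < -tol or v > 1.0 + tol:
            raise ValueError("value {!r} at index {} is not a fraction".format(v, i))
        cleaned.append(min(1.0, max(0.0, v)))
    return cleaned
```

For example, `clamp_fraction([0.0, 0.5, 1.0000000000004])` returns `[0.0, 0.5, 1.0]`, while a value such as 1.5 is reported as an error rather than silently corrected.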

jedwards4b · 2020-01-16T17:46:58Z

@arunchawla-NOAA I have opened issue NOAA-EMC/NCEPLIBS#21 but I am not sure who to assign.

rsdunlapiv assigned arunchawla-NOAA Jan 13, 2020

jedwards4b mentioned this issue Jan 16, 2020

Problem with chgres_cube #61

Closed

jedwards4b closed this as completed Jan 16, 2020

jedwards4b mentioned this issue Jan 16, 2020

Update chgres to latest develop NOAA-EMC/NCEPLIBS#21

Closed

jedwards4b mentioned this issue Jan 23, 2020

Test SMS_Lh3_D.C96.GFSv15p2.cheyenne_intel failing #69

Closed
