UFSATM failure with Intel with debug flags #58

Closed
rsdunlapiv opened this issue Jan 13, 2020 · 36 comments

@rsdunlapiv commented Jan 13, 2020

We received the following error running UFSATM with Intel 19/MPT on Cheyenne with physics GFSv15p2. We are running in debug mode and a SIGFPE was caught inside the physics (see stack trace below).

CIME test:
SMS_D.C96.GFSv15p2.cheyenne_intel

Modules:

module load ncarenv/1.2 intel/19.0.2 esmf_libs mkl
module use /glade/work/turuncu/PROGS/modulefiles/esmfpkgs/intel/19.0.2
module load esmf-8.0.0-ncdfio-mpt-g mpt/2.19 netcdf/4.7.1 pnetcdf/1.11.1 ncarcompilers/0.5.0

Hash of UFS weather model:
ufs-community/ufs-weather-model@bde62f9

We can provide more information, as needed, on the initial conditions.

Stack trace:

37:MPT: #1  0x00002ad4deaafdb6 in mpi_sgi_system (
37:MPT: #2  MPI_SGI_stacktraceback (
37:MPT:     header=header@entry=0x7ffcd7c52b00 "MPT ERROR: Rank 37(g:37) received signal SIGFPE(8).\n\tProcess ID: 40083, Host: r2i3n29, Program: /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/ufs.exe\n\tMPT Version: HPE MPT 2.19  "...) at sig.c:340
37:MPT: #3  0x00002ad4deaaffb2 in first_arriver_handler (signo=signo@entry=8, 
37:MPT:     stack_trace_sem=stack_trace_sem@entry=0x2ad4eb6a0080) at sig.c:489
37:MPT: #4  0x00002ad4deab034b in slave_sig_handler (signo=8, siginfo=<optimized out>, 
37:MPT:     extra=<optimized out>) at sig.c:564
37:MPT: #5  <signal handler called>
37:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664
37:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330
37:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
37:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=..., 
37:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13, 
37:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)
37:MPT:     at /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.snoop/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143
37:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (
@pjpegion

@rsdunlapiv I believe that is just the place where an unphysical temperature causes the array index to go out of bounds (it is associated with the calculation of the saturation specific humidity).
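To illustrate that failure mode, here is a minimal Fortran sketch with hypothetical names and constants (this is not the actual gfdl_fv_sat_adj.F90 code): a saturation-vapor-pressure table lookup whose index is derived from temperature, so an unphysical temperature pushes the index outside the table and the Intel debug checks abort at the lookup.

! Hypothetical sketch (not the real gfdl_fv_sat_adj.F90 code).
! An unphysical temperature (NaN, or far below tmin) puts "it" outside
! [1, ntab]; with Intel debug flags (-check bounds, -fpe0) the run
! aborts at the lookup, matching the trace above.
module sat_lookup_sketch
  implicit none
  integer, parameter :: ntab = 2621      ! assumed table length
  real,    parameter :: tmin = 160.0     ! assumed lowest tabulated T (K)
  real :: es_table(ntab)                 ! tabulated saturation pressure (fill omitted)
contains
  function es_from_table(temp) result(es)
    real, intent(in) :: temp
    real    :: es
    integer :: it
    it = int(10.0 * (temp - tmin)) + 1   ! index derived from temperature
    if (it < 1 .or. it > ntab) then
      write(*,*) 'unphysical temperature, index out of range:', temp, it
      it = max(1, min(it, ntab))         ! clamp, for illustration only
    end if
    es = es_table(it)
  end function es_from_table
end module sat_lookup_sketch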

@climbfuji

Dusan merged an update to the ufs_public_release branch earlier today; part of the update was to address regression test failures in debug mode and to enable those tests for both v15p2 and v16beta as standard regression tests. These tests passed on Cheyenne with Intel and GNU and on Hera with Intel; they are based on the C96 configurations. See ufs-community/ufs-weather-model#25 and https://github.com/ufs-community/ufs-weather-model/blob/ufs_public_release/tests/rt.conf. A few questions: what resolution, setup (namelist etc.), and initial conditions are you using? Can you point me to the run directory on Cheyenne, please?

@jedwards4b

Some of these questions are answered in the CIME test name, SMS_D.C96.GFSv15p2.cheyenne_intel:
The resolution is C96
The CCPP suite is GFSv15p2
The machine is cheyenne
The compiler is intel

Initial conditions are 2019-09-09 00
Case directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp
Run directory is /glade/scratch/jedwards/SMS_D.C96.GFSv15p2.cheyenne_intel.G.grp/run

@jedwards4b

I am using
commit bde62f9116cc9bdebaae0c6057090fe468eae917
Author: Dom Heinzeller climbfuji@ymail.com
Date: Mon Jan 13 06:46:37 2020 -0700

This is the latest commit available.

@jedwards4b

I get a similar error on stampede using v16beta at C96 resolution:
forrtl: error (65): floating invalid
Image PC Routine Line Source
ufs.exe 0000000004DC4E6F Unknown Unknown Unknown
libpthread-2.17.s 00002AD70FCA05D0 Unknown Unknown Unknown
ufs.exe 0000000002C9304B Unknown Unknown Unknown
ufs.exe 0000000002B48CA1 Unknown Unknown Unknown
ufs.exe 0000000002AA9A85 Unknown Unknown Unknown
ufs.exe 0000000002A9A8D3 ccpp_static_api_m 147 ccpp_static_api.F90
ufs.exe 0000000002AA02B9 ccpp_driver_mp_cc 234 CCPP_driver.F90
ufs.exe 000000000063684F atmos_model_mod_m 338 atmos_model.F90
ufs.exe 000000000062A0DE module_fcst_grid_ 707 module_fcst_grid_comp.F90

@jedwards4b

On stampede using v15p2 at C96:
floating divide by zero
Image PC Routine Line Source
ufs.exe 0000000004DBBC8F Unknown Unknown Unknown
libpthread-2.17.s 00002AFB84DDD5D0 Unknown Unknown Unknown
ufs.exe 0000000002CCDE5E Unknown Unknown Unknown
ufs.exe 0000000002C22E39 Unknown Unknown Unknown
ufs.exe 0000000002B74C17 Unknown Unknown Unknown
ufs.exe 0000000002AA3D13 Unknown Unknown Unknown
ufs.exe 0000000002A9A72F ccpp_static_api_m 145 ccpp_static_api.F90
ufs.exe 0000000002A9F502 ccpp_driver_mp_cc 197 CCPP_driver.F90
ufs.exe 0000000000634AB0 atmos_model_mod_m 295 atmos_model.F90
ufs.exe 000000000062A0DE module_fcst_grid_ 707 module_fcst_grid_comp.F90

@climbfuji

I am at AMS this week and don't have much time to look into this. The easiest way forward, in my opinion, is to compare the run directory (input files, namelist, ...) of your CIME setup against the ufs-weather-model regression test setup on Cheyenne with the Intel compiler (which uses rt.sh to run and which completes successfully). I can point you to a directory containing a successful run if that helps.

@jedwards4b

Please point me to a successful run with debug flags enabled and I will compare.

@climbfuji

Jim, see

/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v16beta_debug_prod/

These are C96 test cases as in your CIME setup, and both run to completion for a 6h forecast when the model is compiled with DEBUG=Y.

@jedwards4b

@pjpegion I am ready to enlist your help. Instructions for running the tests on cheyenne are here:
https://docs.google.com/document/d/13nvpIS_q87ttjjHwB9f8OFXX7YI00DM5O9V-gk04yAY/edit?usp=sharing

@ligiabernardet

@llpcarson @JulieSchramm Let's run this test on Cheyenne and use it to review (and update, if needed) the Weather Model User's Guide coverage of the directory structure and the lists of input/output files. Keep in mind that this run uses CIME; the WM UG should be relevant both to those using CIME and to those running the model in other ways.

@climbfuji

Did the comparison with the run directories that I gave @jedwards4b lead to any insight? I am not sure it makes sense to have more people try to run and debug this unless we understand why the regression tests in the ufs-weather-model run to completion in DEBUG mode while the CIME runs don't.

@jedwards4b

I found a couple of differences that I didn't understand and tried changing my values to yours - it didn't make any difference. It could just be due to the different initial conditions. Or it could be due to different build flags - but I didn't see any build output in the directory you pointed me to.

I think that it does make sense to have @pjpegion and @ligiabernardet and others become familiar with cime build and testing even if it doesn't lead to any insights regarding the test failure.

@climbfuji commented Jan 15, 2020 via email

@pjpegion

@jedwards4b I'm following your instructions and have run into two problems so far.
1. I had to add --project to the ./create_test command line.
2. Now I get an error:
Case dir: /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
Errors were:
Building test for SMS in directory /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try
ERROR: /glade/work/pegion/UFS/ufs-mrweather-app/src/model/FV3/cime/cime_config/buildnml /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try FAILED, see above

and there is more info in /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/TestStatus.log

@jedwards4b

@pjpegion This looks like a python version issue - what python are you using?

@jedwards4b

I think that @uturuncoglu is using 2.7.13 and hasn't tested with python3 yet. I will fix this, but in the meantime could you please try with the default python on cheyenne?

@pjpegion

I see my python is defaulting to /glade/u/home/pegion/miniconda3/bin/python
I will change that and try again.
Thanks.

@climbfuji

This release is only compatible with Python 2.7.x (also because CCPP works only with those versions).

@arunchawla-NOAA commented Jan 15, 2020 via email

@climbfuji

For the next anticipated release of the UFS (with SAR etc.) later this year we will hopefully be able to support Python 3. I don't see a chance to rewrite the code to work with Python 3 for this release, and what is more, it will probably take years for Python 2.7 to completely disappear from HPCs and standard OS installations.

@jedwards4b

CIME is fully compatible with and tested with python 3.6 as well as python 2.7. The fv3_interface issue should be easy to fix.

@pjpegion

@jedwards4b The model now builds and the run starts, but it crashes during initialization. The log file is
/glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/run/ufs.log.238590.chadmin1.ib0.cheyenne.ucar.edu.200115-095701

@climbfuji

Can someone point me to the job submission script for this job, please? Thanks ...

@pjpegion

@climbfuji I am running out of /glade/work/pegion/UFS/ufs-mrweather-app/cime/scripts
command is ./create_test SMS_D_Lh5.C96.GFSv15p2 --workflow ufs-mrweather_wo_post --test-id try --project P93300042

@climbfuji

Thanks, but I don't know how to find the actual job submission script (the file that contains the #PBS configuration entries and the mpiexec_mpt calls) from there. Maybe the CIME folks can help? We should always write/copy this job submission script into the run directory under a filename like job_card, because many developers who are used to rerunning parts of the workflow manually will want it. It is also useful for documentation purposes, in my opinion.

@jedwards4b

@pjpegion Now we are in the same place - I am trying to understand and fix this failure.

86:MPT: #6  0x0000000002d0e761 in fv_sat_adj_mp_fv_sat_adj_work_ ()                                                                         
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:664 
86:MPT: #7  0x0000000002d0b276 in fv_sat_adj_mp_fv_sat_adj_run_ ()                                                                          
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/physics/gfdl_fv_sat_adj.F90:330 
86:MPT: #8  0x0000000002be87db in ccpp_fv3_gfs_v15p2_fast_physics_cap_mp_fv3_gfs_v15p2_fast_physics_run_cap_ ()                             
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_FV3_GFS_v15p2_fast_physics_cap.F90:106
86:MPT: #9  0x0000000002bdf0df in ccpp_static_api::ccpp_physics_run (cdata=...,                                                             
86:MPT:     suite_name=..., group_name=..., ierr=0, .tmp.SUITE_NAME.len_V$97da=13,                                                          
86:MPT:     .tmp.GROUP_NAME.len_V$97dd=12)                                                                                                  
86:MPT:     at /glade/scratch/pegion/SMS_D_Lh5.C96.GFSv15p2.cheyenne_intel.try/bld/atm/obj/FV3/ccpp/physics/ccpp_static_api.F90:143         
86:MPT: #10 0x0000000000d4340a in fv_mapz_mod::lagrangian_to_eulerian (                                                                  

@climbfuji

If I had to guess, I would say initial conditions. This is the first time the saturation adjustment is called as part of the dynamics, before any physics has run, i.e. right after reading the initial conditions. I am downloading the run dirs for my rt.sh run and your cime run to my laptop to take a closer look at the diffs.

@jedwards4b

@climbfuji The job submission script is in the case directory:
./case.submit

If you want to see what the script will submit, run
./preview_run

By default we submit the chgres job and then the model - if you only want to submit the model, use
./case.submit --job case.test

@jedwards4b

@climbfuji please point me to your build log - I want to confirm that we are using the same flags to build ccpp.

@climbfuji

/glade/work/heinzell/fv3/ufs-weather-model/ufs-weather-model-public-release-20200114/tests/log_cheyenne.intel/compile_2.log

is the log for the compile step of the debug tests.

@jedwards4b

I did find a problem with the build and am working on it, but I don't think that it is related to this run failure and agree that there seems to be a problem with initial conditions.

@jedwards4b

It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
TASKID FILE LINE VALUE INDEX
89: moninedmf.f 412 -2.213609288845146E+021 2
90: moninedmf.f 412 -4.427218577690292E+021 6
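For context, a hedged sketch of why a negative stress aborts the run when floating-point exceptions are trapped, assuming the stress feeds a square root (as in a friction-velocity calculation); this is not the actual moninedmf.f code and the names are made up.

! Hypothetical illustration only: a negative surface stress (the value
! reported by the debug print above) makes sqrt() produce NaN, which
! the debug-mode floating-point traps turn into an abort.
program negative_stress_sketch
  implicit none
  real :: stress, ustar_like
  stress = -2.213609288845146e21           ! value seen at moninedmf.f line 412
  if (stress < 0.0) write(*,*) 'negative stress from surface inputs:', stress
  ustar_like = sqrt(max(stress, 0.0))      ! defensive clamp while diagnosing
  write(*,*) 'ustar-like value:', ustar_like
end program negative_stress_sketch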

@jedwards4b

I was able to run to completion by using the initial conditions in
/glade/work/heinzell/fv3/debug_tests_for_cime_20200114/fv3_ccpp_gfs_v15p2_debug_prod/INPUT

This points to a problem in chgres or in the initial condition files themselves. I'm not sure where to go from here. @uturuncoglu @climbfuji

@GeorgeGayno-NOAA

> It turns out that correcting the issue with build flags changed the error - it's making it past initialization now and crashing a little further into the run. The error is now in file moninedmf.f where the value of stress is < 0 in a couple of places:
> TASKID FILE LINE VALUE INDEX
> 89: moninedmf.f 412 -2.213609288845146E+021 2
> 90: moninedmf.f 412 -4.427218577690292E+021 6

Dusan had the same error a couple of months ago. It was traced to ice concentrations greater than 1.0 (such as 1.0000000000004) in the initial surface file from chgres. A fix was added. Can you merge the latest chgres from 'develop' into your branch?
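As a minimal sketch of that kind of fix, assuming a hypothetical clamp applied to the interpolated ice-concentration field before the surface file is written (hypothetical names, not the actual chgres code):

! Hypothetical sketch: horizontal interpolation can overshoot slightly
! (e.g. 1.0000000000004), so the ice concentration is clamped back to
! its physical range before being written to the initial surface file.
subroutine clamp_ice_concentration(fice, npts)
  implicit none
  integer, intent(in)    :: npts
  real,    intent(inout) :: fice(npts)
  integer :: i
  do i = 1, npts
    fice(i) = min(1.0, max(0.0, fice(i)))
  end do
end subroutine clamp_ice_concentration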

@jedwards4b

@arunchawla-NOAA I have opened issue NOAA-EMC/NCEPLIBS#21 but I am not sure who to assign.
