Should we add floating-point traps to our DEBUG GNU builds? `-ffpe-trap=invalid,zero,overflow` #1231

ndkeen · 2017-01-20T18:29:54Z

For DEBUG Intel builds, we use the -fpe0 flag, which stops the code on invalid, divide-by-zero, and overflow exceptions. However, I don't see these traps enabled for GNU DEBUG builds.
Should we add -ffpe-trap=invalid,zero,overflow?

This is from the man page for GNU Fortran:

-ffpe-trap=list

Specify a list of floating point exception traps to enable. On most
systems, if a floating point exception occurs and the trap for that
exception is enabled, a SIGFPE signal will be sent and the program
being aborted, producing a core file useful for debugging. list is a
(possibly empty) comma-separated list of the following exceptions:
`invalid' (invalid floating point operation, such as SQRT(-1.0)),
`zero' (division by zero), `overflow' (overflow in a floating point
operation), `underflow' (underflow in a floating point operation),
`inexact' (loss of precision during operation), and `denormal'
(operation performed on a denormal value). The first five exceptions
correspond to the five IEEE 754 exceptions, whereas the last one
(`denormal') is not part of the IEEE 754 standard but is available on
some common architectures such as x86.

The first three exceptions (`invalid', `zero', and `overflow') often
indicate serious errors, and unless the program has provisions for
dealing with these exceptions, enabling traps for these three
exceptions is probably a good idea.

Many, if not most, floating point operations incur loss of precision
due to rounding, and hence the ffpe-trap=inexact is likely to be
uninteresting in practice.

By default no exception traps are enabled.
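
The difference between an untrapped and a trapped exception can be seen without a Fortran compiler. The sketch below is only a loose analogy: it uses Python's `decimal` module, whose contexts have per-signal trap switches, and it says nothing about how gfortran implements `-ffpe-trap`. The helper `evaluate` and the `cases` table are illustrative names, not part of any build system discussed here.

```python
from decimal import Decimal, DivisionByZero, InvalidOperation, Overflow, localcontext

# decimal's names for the three conditions that -ffpe-trap=invalid,zero,overflow lists.
SIGNALS = (InvalidOperation, DivisionByZero, Overflow)


def evaluate(operation, trap):
    """Run operation() with the three signals trapped or untrapped; describe the outcome."""
    with localcontext() as ctx:
        for signal in SIGNALS:
            ctx.traps[signal] = trap
        ctx.clear_flags()
        try:
            result = operation()
        except SIGNALS as exc:
            # Trap enabled: the computation stops at the offending operation.
            return "trapped: " + type(exc).__name__
        # Trap disabled: a NaN or Infinity is returned and only a status flag is set.
        raised = [s.__name__ for s in SIGNALS if ctx.flags[s]]
        return f"{result} (flags: {', '.join(raised)})"


cases = {
    "sqrt(-1)": lambda: Decimal(-1).sqrt(),
    "1/0": lambda: Decimal(1) / Decimal(0),
    "overflow": lambda: Decimal("1e999999") * Decimal("1e999999"),
}

for name, operation in cases.items():
    print(name, "| untrapped:", evaluate(operation, trap=False))
    print(name, "| trapped:  ", evaluate(operation, trap=True))
```

With traps off, each case quietly yields NaN or Infinity that can propagate through later arithmetic; with traps on, the failure is reported at the operation that caused it, which is the behavior a DEBUG build would want.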

This is from the man page for Intel Fortran:

-fpe0:

Floating-point invalid, divide-by-zero, and overflow exceptions are
enabled throughout the application when the main program is compiled
with this value. If any such exceptions occur, execution is
aborted. This option causes denormalized floating-point results to be
set to zero. Underflow results will also be set to zero, unless you
override this by explicitly specifying option no-ftz or -fp-model
precise (Linux* and OS X*) or option /Qftz- or /fp:precise (Windows*).

Underflow results from SSE instructions, as well as x87 instructions,
will be set to zero. By contrast, option [Q]ftz only sets SSE
underflow results to zero.

Sets option -fp-speculation=strict (Linux* and OS X*) or
/Qfp-speculation:strict (Windows*) for any program unit compiled with
-fpe0 (Linux* and OS X*) or /fpe:0 (Windows*). This disables certain
optimizations in cases where speculative execution of floating-point
operations could lead to floating-point exceptions that would not
occur in the absence of speculation.  For example, this may prevent
the vectorization of some loops containing conditionals.

To get more detailed location information about where the error
occurred, use option traceback.

The default is -fpe3:
All floating-point exceptions are disabled.  Floating-point
underflow is gradual, unless you explicitly specify a compiler option
that enables flush-to-zero, such as [Q]ftz, O3, or O2. This setting
provides full IEEE support.

mt5555 · 2017-01-21T19:49:11Z

I vote yes, we should enable this for GNU.

golaz · 2017-01-23T18:01:09Z

I'm in favor as well.

rljacob · 2017-01-23T18:12:01Z

Yes. I don't see a problem with making this the default in the gnu entry for config_compilers.xml.
I think we just need to change this line:
<ADD_FFLAGS DEBUG="TRUE"> -g -Wall </ADD_FFLAGS>

ndkeen · 2017-01-26T19:11:16Z

I noted that we were light on debug flags for GNU, so I added the ones mentioned above as well as -Og, -fbacktrace, and -fcheck=bounds, so that a debug build for GNU would look like:

ifeq ($(DEBUG), TRUE)
   FFLAGS +=  -g -Wall -Og -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow
   CFLAGS +=  -g -Wall -Og -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow
endif

I read that -Og differs from -O0 in that -Og will try optimizations that should not affect debugging. I re-ran acme_dev; only the DEBUG=TRUE tests of acme-dev would be affected.
One test ran out of time (resubmitted with more time). One test stopped with no reported reason (resubmitted). However, one test did fail with a floating-point exception:

47:  rtm decomp info proc =           47  begr =       128482  endr =       131186  numr =         2705
47:                  proc =           47  begrl=        64763  endrl=        65406  numrl=          644
47:                  proc =           47  begro=        63720  endro=        65780  numro=         2061
47: 
47: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
47: 
47: Backtrace for this error:
47: #0  0x1d3a4cf in ???
47:     at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
47: #1  0xe2c24c in __hetfrz_classnuc_cam_MOD_hetfrz_classnuc_cam_calc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/hetfrz_classnuc_cam.F90:967
47: #2  0xb68ef3 in __microp_aero_MOD_microp_aero_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/microp_aero.F90:852
47: #3  0x5d6459 in tphysbc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/physpkg.F90:2407
47: #4  0x5de82a in __physpkg_MOD_phys_run1
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/physpkg.F90:1010
47: #5  0x4db075 in __cam_comp_MOD_cam_run1
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/control/cam_comp.F90:250
47: #6  0x4d4123 in __atm_comp_mct_MOD_atm_run_mct
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/cpl/atm_comp_mct.F90:522
47: #7  0x42ac24 in __component_mod_MOD_component_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/component_mod.F90:653
47: #8  0x4192b7 in __cesm_comp_mod_MOD_cesm_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/cesm_comp_mod.F90:3251
47: #9  0x429c2c in cesm_driver
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/cesm_driver.F90:67

Here is the code:

   call outfld('BCFREZIMM', nnuccc_bc, pcols, lchnk)
   call outfld('BCFREZCNT', nnucct_bc, pcols, lchnk)
   call outfld('BCFREZDEP', nnudep_bc, pcols, lchnk)

   call outfld('NIMIX_IMM', niimm_bc+niimm_dst, pcols, lchnk)
   call outfld('NIMIX_CNT', nicnt_bc+nicnt_dst, pcols, lchnk)   
   call outfld('NIMIX_DEP', nidep_bc+nidep_dst, pcols, lchnk) !ndk SIGFPE: Floating-point exception - erroneous arithmetic operation. GNU/KNL

   call outfld('DSTNICNT', nicnt_dst, pcols, lchnk)
   call outfld('DSTNIDEP', nidep_dst, pcols, lchnk)
   call outfld('DSTNIIMM', niimm_dst, pcols, lchnk)

   call outfld('BCNICNT', nicnt_bc, pcols, lchnk)
   call outfld('BCNIDEP', nidep_bc, pcols, lchnk)
   call outfld('BCNIIMM', niimm_bc, pcols, lchnk)
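
For triaging a trace like the one above, it can help to reduce it to the first frame that lies in model source. The following is a minimal sketch, assuming the two-line frame layout shown in this log (a `#N 0xADDR in SYMBOL` line followed by an `at FILE:LINE` line, each prefixed with the MPI rank). The function names `parse_backtrace` and `demangle` are hypothetical helpers, and the sample text is copied from the log above.

```python
import re

FRAME = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+ in (\S+)")
LOCATION = re.compile(r"\bat (\S+):(\d+)\s*$")
# gfortran names module procedures __<module>_MOD_<procedure>.
MODULE_PROC = re.compile(r"__(\w+?)_MOD_(\w+)")

SAMPLE = """\
47: Backtrace for this error:
47: #0  0x1d3a4cf in ???
47:     at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
47: #1  0xe2c24c in __hetfrz_classnuc_cam_MOD_hetfrz_classnuc_cam_calc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/hetfrz_classnuc_cam.F90:967
47: #2  0xb68ef3 in __microp_aero_MOD_microp_aero_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/microp_aero.F90:852
"""


def parse_backtrace(text):
    """Return one dict per frame: index, symbol, file, line."""
    frames = []
    for line in text.splitlines():
        frame = FRAME.search(line)
        if frame:
            frames.append({"index": int(frame.group(1)), "symbol": frame.group(2),
                           "file": None, "line": None})
            continue
        location = LOCATION.search(line)
        if location and frames:
            # The "at FILE:LINE" line belongs to the frame printed just before it.
            frames[-1]["file"] = location.group(1)
            frames[-1]["line"] = int(location.group(2))
    return frames


def demangle(symbol):
    """Turn __mod_MOD_proc into mod::proc; leave other symbols unchanged."""
    match = MODULE_PROC.fullmatch(symbol)
    return f"{match.group(1)}::{match.group(2)}" if match else symbol


frames = parse_backtrace(SAMPLE)
first_user = next(f for f in frames if f["file"] and f["file"].endswith(".F90"))
print(demangle(first_user["symbol"]), first_user["file"].rsplit("/", 1)[-1], first_user["line"])
```

Frame #0 is the signal handler inside glibc, so the first `.F90` frame is the one worth opening in an editor.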

singhbalwinder · 2017-01-26T19:37:17Z

Hi @ndkeen ,

Would you please resubmit the run with the offending line commented out, to see if the model otherwise runs fine? The same quantities, nidep_bc and nidep_dst, are sent to outfld after this call, so we might find out which one of them is corrupt.

ndkeen · 2017-01-26T22:26:03Z

OK, I had sent a quick message to the NERSC consultants about the other job that failed without any information, and they said it looked like the job ran out of memory. Of course, I asked how they get this valuable information and why we can't have it... But sure enough, that job as well as this one passed when I doubled the number of nodes I asked for (slurm splits the MPI ranks evenly across the nodes, so I effectively get double the memory per rank). I guess it makes sense that adding more debugging flags could use more memory, but it's good to know how close to the edge we are. I will see if it's easy to request more nodes for these problems in the default PE layouts for cori-knl.

So false alarm.
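
The arithmetic behind "double the nodes, double the memory" can be written out as a small sketch. The node memory and rank count below are hypothetical placeholders, not values from this thread or from cori-knl, and `memory_per_rank_gb` is an illustrative helper.

```python
def memory_per_rank_gb(node_memory_gb, nodes, mpi_ranks):
    """Memory available to each MPI rank when ranks are spread evenly over nodes."""
    if mpi_ranks % nodes != 0:
        raise ValueError("ranks do not divide evenly across nodes")
    ranks_per_node = mpi_ranks // nodes
    return node_memory_gb / ranks_per_node


# Hypothetical figures chosen only to make the arithmetic easy to follow.
NODE_MEMORY_GB = 96.0
MPI_RANKS = 256

before = memory_per_rank_gb(NODE_MEMORY_GB, nodes=4, mpi_ranks=MPI_RANKS)  # 64 ranks per node
after = memory_per_rank_gb(NODE_MEMORY_GB, nodes=8, mpi_ranks=MPI_RANKS)   # 32 ranks per node
print(before, after)  # 1.5 3.0
```

The rank count is unchanged, so halving the ranks per node is what doubles the memory each rank can use.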

singhbalwinder · 2017-01-27T18:09:13Z

That is good to know. Thanks Noel!

ndkeen · 2017-02-03T01:34:27Z

Still trying to find a PE layout that makes all tests happy for cori-knl. Running acme_developer on edison now.

All of the tests passed on edison (except that HOMME is now failing to link, which is unrelated to this change).

I can go ahead and merge this to let it be tested on other machines. If it catches something it did not before, that's good, but it could require a little more memory and might cause a failure.

ndkeen · 2017-05-05T16:02:35Z

Again, I forgot to reference this issue in my PR:
#1256

rljacob · 2017-05-05T16:23:22Z

Go back to PR #1256, edit the top comment and add "Fixes #1231". That will help people who come across that PR later.

ndkeen added the Machine Files label Jan 20, 2017

rljacob added the question label Feb 2, 2017

rljacob assigned ndkeen Feb 2, 2017

ndkeen mentioned this issue Feb 5, 2017

Improve default PE layouts on Edison #1241

Merged

ndkeen closed this as completed May 5, 2017

ndkeen mentioned this issue May 5, 2017

Adding flags to GNU DEBUG=TRUE builds #1256

Merged
