Should we add floating-point traps to our DEBUG GNU builds? -ffpe-trap=invalid,zero,overflow #1231

Closed
ndkeen opened this issue Jan 20, 2017 · 10 comments

Comments

@ndkeen
Contributor

ndkeen commented Jan 20, 2017

For Debug Intel builds, we use the -fpe0 flag which will stop the code on invalid, divide-by-zero, and overflow. However, I don't see these traps enabled for GNU DEBUG builds.
Should we add: -ffpe-trap=invalid,zero,overflow ?

This is from the man page for GNU fortran:

-ffpe-trap=list

Specify a list of floating point exception traps to enable. On most
systems, if a floating point exception occurs and the trap for that
exception is enabled, a SIGFPE signal will be sent and the program
being aborted, producing a core file useful for debugging. list is a
(possibly empty) comma-separated list of the following exceptions:
`invalid' (invalid floating point operation, such as SQRT(-1.0)),
`zero' (division by zero), `overflow' (overflow in a floating point
operation), `underflow' (underflow in a floating point operation),
`inexact' (loss of precision during operation), and `denormal'
(operation performed on a denormal value). The first five exceptions
correspond to the five IEEE 754 exceptions, whereas the last one
(`denormal') is not part of the IEEE 754 standard but is available on
some common architectures such as x86.

The first three exceptions (`invalid', `zero', and `overflow') often
indicate serious errors, and unless the program has provisions for
dealing with these exceptions, enabling traps for these three
exceptions is probably a good idea.

Many, if not most, floating point operations incur loss of precision
due to rounding, and hence the ffpe-trap=inexact is likely to be
uninteresting in practice.

By default no exception traps are enabled.

This is from the man page for Intel Fortran:

-fpe0:

Floating-point invalid, divide-by-zero, and overflow exceptions are
enabled throughout the application when the main program is compiled
with this value. If any such exceptions occur, execution is
aborted. This option causes denormalized floating-point results to be
set to zero. Underflow results will also be set to zero, unless you
override this by explicitly specifying option no-ftz or -fp-model
precise (Linux* and OS X*) or option /Qftz- or /fp:precise (Windows*).

Underflow results from SSE instructions, as well as x87 instructions,
will be set to zero. By contrast, option [Q]ftz only sets SSE
underflow results to zero.

Sets option -fp-speculation=strict (Linux* and OS X*) or
/Qfp-speculation:strict (Windows*) for any program unit compiled with
-fpe0 (Linux* and OS X*) or /fpe:0 (Windows*). This disables certain
optimizations in cases where speculative execution of floating-point
operations could lead to floating-point exceptions that would not
occur in the absence of speculation.  For example, this may prevent
the vectorization of some loops containing conditionals.

To get more detailed location information about where the error
occurred, use option traceback.

The default is -fpe3:
All floating-point exceptions are disabled.  Floating-point
underflow is gradual, unless you explicitly specify a compiler option
that enables flush-to-zero, such as [Q]ftz, O3, or O2. This setting
provides full IEEE support.
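
To illustrate why these three conditions are worth trapping, here is a minimal sketch in Python. Python is used only because its floats are IEEE 754 doubles and no compiler flags are needed; this is not how gfortran or ifort implement trapping. The helper name `classify` is hypothetical. Without traps, the invalid and overflow cases simply return NaN or Inf and the program carries on, which is the behavior the flags are meant to catch.

```python
import math


def classify(operation):
    """Run a float operation and report which IEEE 754 condition it hit."""
    try:
        result = operation()
    except ZeroDivisionError:
        return "zero"        # Python raises on float division by zero
    if math.isnan(result):
        return "invalid"     # NaN is returned silently
    if math.isinf(result):
        return "overflow"    # Inf produced from finite operands, also silent
    return "ok"


inf = float("inf")
print(classify(lambda: inf - inf))      # invalid operation
print(classify(lambda: 1.0 / 0.0))      # division by zero
print(classify(lambda: 1e308 * 10.0))   # overflow
print(classify(lambda: 1.0 + 2.0))      # ordinary result
```

Note that the "overflow" label is only meaningful here because the operands in that case are finite.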
@mt5555
Contributor

mt5555 commented Jan 21, 2017

I vote yes; we should enable this for GNU.

@golaz
Contributor

golaz commented Jan 23, 2017

I'm in favor as well.

@rljacob
Member

rljacob commented Jan 23, 2017

Yes. I don't see a problem with making this the default in the gnu entry for config_compilers.xml.
I think we just need to change this line:
<ADD_FFLAGS DEBUG="TRUE"> -g -Wall </ADD_FFLAGS>
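
As a rough sketch of what that change could look like, the snippet below appends the trap flag to the DEBUG Fortran flags in a small XML fragment and prints the result. Only the `ADD_FFLAGS` line comes from the comment above; the enclosing `<compiler COMPILER="gnu">` element and the helper name `add_debug_flag` are assumptions made for illustration, and the real file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the ADD_FFLAGS line is taken from the thread;
# the enclosing <compiler> element is assumed for illustration.
FRAGMENT = """<compiler COMPILER="gnu">
  <ADD_FFLAGS DEBUG="TRUE"> -g -Wall </ADD_FFLAGS>
</compiler>"""

TRAP_FLAG = "-ffpe-trap=invalid,zero,overflow"


def add_debug_flag(xml_text, flag):
    """Append `flag` to every ADD_FFLAGS element marked DEBUG="TRUE"."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("ADD_FFLAGS"):
        if elem.get("DEBUG") != "TRUE":
            continue
        flags = (elem.text or "").split()
        if flag not in flags:  # keep the edit idempotent
            flags.append(flag)
        elem.text = " " + " ".join(flags) + " "
    return ET.tostring(root, encoding="unicode")


updated = add_debug_flag(FRAGMENT, TRAP_FLAG)
print(updated)
```

Running the helper a second time on its own output leaves the flags unchanged, so it is safe to apply repeatedly.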

@ndkeen
Contributor Author

ndkeen commented Jan 26, 2017

I noted that we were light on debug flags for GNU, so I added the ones mentioned above as well as -Og -fbacktrace -fcheck=bounds, such that a debug build for GNU would look like:

ifeq ($(DEBUG), TRUE)
   FFLAGS +=  -g -Wall -Og -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow
   CFLAGS +=  -g -Wall -Og -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow
endif

I read that -Og differs from -O0 in that -Og applies optimizations that should not affect debugging. I re-ran acme_dev; obviously, only its DEBUG=TRUE tests would be affected.
One test ran out of time (resubmitted with more time). One test stopped for no apparent reason, so I submitted it again. However, one test did fail with an FP exception:

47:  rtm decomp info proc =           47  begr =       128482  endr =       131186  numr =         2705
47:                  proc =           47  begrl=        64763  endrl=        65406  numrl=          644
47:                  proc =           47  begro=        63720  endro=        65780  numro=         2061
47: 
47: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
47: 
47: Backtrace for this error:
47: #0  0x1d3a4cf in ???
47:     at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
47: #1  0xe2c24c in __hetfrz_classnuc_cam_MOD_hetfrz_classnuc_cam_calc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/hetfrz_classnuc_cam.F90:967
47: #2  0xb68ef3 in __microp_aero_MOD_microp_aero_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/microp_aero.F90:852
47: #3  0x5d6459 in tphysbc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/physpkg.F90:2407
47: #4  0x5de82a in __physpkg_MOD_phys_run1
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/physpkg.F90:1010
47: #5  0x4db075 in __cam_comp_MOD_cam_run1
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/control/cam_comp.F90:250
47: #6  0x4d4123 in __atm_comp_mct_MOD_atm_run_mct
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/cpl/atm_comp_mct.F90:522
47: #7  0x42ac24 in __component_mod_MOD_component_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/component_mod.F90:653
47: #8  0x4192b7 in __cesm_comp_mod_MOD_cesm_run
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/cesm_comp_mod.F90:3251
47: #9  0x429c2c in cesm_driver
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/cime/driver_cpl/driver/cesm_driver.F90:67

Here is the code:

   call outfld('BCFREZIMM', nnuccc_bc, pcols, lchnk)
   call outfld('BCFREZCNT', nnucct_bc, pcols, lchnk)
   call outfld('BCFREZDEP', nnudep_bc, pcols, lchnk)

   call outfld('NIMIX_IMM', niimm_bc+niimm_dst, pcols, lchnk)
   call outfld('NIMIX_CNT', nicnt_bc+nicnt_dst, pcols, lchnk)   
   call outfld('NIMIX_DEP', nidep_bc+nidep_dst, pcols, lchnk) !ndk SIGFPE: Floating-point exception - erroneous arithmetic operation. GNU/KNL

   call outfld('DSTNICNT', nicnt_dst, pcols, lchnk)
   call outfld('DSTNIDEP', nidep_dst, pcols, lchnk)
   call outfld('DSTNIIMM', niimm_dst, pcols, lchnk)

   call outfld('BCNICNT', nicnt_bc, pcols, lchnk)
   call outfld('BCNIDEP', nidep_bc, pcols, lchnk)
   call outfld('BCNIIMM', niimm_bc, pcols, lchnk)
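
When many test logs need to be scanned for traps like the one above, the backtrace can be reduced to a list of frames. The following is a minimal sketch that assumes the log keeps the layout shown in this comment (a `#N 0x... in symbol` line followed by an `at file:line` line, each behind a rank prefix); the function name `parse_backtrace` and the regular expressions are illustrative, not part of any existing tool.

```python
import re

FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+ in (\S+)")
LOCATION_RE = re.compile(r"\bat (\S+):(\d+)\s*$")


def parse_backtrace(log_text):
    """Return (frame, symbol, file, line) tuples from a gfortran-style backtrace."""
    frames = []
    pending = None  # frame header still waiting for its "at file:line" line
    for line in log_text.splitlines():
        header = FRAME_RE.search(line)
        if header:
            pending = (int(header.group(1)), header.group(2))
            continue
        location = LOCATION_RE.search(line)
        if location and pending is not None:
            frames.append(pending + (location.group(1), int(location.group(2))))
            pending = None
    return frames


# Two frames copied from the log above.
SAMPLE = """\
47: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
47: #0  0x1d3a4cf in ???
47:     at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
47: #1  0xe2c24c in __hetfrz_classnuc_cam_MOD_hetfrz_classnuc_cam_calc
47:     at /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_gnu-fpe-trap/components/cam/src/physics/cam/hetfrz_classnuc_cam.F90:967
"""

for frame, symbol, path, lineno in parse_backtrace(SAMPLE):
    # Print only the file name to keep the summary short.
    print(frame, symbol, path.rsplit("/", 1)[-1], lineno)
```

Frame #1 is the first one inside model code, so it is usually the place to start looking.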

@singhbalwinder
Contributor

Hi @ndkeen ,

Would you please resubmit the run with the offending line commented out to see whether the model otherwise runs fine? The same quantities, nidep_bc and nidep_dst, are sent to outfld after this call, so we might find out which one of them is corrupt.

@ndkeen
Contributor Author

ndkeen commented Jan 26, 2017

OK, I had sent a quick message to the NERSC consultants regarding the other job that failed without info, and they said it looked like the job ran out of memory. Of course, I asked how they are getting this valuable information and why we can't have it... But sure enough, that job as well as this one passed when I doubled the number of nodes I asked for (slurm will evenly split up the MPI tasks, so I effectively get double the memory). I guess it makes sense that adding more debugging flags could use more memory, but it is good to know how close to the edge we are. I will see if it's easy to request more nodes for these problems in the default PE layouts for cori-knl.

So false alarm.
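
The arithmetic behind "double the nodes, double the memory" can be sketched as follows. The node memory and task count are made-up numbers for illustration, not values from the failing jobs, and the helper `memory_per_rank_gb` is hypothetical.

```python
def memory_per_rank_gb(node_memory_gb, total_ranks, nodes):
    """Memory available to each MPI task when tasks are spread evenly over nodes."""
    ranks_per_node = total_ranks / nodes
    return node_memory_gb / ranks_per_node


# Hypothetical numbers, not taken from the failing jobs.
NODE_MEMORY_GB = 96.0
TOTAL_RANKS = 256

before = memory_per_rank_gb(NODE_MEMORY_GB, TOTAL_RANKS, nodes=4)
after = memory_per_rank_gb(NODE_MEMORY_GB, TOTAL_RANKS, nodes=8)
print(before, after, after / before)
```

With the task count fixed, the per-task share scales linearly with the number of nodes, which is why a job near the memory limit can pass after the node count is doubled.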

@singhbalwinder
Contributor

That is good to know. Thanks Noel!

@ndkeen
Contributor Author

ndkeen commented Feb 3, 2017

Still trying to find a PE layout that makes all tests happy for cori-knl. Running acme_developer on edison now.

All of the tests passed on edison (except HOMME is now failing to link, but unrelated to this change).

I can go ahead and merge this to let it be tested on other machines. If it catches something it did not before, that's good, but it could require a little more memory and might cause a fail.

agsalin pushed a commit that referenced this issue Apr 13, 2017
Mvertens/drv flds in
added mct/cime_config/namelist_definition_drv_flds.xml with updated schema

removed bld directory
updated schema for namelist_definition_drv_flds.xml
put in error check that there are no duplicate entries in the drv drv_flds_in that have different values - verified that this works by having CLM change the same namelists that are set by CAM in drv_flds_in
In addition to the scripts regression test - verified that the following tests were bfb with cesm2_0_alpha06f

SMS_D_Ln9.f09_f09.FWAMIP.yellowstone_intel.cam-reduced_hist3s
ERS_Ld7.f19_g16.B1850.yellowstone_intel.allactive-defaultio
ERP_Ln9.f09_f09.FC55CLUBB.yellowstone_intel.cam-outfrq9s

Test suite: scripts_regression_test
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1217

User interface changes?: None

Code review: jedwards
@ndkeen
Contributor Author

ndkeen commented May 5, 2017

Again, forgot to reference this issue with my PR.
#1256

@ndkeen ndkeen closed this as completed May 5, 2017
@rljacob
Member

rljacob commented May 5, 2017

Go back to PR #1256, edit the top comment and add "Fixes #1231". That will help people who come across that PR later.

jgfouca pushed a commit that referenced this issue Jan 8, 2018
0f241db response to comments
1007a7a cannot predetermin ndims here
99ef07d Merge pull request #1241 from NCAR/free_new_allocs
29ed162 free recently allocated vars
fbc3584 Merge pull request #1239 from NCAR/dontuse_nc_max
63dee3d Merge pull request #1240 from NCAR/limitto2GiB
64f2492 limit to 2GiB due to romio bug
29aee05 dont use NC_MAX values
d831ad3 Merge pull request #1231 from mgduda/mpi_type_fix
e996bdb Merge pull request #1222 from NCAR/ejh_autoconf_logging
426af22 Partial fix for incorrect type of 'mpi_type' in pioc_support.c
7eb724f added enable-logging option to autotools build

git-subtree-dir: src/externals/pio2
git-subtree-split: 0f241db88cfee1912a2769a052dba0d2d79f83d5
rljacob pushed a commit that referenced this issue Apr 12, 2021

5 participants