Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable FPE traps in CI #611

Merged
merged 42 commits into from
Jul 11, 2024
Merged

enable FPE traps in CI #611

merged 42 commits into from
Jul 11, 2024

Conversation

BenWibking
Copy link
Collaborator

@BenWibking BenWibking commented Apr 17, 2024

Description

Enables signal handling and floating-point exception (FPE) traps for NANs when running the test suite, so it will stop and report an error when NANs are encountered. (This only works on CPUs.)

In order for this to work, we change the signal handling settings back to AMReX defaults. We also have to suppress FPEs when importing numpy (see numpy/numpy#20504).

By default, many compilers perform optimizations that assume that floating-point exceptions are disabled, and will often produce spurious FPEs for vectorized code (however, this does not affect the results in any way). There are compiler-specific options to disable this behavior for Intel and Clang.

Note that AMReX_FPE=ON only sets the compiler option -ftrapv (or similar). We instead add the runtime options amrex.fpe_trap_invalid=1, amrex.fpe_trap_zero=1, and amrex.fpe_trap_overflow=1 to the command-line arguments for each test. These options can also be added to the input file (or command line) to debug individual problems.

Depends on:

Related issues

Partially resolves #556.

Checklist

Before this pull request can be reviewed, all of these tasks should be completed. Denote completed tasks with an x inside the square brackets [ ] in the Markdown source below:

  • I have added a description (see above).
  • I have added a link to any related issues see (see above).
  • I have read the Contributing Guide.
  • I have added tests for any new physics that this PR adds to the code.
  • I have tested this PR on my local computer and all tests pass.
  • I have manually triggered the GPU tests with the magic comment /azp run.
  • I have requested a reviewer for this PR.

@BenWibking
Copy link
Collaborator Author

/azp run

Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@psharda psharda added enhancement New feature or request floating-point issue with floating-point roundoff error CI labels Apr 17, 2024
@psharda
Copy link
Contributor

psharda commented Apr 29, 2024

/azp run

Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@BenWibking BenWibking changed the title enable FPE traps in CI [WIP] enable FPE traps in CI Jun 29, 2024
@BenWibking BenWibking marked this pull request as ready for review July 2, 2024 00:54
@BenWibking BenWibking changed the title [WIP] enable FPE traps in CI enable FPE traps in CI Jul 2, 2024
@BenWibking BenWibking requested a review from markkrumholz July 2, 2024 01:11
@BenWibking BenWibking enabled auto-merge July 2, 2024 01:13
Copy link
Collaborator

@markkrumholz markkrumholz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine, but there is a design question we should consider. Producing a NaN is always an error, but producing underflows or overflows may not be. I'm thinking in particular of chemistry, where we have a lot of rate coefficients that take the form rate ~ exp(-T_activation / T_gas), where T_activation is the activation temperature for some reaction. At low T_gas this will underflow and give a rate coefficient of exactly zero, but that is the desired behavior. Similarly, for multigroup radiation we may well have groups where the exponential term in the Planck function underflows to zero in low-temperature parts of the simulation domain, and that is again the desired behavior. Thus I'm worried that a FPE that triggers on underflow or overflow may be too sensitive. Thoughts on this issue?

@BenWibking
Copy link
Collaborator Author

Code looks fine, but there is a design question we should consider. Producing a NaN is always an error, but producing underflows or overflows may not be. I'm thinking in particular of chemistry, where we have a lot of rate coefficients that take the form rate ~ exp(-T_activation / T_gas), where T_activation is the activation temperature for some reaction. At low T_gas this will underflow and give a rate coefficient of exactly zero, but that is the desired behavior. Similarly, for multigroup radiation we may well have groups where the exponential term in the Planck function underflows to zero in low-temperature parts of the simulation domain, and that is again the desired behavior. Thus I'm worried that a FPE that triggers on underflow or overflow may be too sensitive. Thoughts on this issue?

By design, it should only trap on overflow, not underflow.

However, I think the underlying compiler option this turns on is not actually equivalent to the runtime behavior that is enabled with amrex.fpe_trap_invalid, amrex.fpe_trap_overflow and amrex.fpe_trap_zero (division-by-zero), so I need to re-work this PR and test it explicitly.

@BenWibking BenWibking disabled auto-merge July 2, 2024 13:02
@BenWibking BenWibking marked this pull request as draft July 2, 2024 13:02
@BenWibking
Copy link
Collaborator Author

BenWibking commented Jul 2, 2024

The Intel compiler produces spurious FPEs for optimized code, will have to disable for that.

Ref: https://stackoverflow.com/questions/52644343/replacing-01-with-02-intel-compiler-option-leads-to-throwing-fpe-at-small/52652657#52652657

@BenWibking
Copy link
Collaborator Author

BenWibking commented Jul 2, 2024

May need to turn on -ffp-exception-behavior=maytrap with Clang compilers (i.e., for Intel and macOS tests): https://clang.llvm.org/docs/UsersManual.html#cmdoption-ffp-exception-behavior

@BenWibking
Copy link
Collaborator Author

This is currently failing because of a division by zero in matplotlib:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0x1e611800)
  * frame #0: 0x0000000107b0636c _multiarray_umath.cpython-312-darwin.so`longdouble_true_divide + 364
    frame #1: 0x00000001018b1aac Python`binary_op1 + 88
    frame #2: 0x00000001018b1e9c Python`PyNumber_TrueDivide + 32
    frame #3: 0x00000001019c3988 Python`_PyEval_EvalFrameDefault + 53224
    frame #4: 0x00000001019b6730 Python`PyEval_EvalCode + 184
    frame #5: 0x00000001019b2fe4 Python`builtin_exec + 416
    frame #6: 0x000000010191e4b8 Python`cfunction_vectorcall_FASTCALL_KEYWORDS + 92
    frame #7: 0x00000001019c2f20 Python`_PyEval_EvalFrameDefault + 50560
    frame #8: 0x00000001018cf430 Python`object_vacall + 212
    frame #9: 0x00000001018cf310 Python`PyObject_CallMethodObjArgs + 104
    frame #10: 0x00000001019f5fc0 Python`PyImport_ImportModuleLevelObject + 996
    frame #11: 0x00000001019b1940 Python`builtin___import__ + 172
    frame #12: 0x000000010191e4b8 Python`cfunction_vectorcall_FASTCALL_KEYWORDS + 92
    frame #13: 0x00000001019c2f20 Python`_PyEval_EvalFrameDefault + 50560
    frame #14: 0x00000001018cf430 Python`object_vacall + 212
    frame #15: 0x00000001018cf310 Python`PyObject_CallMethodObjArgs + 104
    frame #16: 0x00000001019f5fc0 Python`PyImport_ImportModuleLevelObject + 996
    frame #17: 0x00000001019b1940 Python`builtin___import__ + 172
    frame #18: 0x000000010191e4b8 Python`cfunction_vectorcall_FASTCALL_KEYWORDS + 92
    frame #19: 0x00000001019c2f20 Python`_PyEval_EvalFrameDefault + 50560
    frame #20: 0x00000001018cf430 Python`object_vacall + 212
    frame #21: 0x00000001018cf310 Python`PyObject_CallMethodObjArgs + 104
    frame #22: 0x00000001019f5fc0 Python`PyImport_ImportModuleLevelObject + 996
    frame #23: 0x00000001019b1940 Python`builtin___import__ + 172
    frame #24: 0x000000010191e4b8 Python`cfunction_vectorcall_FASTCALL_KEYWORDS + 92
    frame #25: 0x00000001018ce81c Python`_PyObject_CallFunctionVa + 152
    frame #26: 0x00000001018ce778 Python`PyObject_CallFunction + 60
    frame #27: 0x00000001019f5a00 Python`PyImport_Import + 192
    frame #28: 0x00000001019f58e4 Python`PyImport_ImportModule + 60
    frame #29: 0x000000010000d124 test_advection`_import_array() at __multiarray_api.h:1482:21
    frame #30: 0x000000010000ce2c test_advection`matplotlibcpp::detail::_interpreter::import_numpy(this=0x0000000100510000) at matplotlibcpp.h:126:3
    frame #31: 0x000000010000b9b4 test_advection`matplotlibcpp::detail::_interpreter::_interpreter(this=0x0000000100510000) at matplotlibcpp.h:160:3
    frame #32: 0x000000010000b8d8 test_advection`matplotlibcpp::detail::_interpreter::_interpreter(this=0x0000000100510000) at matplotlibcpp.h:141:2
    frame #33: 0x000000010000a764 test_advection`matplotlibcpp::detail::_interpreter::get() at matplotlibcpp.h:103:23
    frame #34: 0x000000010000a43c test_advection`_object* matplotlibcpp::get_array<double>(v=size=400) at matplotlibcpp.h:320:2
    frame #35: 0x000000010000a1cc test_advection`bool matplotlibcpp::plot<double>(x=size=400, y=size=400, keywords=size=1) at matplotlibcpp.h:375:21
    frame #36: 0x000000010000529c test_advection`matplotlibcpp::plot(x=size=400, y=size=400, keywords=size=1) at matplotlibcpp.h:1870:9
    frame #37: 0x00000001000046d8 test_advection`AdvectionSimulation<SawtoothProblem>::computeReferenceSolution(this=0x000000016fdfdfb8, ref=0x000000016fdfd1b8, dx=0x000000016fdfd170, prob_lo=0x000000016fdfd168, prob_hi=0x000000016fdfd160) at test_advection.cpp:110:3

@BenWibking
Copy link
Collaborator Author

FPEs do not seem to work on Linux ARM64.

@BenWibking
Copy link
Collaborator Author

/azp run

@BenWibking BenWibking marked this pull request as ready for review July 7, 2024 21:28
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jul 7, 2024
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@BenWibking BenWibking requested a review from psharda July 7, 2024 21:31
@BenWibking
Copy link
Collaborator Author

These GPU tests fail:

The following tests FAILED:
	 30 - BinaryOrbitCIC (Failed)
	 37 - PopIII (Failed)
	 39 - StarCluster (Failed)

@BenWibking BenWibking marked this pull request as draft July 7, 2024 23:06
@BenWibking
Copy link
Collaborator Author

/azp run

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@BenWibking
Copy link
Collaborator Author

BenWibking commented Jul 8, 2024

Last remaining test failures on avatargpu:

The following tests FAILED:
	 37 - PopIII (Failed)
	 39 - StarCluster (Failed)

It appears to fail when reading the turbulence driving field:

CUDA initialized with 1 device.
AMReX (24.06) initialized
Successfully read inputs file ... 
Using derived variables:
	log_density

Creating new distribution map on level: 0
Initializing turbulence data...
data_file: zdrv.hdf5.
Erroneous arithmetic operation
See Backtrace.0 file for details

It's caused by an FPE inside the HDF5 library :/
See HDFGroup/hdf5#3831.

@BenWibking
Copy link
Collaborator Author

/azp run

Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@BenWibking BenWibking marked this pull request as ready for review July 8, 2024 18:30
@BenWibking
Copy link
Collaborator Author

@markkrumholz Can you review this PR?

@BenWibking BenWibking enabled auto-merge July 10, 2024 22:52
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 10, 2024
@BenWibking BenWibking added this pull request to the merge queue Jul 10, 2024
Merged via the queue into development with commit 14469c0 Jul 11, 2024
18 of 19 checks passed
@BenWibking BenWibking deleted the BenWibking/enable-fpe-in-ci branch July 11, 2024 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI enhancement New feature or request floating-point issue with floating-point roundoff error lgtm This PR has been approved by a maintainer size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compile and run test suite with hipcc + FPE traps on CPU
3 participants