Initial Spack env fails with segfault in MOM5 #6

Closed
penguian opened this issue Jun 10, 2024 · 4 comments

@penguian
Collaborator

The ACCESS-ESM1.5 pre-industrial configuration defined by access-esm1.5-configs fails when run with the executables built by the initial Spack environment (spack.yaml on the 2-spack-yaml branch), as tested for access-esm1.5-configs #16. All 180 MOM5 ranks abort with a SIGSEGV segmentation fault. The fault occurs in the HDF5 function H5T__init_native_float_types(), which is reached while opening a NetCDF4 file. In the log below, each rank first reports signal 8 (floating-point invalid operation) and then the SIGSEGV.

[gadi-cpu-clx-1435:3643495:0:3643495] Caught signal 8 (Floating point exception: floating-point invalid operation)
[...]
[gadi-cpu-clx-1434:2689398:0:2689398] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:1270086) ====
==== backtrace (tid:1270077) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000003ac858 H5T__init_native_float_types()  ???:0
 2 0x0000000000310908 H5T_init()  ???:0
 3 0x00000000003cff28 H5VL_init_phase2()  ???:0
 4 0x00000000000659c2 H5_init_library()  ???:0
 5 0x0000000000132ad5 H5Eset_auto2()  ???:0
 6 0x00000000000bbd8c nc4_hdf5_initialize()  ???:0
 7 0x00000000000c504c NC_HDF5_initialize()  ???:0
 8 0x0000000000028da8 nc_initialize()  ???:0
 9 0x000000000002ddfa NC_open()  ???:0
10 0x000000000002de3b nc__open()  ???:0
11 0x00000000000150e1 nf__open_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-netcdf-fortran-4.6.1-22f4qcf67piiovm4vtfrl5g54eb4zfzr/spack-src/fortran/nf_control.F90:228
12 0x000000000164a862 mpp_io_mod_mp_mpp_open_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/mpp/include/mpp_io_connect.inc:510
13 0x000000000143af9c fms_io_mod_mp_get_file_unit_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:5440
14 0x0000000001460d57 fms_io_mod_mp_field_exist_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:5644
15 0x0000000001466bbc fms_io_mod_mp_fms_io_init_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms_io.F90:524
16 0x0000000001400f45 fms_mod_mp_fms_init_()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/shared/fms/fms.F90:335
17 0x0000000000474f4a MAIN__()  /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-mom5-git.access-esm1.5_2024.05.24_access-esm1.5-ttg4y4yt3ddzhjywf5yfiicibk6xkx22/spack-src/src/access_coupler/ocean_solo.F90:219
18 0x0000000000410262 main()  ???:0
19 0x000000000003ad85 __libc_start_main()  ???:0
20 0x000000000041016e _start()  ???:0
[...]
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
fms_ACCESS-CM.x    0000000001B69979  Unknown               Unknown  Unknown
libpthread-2.28.s  000014C0865CACF0  Unknown               Unknown  Unknown
fms_ACCESS-CM.x    0000000001B69D52  Unknown               Unknown  Unknown
libpthread-2.28.s  000014C0865CACF0  Unknown               Unknown  Unknown
libhdf5.so.310.3.  000014C085486858  H5T__init_native_     Unknown  Unknown
libhdf5.so.310.3.  000014C0853EA908  H5T_init              Unknown  Unknown
libhdf5.so.310.3.  000014C0854A9F28  H5VL_init_phase2      Unknown  Unknown
libhdf5.so.310.3.  000014C08513F9C2  H5_init_library       Unknown  Unknown
libhdf5.so.310.3.  000014C08520CAD5  H5Eset_auto2          Unknown  Unknown
libnetcdf.so.19.2  000014C087B1AD8C  nc4_hdf5_initiali     Unknown  Unknown
libnetcdf.so.19.2  000014C087B2404C  NC_HDF5_initializ     Unknown  Unknown
libnetcdf.so.19.2  000014C087A87DA8  nc_initialize         Unknown  Unknown
libnetcdf.so.19.2  000014C087A8CDFA  NC_open               Unknown  Unknown
libnetcdf.so.19.2  000014C087A8CE3B  nc__open              Unknown  Unknown
libnetcdff.so.7.2  000014C0875DA0E1  nf__open_             Unknown  Unknown
fms_ACCESS-CM.x    000000000164A862  mpp_io_mod_mp_mpp         510  mpp_io_connect.inc
fms_ACCESS-CM.x    000000000143AF9C  fms_io_mod_mp_get        5440  fms_io.F90
fms_ACCESS-CM.x    0000000001460D57  fms_io_mod_mp_fie        5644  fms_io.F90
fms_ACCESS-CM.x    0000000001466BBC  fms_io_mod_mp_fms         524  fms_io.F90
fms_ACCESS-CM.x    0000000001400F45  fms_mod_mp_fms_in         335  fms.F90
fms_ACCESS-CM.x    0000000000474F4A  MAIN__                    219  ocean_solo.F90
fms_ACCESS-CM.x    0000000000410262  Unknown               Unknown  Unknown
libc-2.28.so       000014C08622DD85  __libc_start_main     Unknown  Unknown
fms_ACCESS-CM.x    000000000041016E  Unknown               Unknown  Unknown
[...]
@penguian
Collaborator Author

penguian commented Jun 11, 2024

The segfault is possibly caused by a known error introduced in hdf5-1.14.3 that is fixed in hdf5-1.14.4.
See HDFGroup/hdf5#4381 and HDFGroup/hdf5#3831

@penguian
Collaborator Author

penguian commented Jun 11, 2024

The following change in packages/mom5/package.py results in a successful ACCESS-ESM1.5 pre-industrial run:

[pcl851@gadi-login-09 spack-packages]$ git diff
diff --git a/packages/mom5/package.py b/packages/mom5/package.py
index a36c149..6309dde 100644
--- a/packages/mom5/package.py
+++ b/packages/mom5/package.py
@@ -45,6 +45,9 @@ class Mom5(MakefilePackage):
         depends_on("libaccessom2~deterministic", when="~deterministic")
     with when("@access-esm1.5"):
         depends_on("oasis3-mct@access-esm1.5")
+        # Avoid segfault in HDF5 1.14.3
+        # https://github.com/HDFGroup/hdf5/issues/4381
+        depends_on("hdf5@:1.14.2,1.14.4:")
 
     phases = ["edit", "build", "install"]
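
For readers unfamiliar with Spack version ranges, the constraint `hdf5@:1.14.2,1.14.4:` added above is meant to accept any HDF5 release up to and including 1.14.2, or 1.14.4 and later, so that only 1.14.3 is excluded. The following is a rough standalone sketch of that intent. It does not use Spack; the helper names `parse` and `hdf5_version_allowed` are hypothetical, and the comparison only handles plain three-component versions such as `1.14.3`, whereas Spack's own range matching is more general.

```python
# Illustrative only: mimics the intent of "hdf5@:1.14.2,1.14.4:" for
# plain "x.y.z" version strings. This is NOT Spack's implementation.


def parse(version: str) -> tuple:
    """Turn a dotted version string such as '1.14.3' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def hdf5_version_allowed(version: str) -> bool:
    """Return True if the version falls in ':1.14.2' or in '1.14.4:'."""
    v = parse(version)
    return v <= parse("1.14.2") or v >= parse("1.14.4")


if __name__ == "__main__":
    for candidate in ("1.12.3", "1.14.2", "1.14.3", "1.14.4", "1.14.5"):
        status = "allowed" if hdf5_version_allowed(candidate) else "excluded"
        print(f"hdf5@{candidate}: {status}")
```

Under these assumptions, 1.14.3 is the only version in the list reported as excluded, which matches the purpose stated in the diff comment.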

@CodeGat
Contributor

CodeGat commented Jul 4, 2024

Can we close this issue then @penguian ?

@penguian
Collaborator Author

penguian commented Jul 4, 2024

Closed by #5 (when merged).

@penguian closed this as completed Jul 4, 2024