
Segmentation Fault Issue while running the latest image created using four Dockerfiles (arm) #21

Closed
arpita0911patel opened this issue Aug 22, 2023 · 26 comments
Labels: bug (Something isn't working)


arpita0911patel commented Aug 22, 2023

Steps to reproduce the issue:

  1. Built the images using the four Dockerfiles and tagged them as latest, as shown below:

docker build --platform linux/arm64 --tag awiciroh/ngen-deps:latest -f Dockerfile.ngen-deps .
docker build --platform linux/arm64 --tag awiciroh/t-route:latest -f Dockerfile.t-route .
docker build --platform linux/arm64 --tag awiciroh/ngen:latest -f Dockerfile.ngen .
docker build --platform linux/arm64 --tag awiciroh/ciroh-ngen-image:latest -f Dockerfile .

docker push awiciroh/ngen-deps:latest
docker push awiciroh/t-route:latest
docker push awiciroh/ngen:latest
docker push awiciroh/ciroh-ngen-image:latest

  2. Input data used: our sample data on the S3 bucket: $ wget --no-parent https://ciroh-ua-ngen-data.s3.us-east-2.amazonaws.com/AWI-001/AWI_03W_113060_001.tar.gz

  3. Updated guide.sh to use the latest image.

  4. While trying to run the guide.sh script on a Mac laptop, it pulls the latest image but fails with the error below:

[screenshot of the segmentation fault error attached]

@arpita0911patel arpita0911patel added the bug Something isn't working label Aug 22, 2023

arpita0911patel commented Aug 22, 2023

After uncommenting the test lines, we see the following error while trying to build the ngen image using Dockerfile.ngen:

error : #0 248.1 The following tests FAILED:
#0 248.1 550 - RoutingPyBindTest.TestRoutingPyBind (Failed)

[screenshot of the test failure attached]

@arpita0911patel arpita0911patel changed the title Segmentation Fault Issue while running the latest image created using four Dockerfiles Segmentation Fault Issue while running the latest image created using four Dockerfiles (arm) Aug 22, 2023
@benlee0423

@arpita0911patel
This test error was already discussed in the Slack channel at the end of May. Nel mentioned it is due to RELATIVE paths, which break down when running CTest from the root. A fix was accepted and merged upstream, then reverted because one of the OWP GitHub Actions failed.
https://cirohworkspace.slack.com/archives/C040S06TJG1/p1685455201170639

@arpita0911patel

We now have access to GitHub large runners, so we will test it there and update this ticket.


JoshCu commented Oct 13, 2023

I've reproduced this and discovered the following:

Hardware:

arm64

mac mini M2 Pro
22.5.0 Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:23 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6020 arm64

amd64/x86

dell 7810 dual xeon e5-2697 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Dockerhub Images

awiciroh/ciroh-ngen-image:latest (arm64)

Segfaults when routing begins

awiciroh/ciroh-ngen-image:latest-arm (arm64)

Runs, but output csv files contain date times, then three columns of zeros

awiciroh/ciroh-ngen-image:latest-x86 (amd64)

Runs, but output csv files only contain one column of data.

Building from scratch

Neither arm64 nor amd64/x86 builds successfully.
The Dockerfiles used were from the main branch (this version).
For both architectures, I ran the following commands (substituting linux/amd64 on the Xeon machine):

sudo docker build --platform linux/arm64 --tag awiciroh/ngen-deps:latest -f Dockerfile.ngen-deps .
sudo docker build --platform linux/arm64 --tag awiciroh/t-route:latest -f Dockerfile.t-route .
sudo docker build --platform linux/arm64 --tag awiciroh/ngen:latest -f Dockerfile.ngen .

Both failed to build the ngen Dockerfile at the same step (21/21):
amd64_x86.log
arm64.log

I'm going to break the chained bash command down into multiple docker RUN commands, then see if I can fix the issue via an interactive shell.
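Splitting the chain can look like this (illustrative fragment, not the actual Dockerfile contents; the steps echo the chained command quoted later in this thread):

```dockerfile
# Each step becomes its own cached layer: the build stops at the first
# failing RUN, and the preceding layers stay cached so an interactive
# shell can be opened just before the failure.
RUN pip3 install -r extern/test_bmi_py/requirements.txt
RUN ./build_sub extern/iso_c_fortran_bmi
RUN if [ "${BUILD_CFE}" = "true" ]; then ./build_sub extern/cfe; fi
RUN cmake -B cmake_build_serial -S . -DMPI_ACTIVE:BOOL=OFF
RUN cmake --build cmake_build_serial --target all -j "$(nproc)"
```

The tradeoff is more layers and slightly larger intermediate images, which can be squashed again once the failing step is found.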

@arpita0911patel

It might be pulling the ngen-deps image from Docker Hub when building the t-route image, and likewise pulling the Docker Hub t-route image when building the ngen image.

@arpita0911patel

I will try building the latest images and pushing them to Docker Hub.

@arpita0911patel

I am seeing the same error as you.

Arpita-Mac : Darwin UA-W2RP43G 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul  5 22:22:05 PDT 2023; root:xnu-8796.141.3~6/RELEASE_ARM64_T6000 arm64
 
Getting the same error while trying to build the ngen image:

ERROR: failed to solve: process "/bin/sh -c cd ${WORKDIR}/ngen     && if [ "${NGEN_ACTIVATE_PYTHON}" == "ON" ]; then         pip3 install -r extern/test_bmi_py/requirements.txt;         if [ "${NGEN_ROUTING_ACTIVE}" == "ON" ] ; then             pip3 install /tmp/t-route-wheels/.whl;             pip3 install -r /tmp/t-route-requirements.txt;             pip3 install deprecated geopandas ;             fi;         fi     &&  if [ "${NGEN_ACTIVATE_FORTRAN}" == "ON" ]; then                 ./build_sub extern/iso_c_fortran_bmi;                 if [ "${BUILD_NOAH_OWP}" == "true" ] ; then ./build_sub extern/noah-owp-modular; fi;         fi     &&  if [ "${NGEN_ACTIVATE_C}" == "ON" ]; then                 if [ "${BUILD_CFE}" == "true" ] ; then ./build_sub extern/cfe; fi;                 if [ "${BUILD_PET}" == "true" ] ; then ./build_sub extern/evapotranspiration/evapotranspiration; fi;                 if [ "${BUILD_TOPMODEL}" == "true" ] ; then ./build_sub extern/topmodel; fi;         fi     && if [ "${BUILD_SLOTH}" == "true" ] ; then ./build_sub extern/sloth; fi     && if [ "${BUILD_NGEN_SERIAL}" == "true" ]; then         cmake -B cmake_build_serial -S .         -DMPI_ACTIVE:BOOL=OFF         -DNETCDF_ACTIVE:BOOL=${NGEN_NETCDF_ACTIVE}         -DBMI_C_LIB_ACTIVE:BOOL=${NGEN_ACTIVATE_C}         -DBMI_FORTRAN_ACTIVE:BOOL=${NGEN_ACTIVATE_FORTRAN}         -DNGEN_ACTIVATE_PYTHON:BOOL=${NGEN_ACTIVATE_PYTHON}         -DNGEN_ACTIVATE_ROUTING:BOOL=${NGEN_ROUTING_ACTIVE}         -DUDUNITS_ACTIVE:BOOL=${NGEN_UDUNITS_ACTIVE}         -DUDUNITS_QUIET:BOOL=${NGEN_UDUNITS_QUIET}         -DCMAKE_INSTALL_PREFIX=${WORKDIR}         -DNETCDF_INCLUDE_DIR=/usr/include         -DNETCDF_LIBRARY=/usr/lib/libnetcdf.so         -DNETCDF_CXX_INCLUDE_DIR=/usr/local/include         -DNETCDF_CXX_LIBRARY=/usr/local/lib64/libnetcdf-cxx4.so ;     fi     && if [ "${BUILD_NGEN_PARALLEL}" == "true" ]; then         cmake -B cmake_build_parallel -S .         
-DMPI_ACTIVE:BOOL=ON         -DNETCDF_ACTIVE:BOOL=${NGEN_NETCDF_ACTIVE}         -DBMI_C_LIB_ACTIVE:BOOL=${NGEN_ACTIVATE_C}         -DBMI_FORTRAN_ACTIVE:BOOL=${NGEN_ACTIVATE_FORTRAN}         -DNGEN_ACTIVATE_PYTHON:BOOL=${NGEN_ACTIVATE_PYTHON}         -DNGEN_ACTIVATE_ROUTING:BOOL=${NGEN_ROUTING_ACTIVE}         -DUDUNITS_ACTIVE:BOOL=${NGEN_UDUNITS_ACTIVE}         -DUDUNITS_QUIET:BOOL=${NGEN_UDUNITS_QUIET}         -DCMAKE_INSTALL_PREFIX=${WORKDIR}         -DNETCDF_INCLUDE_DIR=/usr/include         -DNETCDF_LIBRARY=/usr/lib/libnetcdf.so         -DNETCDF_CXX_INCLUDE_DIR=/usr/local/include         -DNETCDF_CXX_LIBRARY=/usr/local/lib64/libnetcdf-cxx4.so ;     fi     && ln -s $(if [ "${BUILD_NGEN_PARALLEL}" == "true" ]; then echo "cmake_build_parallel"; else echo "cmake_build_serial"; fi) cmake_build     && ./build_sub extern/test_bmi_cpp     &&  if [ "${NGEN_ACTIVATE_C}" == "ON" ]; then             ./build_sub extern/test_bmi_c;         fi     &&  if [ "${NGEN_ACTIVATE_FORTRAN}" == "ON" ]; then             ./build_sub extern/test_bmi_fortran;         fi     &&  for BUILD_DIR in $(if [ "${BUILD_NGEN_PARALLEL}" == "true" ]; then echo "cmake_build_parallel"; fi) $(if [ "${BUILD_NGEN_SERIAL}" == "true" ]; then echo "cmake_build_serial"; fi) ; do         cmake --build $BUILD_DIR --target all -j $(nproc);     done     && cd ${WORKDIR}/ngen     && rm -f ./test/data/routing/.parquet     && rm -f ./test/data/routing/.parquet     && mpirun -n 2 cmake_build_parallel/test/test_remote_nexus     && mpirun -n 3 cmake_build_parallel/test/test_remote_nexus     && mpirun -n 4 cmake_build_parallel/test/test_remote_nexus     && find cmake_build -type f -name "" ! \( -name ".so" -o -name "ngen" -o -name "partitionGenerator" \) -exec rm {} +" did not complete successfully: exit code: 2
 


JoshCu commented Oct 13, 2023

It's exactly this #21 (comment)

The makefile at /ngen/ngen/extern/test_bmi_cpp/cmake_build/CMakeFiles/testbmicppmodel.dir/build.make
uses relative paths that expect you to be in the /ngen/ngen/extern/test_bmi_cpp/cmake_build/ folder,
but the build is executed from /ngen/ngen/cmake_build_parallel (or cmake_build_serial).
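As a minimal demo of that failure mode (hypothetical /tmp paths, not the real ngen tree): a rule written with a relative path like ../src/foo.txt resolves against the current working directory, so it only works when invoked from inside the expected build folder.

```shell
# Set up a toy project: a build dir and a source dir side by side.
mkdir -p /tmp/rpdemo/proj/cmake_build /tmp/rpdemo/proj/src
echo data > /tmp/rpdemo/proj/src/foo.txt
# From inside cmake_build, the relative path resolves correctly.
( cd /tmp/rpdemo/proj/cmake_build && cat ../src/foo.txt )
# From the project root (like ngen's top-level cmake_build_parallel),
# the same relative path points somewhere else entirely and fails.
( cd /tmp/rpdemo/proj && cat ../src/foo.txt 2>/dev/null ) \
  || echo "fails when run from the project root"
```

This mirrors why the generated build.make breaks when CMake drives it from the top-level build directory rather than the subproject's own.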

I've managed to patch the makefile manually and build ngen, but only inside the running docker container.

I'll update the dockerfile to work around this properly momentarily.
[screenshot attached]


JoshCu commented Oct 13, 2023

I still need to test the functionality of the image, but it's fixed enough to build the docker image for x86 in my fork
[screenshots of the successful x86 build attached]

I'll continue to look into this to make sure it works for arm64 over the weekend.
I'll also see if I can find a better way to fix this; editing makefiles manually isn't a good way to resolve it.

For ARM, I suspect the segfault was caused by the known issue with running ngen t-route mentioned here.

I'll have a look at building the image with multiprocessing disabled and configuring the dockerfile to use a separate installation of t-route.

@hellkite500

These makefiles are generated by CMake; I would strongly suggest not relying on manual edits as a workaround.

I suspect the cmake build command can be adjusted to fix this. There have been some upstream changes to the build system to simplify some things and unify build options. Something may have changed enough to make the build in these Dockerfiles stop working correctly.


TrupeshKumarPatel commented Nov 17, 2023

@hellkite500, I am attaching the log from the gdb run. I hope this helps. It seems this line is raising a SIGSEGV signal, i.e. a segmentation fault.

gdb.txt

In the log, you can see at the bottom that I was able to print the std::string t_route_config_file_with_path value, which is $1 = "/ngen/ngen/data/config/ngen.yaml", but immediately after that we get a segmentation fault.

Now, looking at line #305, it seems it's trying to construct a unique pointer but fails to do so.

@hellkite500

Can you generate a backtrace from gdb?
After the fault, just type bt at the gdb prompt.

@TrupeshKumarPatel

Here is the log with the backtrace:
gdb.txt

@hellkite500

Thanks for that stack trace! I have seen this issue pop up in another environment as well, and the "fix" isn't clear. This is neither an ngen- nor a t-route-specific error as far as I can tell.

The stack trace shows the last few calls as

#0  0x00007ffff746a2fc in __strlen_evex () from /lib64/libc.so.6
#1  0x00007ffff744dd43 in strdup () from /lib64/libc.so.6
#2  0x00007ffff7daffe6 in NC_rcfile_insert () from /lib64/libnetcdf.so.19
#3  0x00007fffd3dadb0b in nc_rc_set ()
   from /usr/local/lib64/python3.9/site-packages/netCDF4/../netCDF4.libs/libnetcdf-e7f569b4.so.19
#4  0x00007fffd8137b05 in ?? ()
   from /usr/local/lib64/python3.9/site-packages/netCDF4/_netCDF4.cpython-39-x86_64-linux-gnu.so

t-route imports and uses the python netcdf module, which is an extension module linked to the .so in frame #4. This in turn appears to interface with a netCDF C library built and distributed with the python wheel (in the Python site-packages). For whatever reason, when this module is initialized at import, it tries to establish some SSL cert paths and put some information in a netCDF config file (presumably to support reading https URLs). In the process, the backend netCDF C library calls strdup to COPY a string (a set of bytes) from an input buffer, and this buffer is not properly allocated, causing a segmentation fault.

There is a small possibility that something in the ngen initialization causes enough stack corruption to trigger this issue, but I haven't seen any other evidence of this. Until we find a better root cause, it might be worth upgrading or downgrading the python netCDF version in your environment to see if this issue exists in other versions.

It would also help if you could add to this thread what your Python/netCDF environment, versions, etc. look like.
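For anyone collecting those details, the versions can be captured with a few standard commands (illustrative; this assumes python3/pip3 are the same interpreter the t-route install used, and the suggested pin version is only an example, not a known fix):

```shell
# Capture the environment details requested above.
python3 --version
pip3 list 2>/dev/null | grep -i netcdf || true
nc-config --version 2>/dev/null || echo "nc-config not on PATH"
# To experiment with a different python netCDF4 build, pin a version, e.g.
#   pip3 install 'netCDF4==1.6.4'   # 1.6.4 is only an example version
```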

@TrupeshKumarPatel

command: nc-config --all

nc-config --all

This netCDF 4.8.1 has been built with the following features:

  --cc            -> gcc
  --cflags        -> -I/usr/include -I/usr/include/hdf -DH5_USE_110_API
  --libs          -> -L/usr/lib64 -lnetcdf
  --static        -> -ljpeg -lmfhdf -ldf -ljpeg -lhdf5_hl -lhdf5 -lm -lz -lcurl -ltirpc

  --has-c++       -> no
  --cxx           ->

  --has-c++4      -> yes
  --cxx4          -> /usr/bin/c++
  --cxx4flags     -> -I/usr/local/include
  --cxx4libs      -> -L/usr/local/lib64 -lnetcdf-cxx4 -lnetcdf

  --has-fortran   -> yes
  --fc            -> gfortran
  --fflags        -> -I/usr/lib64/gfortran/modules
  --flibs         -> -lnetcdff
  --has-f90       ->
  --has-f03       -> yes

  --has-dap       -> yes
  --has-dap2      -> yes
  --has-dap4      -> yes
  --has-nc2       -> yes
  --has-nc4       -> yes
  --has-hdf5      -> yes
  --has-hdf4      -> yes
  --has-logging   -> no
  --has-pnetcdf   -> no
  --has-szlib     -> yes
  --has-cdf5      -> yes
  --has-parallel4 -> no
  --has-parallel  -> no
  --has-nczarr    -> yes

  --prefix        -> /usr
  --includedir    -> /usr/include
  --libdir        -> /usr/lib64
  --version       -> netCDF 4.8.1

command: python --version

python --version
Python 3.9.16

Could you tell what version requirements are for all the packages?

@hellkite500

Interesting... The python lib isn't using the system netcdf library. Can you also do pip list | grep netcdf?

@TrupeshKumarPatel

command: pip list | grep netCDF

pip list | grep netCDF
netCDF4                        1.6.5

@hellkite500

Can you try a build based on #45? This might resolve the problem...


JoshCu commented Nov 18, 2023

I believe this segfault is technically resolved, but there's still more to do, as I only have it working on x86, and only serially without MPI.

TL;DR: I made a new set of Dockerfiles that use Rocky Linux 9.2 and the latest commits to main/master of t-route and ngen, and updated ngen.yaml in the example files to work with t-route V4, on my fork here.


In the image, I write the pip freeze and rpm -qa output to files so we have an easily accessible record of the package versions.

Current Working Versions:

If you want to try out what I've got so far (it absolutely will not build on ARM; the first file contains a link to an x86 .so):

  1. Clone the repository and checkout the simplify_docker branch:
    git clone https://github.com/JoshCu/NGIAB-CloudInfra/tree/simplify_docker
    git checkout simplify_docker
    (simplify was presumptuous of me.)
  2. Download the AWI tarball as mentioned in the readme.
  3. Unzip it and create directories:
    mkdir AWI_03W_113060_001/lakeout
    mkdir AWI_03W_113060_001/restart
  4. Run ./guide.sh. It will pull my latest built image and run it. The image can be found here on Docker Hub.
    • There's also a joshcu/ngiab-dev:full image that doesn't have the final stage to reduce the image size. It's a huge image, but includes every generated file and package used to build ngen and troute. Currently it's still uploading, hopefully it should be done by the time anyone sees this.
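Steps 1-4 above, collected into one sequence (the clone URL in step 1 carries a link-rewriter artifact, so the plain github.com host is assumed here; the tarball is the sample data linked earlier in this thread):

```shell
# Steps 1-4 as one sequence (repo host assumed; data URL from this thread).
git clone https://github.com/JoshCu/NGIAB-CloudInfra
cd NGIAB-CloudInfra
git checkout simplify_docker
wget --no-parent https://ciroh-ua-ngen-data.s3.us-east-2.amazonaws.com/AWI-001/AWI_03W_113060_001.tar.gz
tar -xzf AWI_03W_113060_001.tar.gz
mkdir -p AWI_03W_113060_001/lakeout AWI_03W_113060_001/restart
./guide.sh
```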

If you want to build the image locally:

  • CD into the "docker" folder and run the following commands:
    docker build -f Dockerfile.t-route -t local/t-route_wheel --target t-route_wheel . --no-cache
    docker build -f Dockerfile.ngen -t local/ngen . --no-cache
    docker build -f Dockerfile -t joshcu/ngiab-dev . --no-cache
    For the image with everything in it, add --target ngen-dev onto the last docker build command.

And if you happen to want to break it again, instructions are here.

@TrupeshKumarPatel

Can you try a build based on #45? This might resolve the problem...

I tried this, but it doesn't seem to help.

@hellkite500

Do you get the same stack trace?

@TrupeshKumarPatel

yes, I get the same trace.

Thread 1 "ngen-serial" received signal SIGSEGV, Segmentation fault.
0x00007ffff746a2fc in __strlen_evex () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-8.el9.x86_64 libuuid-2.37.4-11.el9_2.x86_64
(gdb) bt
#0  0x00007ffff746a2fc in __strlen_evex () from /lib64/libc.so.6
#1  0x00007ffff744dd43 in strdup () from /lib64/libc.so.6
#2  0x00007ffff7daffe6 in NC_rcfile_insert () from /lib64/libnetcdf.so.19
#3  0x00007fffd3dadb0b in nc_rc_set ()
   from /usr/local/lib64/python3.9/site-packages/netCDF4/../netCDF4.libs/libnetcdf-98543858.so.19
#4  0x00007fffd8149dd7 in ?? ()
   from /usr/local/lib64/python3.9/site-packages/netCDF4/_netCDF4.cpython-39-x86_64-linux-gnu.so
#5  0x00007ffff7a4ef03 in PyModule_ExecDef () from /lib64/libpython3.9.so.1.0
#6  0x00007ffff7a4ee74 in _imp_exec_builtin () from /lib64/libpython3.9.so.1.0
#7  0x00007ffff79de217 in cfunction_vectorcall_O () from /lib64/libpython3.9.so.1.0
#8  0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#9  0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#10 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#11 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#12 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#13 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#14 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#15 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#16 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#17 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#18 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#19 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#20 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#21 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#22 0x00007ffff79d5d63 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#23 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#24 0x00007ffff7a4b855 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
--Type <RET> for more, q to quit, c to continue without paging--c
#25 0x00007ffff7a4b7ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#26 0x00007ffff7a4b79f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#27 0x00007ffff7a51253 in builtin_exec () from /lib64/libpython3.9.so.1.0
#28 0x00007ffff79df0d0 in cfunction_vectorcall_FASTCALL () from /lib64/libpython3.9.so.1.0
#29 0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#30 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#31 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#32 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#33 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#34 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#35 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#36 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#37 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#38 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#39 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#40 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#41 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#42 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#43 0x00007ffff79d5d63 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#44 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#45 0x00007ffff7a4b855 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
#46 0x00007ffff7a4b7ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#47 0x00007ffff7a4b79f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#48 0x00007ffff7a51253 in builtin_exec () from /lib64/libpython3.9.so.1.0
#49 0x00007ffff79df0d0 in cfunction_vectorcall_FASTCALL () from /lib64/libpython3.9.so.1.0
#50 0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#51 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#52 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#53 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#54 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#55 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#56 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#57 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#58 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#59 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#60 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#61 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#62 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#63 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#64 0x00007ffff79d5d63 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#65 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#66 0x00007ffff7a4b855 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
#67 0x00007ffff7a4b7ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#68 0x00007ffff7a4b79f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#69 0x00007ffff7a51253 in builtin_exec () from /lib64/libpython3.9.so.1.0
#70 0x00007ffff79df0d0 in cfunction_vectorcall_FASTCALL () from /lib64/libpython3.9.so.1.0
#71 0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#72 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#73 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#74 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#75 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#76 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#77 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#78 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#79 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#80 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#81 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#82 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#83 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#84 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#85 0x00007ffff79d5d63 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#86 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#87 0x00007ffff7a4b855 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
#88 0x00007ffff7a4b7ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#89 0x00007ffff7a4b79f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#90 0x00007ffff7a51253 in builtin_exec () from /lib64/libpython3.9.so.1.0
#91 0x00007ffff79df0d0 in cfunction_vectorcall_FASTCALL () from /lib64/libpython3.9.so.1.0
#92 0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#93 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#94 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#95 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#96 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#97 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#98 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#99 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#100 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#101 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#102 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#103 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#104 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#105 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#106 0x00007ffff79d5d63 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#107 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#108 0x00007ffff7a4b855 in _PyEval_EvalCodeWithName () from /lib64/libpython3.9.so.1.0
#109 0x00007ffff7a4b7ed in PyEval_EvalCodeEx () from /lib64/libpython3.9.so.1.0
#110 0x00007ffff7a4b79f in PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#111 0x00007ffff7a51253 in builtin_exec () from /lib64/libpython3.9.so.1.0
#112 0x00007ffff79df0d0 in cfunction_vectorcall_FASTCALL () from /lib64/libpython3.9.so.1.0
#113 0x00007ffff79d7e99 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#114 0x00007ffff79d0e1d in _PyEval_EvalCode () from /lib64/libpython3.9.so.1.0
#115 0x00007ffff79deb65 in _PyFunction_Vectorcall () from /lib64/libpython3.9.so.1.0
#116 0x00007ffff79d6f6b in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#117 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#118 0x00007ffff79d253a in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#119 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#120 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#121 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#122 0x00007ffff79d2261 in _PyEval_EvalFrameDefault () from /lib64/libpython3.9.so.1.0
#123 0x00007ffff79dedf3 in function_code_fastcall () from /lib64/libpython3.9.so.1.0
#124 0x00007ffff79de3de in object_vacall () from /lib64/libpython3.9.so.1.0
#125 0x00007ffff79e86cc in _PyObject_CallMethodIdObjArgs () from /lib64/libpython3.9.so.1.0
#126 0x00007ffff79e829a in PyImport_ImportModuleLevelObject () from /lib64/libpython3.9.so.1.0
#127 0x00007ffff79efeec in builtin___import__ () from /lib64/libpython3.9.so.1.0
#128 0x00007ffff79e8742 in cfunction_call () from /lib64/libpython3.9.so.1.0
#129 0x00007ffff79da664 in _PyObject_MakeTpCall () from /lib64/libpython3.9.so.1.0
#130 0x00007ffff79e199e in _PyObject_CallFunctionVa () from /lib64/libpython3.9.so.1.0
#131 0x00007ffff79efe33 in PyObject_CallFunction () from /lib64/libpython3.9.so.1.0
#132 0x00007ffff79efc13 in PyImport_Import () from /lib64/libpython3.9.so.1.0
#133 0x00007ffff7a54add in PyImport_ImportModule () from /lib64/libpython3.9.so.1.0
#134 0x0000000000433742 in pybind11::module_::import (name=0x3a69f20 "nwm_routing.__main__") at /ngen/extern/pybind11/include/pybind11/pybind11.h:1195
#135 0x0000000000435ea1 in utils::ngenPy::InterpreterUtil::importTopLevelModule (this=0x982750, topLevelName="nwm_routing.__main__") at /ngen/include/utilities/python/InterpreterUtil.hpp:332
#136 0x00000000007096a1 in utils::ngenPy::InterpreterUtil::getModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#137 0x000000000070962d in utils::ngenPy::InterpreterUtil::getPyModule(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#138 0x0000000000708d1b in routing_py_adapter::Routing_Py_Adapter::Routing_Py_Adapter(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#139 0x00000000004518c8 in std::make_unique<routing_py_adapter::Routing_Py_Adapter, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> () at /usr/include/c++/11/bits/unique_ptr.h:962
#140 0x0000000000417f6b in main (argc=6, argv=0x7fffffffe5d8) at /ngen/src/NGen.cpp:305
(gdb)

@TrupeshKumarPatel

PR #47, PR #45, and updating ngen.yaml fix the segmentation fault problem.

Tested on the x86 image, for both the serial and parallel versions.

@jameshalgren

Discussed all of this. Here is a summary of the issues that were addressed or observed while working on this:

  • NetCDF library issue (still working on root cause) — installing the python netCDF4 package used to compile its Cython extension modules locally. Recently, the modules started shipping pre-built, and the pre-built library bundled in the wheel can conflict with a different library on the system, causing the break.
  • The second issue was related to the hydrofabric, but based on the stack traces that may not have been the issue here. It only occurred with the 2.0 version of the hydrofabric, which we haven't used yet, so we'll need to watch for it as we transition to the new data. Upstream resolutions should have addressed it already.
  • There was an issue with the Cython binary (an environment variable). Pinning the version fixed that, though upstream Cython may have resolved this since.
  • There was a pyarrow issue, possibly unrelated, but it has cropped up.
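Given the first bullet (a pre-built wheel bundling its own libnetcdf), one standard mitigation is to rebuild the python netCDF4 package from source so its extension links the system library instead. `--no-binary` is a stock pip flag; whether it resolves this particular fault is an assumption:

```shell
# Rebuild netCDF4 from sdist so its extension links the system libnetcdf
# rather than the copy bundled inside the manylinux wheel (illustrative;
# requires netcdf and hdf5 development headers to be installed).
pip3 uninstall -y netCDF4
pip3 install --no-binary netCDF4 netCDF4
```

This trades a longer install for a single libnetcdf in the process, removing the wheel-vs-system library mismatch as a variable.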


arpita0911patel commented Nov 28, 2023

New input data AWI_03W_113060_002 was created from AWI_03W_113060_001 with an updated ngen.yaml file to address this issue. The README is updated to use the latest input data.

@arpita0911patel

Marking the issue as resolved as both serial and parallel runs are functioning correctly with the latest ARM image. Appreciate everyone's contributions!

@JoshCu JoshCu mentioned this issue Nov 29, 2023