Getting segmentation fault on CentOS #88

Open
stefan-falk opened this issue Mar 11, 2021 · 0 comments

I am using warp-transducer successfully on other machines (Ubuntu 18.04), but on one machine, which runs CentOS, I get a segmentation fault right at the start of training.

I am not sure what is causing this. The only difference I can identify is that the CentOS machine uses gcc/g++ 4.8.5 (I also tried 5.3.1), whereas my other machines use 5.4.x. Could this be the cause?

Compilation Output

$ CUDA_HOME=/usr/local/cuda ./scripts/build_rnnt.sh
Removing existing build/ directory ..
#################################################################
Running cmake for warp-transducer ..
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/lib64/ccache/cc
-- Check for working C compiler: /usr/lib64/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found CUDA: /usr/local/cuda (found version "11.0") 
-- cuda found TRUE
-- Building shared library with GPU support
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    CMAKE_CXX_COMPILER_LAUNCHER
    CMAKE_C_COMPILER_LAUNCHER


-- Build files have been written to: /home/sfalk/workspaces/git/speech-v2/warp-transducer/build
#################################################################
Running make ..
[ 11%] Building NVCC (Device) object CMakeFiles/warprnnt.dir/src/./warprnnt_generated_rnnt_entrypoint.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target warprnnt
Linking CXX shared library libwarprnnt.so
[ 11%] Built target warprnnt
Scanning dependencies of target test_cpu
[ 22%] Building CXX object CMakeFiles/test_cpu.dir/tests/test_cpu.cpp.o
[ 33%] Building CXX object CMakeFiles/test_cpu.dir/tests/random.cpp.o
Linking CXX executable test_cpu
[ 33%] Built target test_cpu
[ 44%] Building NVCC (Device) object CMakeFiles/test_gpu.dir/tests/./test_gpu_generated_test_gpu.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target test_gpu
[ 55%] Building CXX object CMakeFiles/test_gpu.dir/tests/random.cpp.o
Linking CXX executable test_gpu
[ 55%] Built target test_gpu
Scanning dependencies of target test_time
[ 66%] Building CXX object CMakeFiles/test_time.dir/tests/test_time.cpp.o
[ 77%] Building CXX object CMakeFiles/test_time.dir/tests/random.cpp.o
Linking CXX executable test_time
[ 77%] Built target test_time
[ 88%] Building NVCC (Device) object CMakeFiles/test_time_gpu.dir/tests/./test_time_gpu_generated_test_time.cu.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target test_time_gpu
[100%] Building CXX object CMakeFiles/test_time_gpu.dir/tests/random.cpp.o
Linking CXX executable test_time_gpu
[100%] Built target test_time_gpu
#################################################################
Running setup.py for tensorflow bindings ..
2021-03-11 08:32:27.494442: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
setup.py:63: UserWarning: Assuming tensorflow was compiled without C++11 ABI. It is generally true if you are using binary pip package. If you compiled tensorflow from source with gcc >= 5 and didn't set -D_GLIBCXX_USE_CXX11_ABI=0 during compilation, you need to set environment variable TF_CXX11_ABI=1 when compiling this bindings. Also be sure to touch some files in src to trigger recompilation. Also, you need to set (or unsed) this environment variable if getting undefined symbol: _ZN10tensorflow... errors
  warnings.warn("Assuming tensorflow was compiled without C++11 ABI. "
running install
running bdist_egg
running egg_info
writing warprnnt_tensorflow.egg-info/PKG-INFO
writing dependency_links to warprnnt_tensorflow.egg-info/dependency_links.txt
writing top-level names to warprnnt_tensorflow.egg-info/top_level.txt
reading manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
writing manifest file 'warprnnt_tensorflow.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.8/warprnnt_tensorflow/__init__.py -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
copying build/lib.linux-x86_64-3.8/warprnnt_tensorflow/kernels.cpython-38-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg/warprnnt_tensorflow
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/__init__.py to __init__.cpython-38.pyc
creating stub loader for warprnnt_tensorflow/kernels.cpython-38-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/warprnnt_tensorflow/kernels.py to kernels.cpython-38.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying warprnnt_tensorflow.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
warprnnt_tensorflow.__pycache__.__init__.cpython-38: module references __path__
warprnnt_tensorflow.__pycache__.kernels.cpython-38: module references __file__
creating 'dist/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
creating /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
Extracting warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg to /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages
Adding warprnnt-tensorflow 0.1 to easy-install.pth file

Installed /home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg
Processing dependencies for warprnnt-tensorflow==0.1
Finished processing dependencies for warprnnt-tensorflow==0.1
(asr2) [sfalk@everestspeech-v2]$ python -c "from warprnnt_tensorflow import rnnt_loss"
2021-03-11 08:32:42.757357: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-03-11 08:32:44.642952: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
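
The setup.py warning in the build output says the bindings were compiled on the assumption that TensorFlow was built without the C++11 ABI. One way to check that assumption on the CentOS machine is to read the compile flags that the installed TensorFlow reports. The following is only a sketch: the helper names `cxx11_abi_from_flags` and `tensorflow_cxx11_abi` are made up for illustration, and the only TensorFlow call used is `tf.sysconfig.get_compile_flags()`.

```python
import re


def cxx11_abi_from_flags(flags):
    """Return 0 or 1 if a -D_GLIBCXX_USE_CXX11_ABI=<n> flag is present, else None."""
    for flag in flags:
        match = re.fullmatch(r"-D_GLIBCXX_USE_CXX11_ABI=([01])", flag)
        if match:
            return int(match.group(1))
    return None


def tensorflow_cxx11_abi():
    """Ask the installed TensorFlow which ABI setting it reports (hypothetical helper)."""
    # Imported lazily so the flag parser above can be used without TensorFlow.
    import tensorflow as tf

    return cxx11_abi_from_flags(tf.sysconfig.get_compile_flags())


if __name__ == "__main__":
    try:
        print("TensorFlow _GLIBCXX_USE_CXX11_ABI:", tensorflow_cxx11_abi())
    except ImportError:
        print("TensorFlow is not installed in this environment")
```

If this reports 1, the warning above suggests setting `TF_CXX11_ABI=1` and rebuilding the bindings. If it reports 0, the ABI assumption holds and the cause probably lies elsewhere.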

Segmentation Fault

Epoch 1/5000
Fatal Python error: Segmentation fault

Current thread 0x00007f8ea1ffa700 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 1853 in _create_c_op
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2015 in __init__
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3528 in _create_op_internal
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 590 in _create_op_internal
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 748 in _apply_op_helper
  File "<string>", line 80 in warp_rnnt
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/warprnnt_tensorflow-0.1-py3.8-linux-x86_64.egg/warprnnt_tensorflow/__init__.py", line 32 in rnnt_loss
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 252 in rnnt_loss_wrapper
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 209 in rnnt_gradient
  File "/home/sfalk/workspaces/git/speech-v2/asr/model/transducer/__init__.py", line 163 in train_step
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 788 in run_step
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 478 in _call_unconverted
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 396 in converted_call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 667 in wrapper
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 323 in run
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007f9389663740 (most recent call first):
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 302 in wait
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/threading.py", line 558 in wait
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 196 in _call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_run.py", line 93 in call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 628 in _call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2730 in call_for_each_replica
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1259 in run
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 795 in step_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 479 in _call_unconverted
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 396 in converted_call
  File "/tmp/tmpembj6sob.py", line 16 in tf__train_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/autograph/impl/api.py", line 459 in converted_call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 966 in wrapper
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 634 in wrapped_fn
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 990 in func_graph_from_py_func
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3196 in _create_graph_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 3361 in _maybe_define_function
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2969 in _get_concrete_function_internal_garbage_collected
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 725 in _initialize
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 871 in _call
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828 in __call__
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1100 in fit
  File "asr/bin/train_keras.py", line 256 in run_training
  File "asr/bin/train_keras.py", line 292 in main
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/absl/app.py", line 251 in _run_main
  File "/home/sfalk/miniconda3/envs/asr2/lib/python3.8/site-packages/absl/app.py", line 300 in run
  File "asr/bin/train_keras.py", line 381 in <module>
Segmentation fault (core dumped)
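
The traceback shows the crash inside `_create_c_op`, reached from `rnnt_loss`, so it may help to reproduce it outside the full training run. The sketch below runs a snippet in a fresh interpreter with `faulthandler` enabled and reports whether that child process died from SIGSEGV. The helper name `run_probe` is hypothetical. The probe shown only repeats the import from the build step, and a probe that actually calls `rnnt_loss` on small tensors would have to be written separately.

```python
import signal
import subprocess
import sys


def run_probe(code, timeout=120):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    Returns (crashed_with_sigsegv, stderr_text).
    """
    result = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX, a child killed by a signal reports the negative signal number.
    return result.returncode == -signal.SIGSEGV, result.stderr


if __name__ == "__main__":
    # Probe the import on its own, as in the build step above.
    crashed, stderr = run_probe("from warprnnt_tensorflow import rnnt_loss")
    print("import crashed with SIGSEGV:", crashed)
```

Running each probe in its own process keeps a crash from taking down the calling script, so several snippets can be tried in sequence, for example after rebuilding with a different compiler.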