Error with CUDA_ERROR_ILLEGAL_ADDRESS #313

hgneng · 2023-02-13T02:15:10Z

I have successfully run training on an Ubuntu 22.04 machine without a GPU.

However, it fails on platform.virtaicloud: training aborts with CUDA_ERROR_ILLEGAL_ADDRESS.

# python3 train_speech_model.py 
2023-02-13 09:18:21.593221: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 09:18:22.312066: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-02-13 09:18:22.312218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22331 MB memory:  -> device: 0, name: B1.gpu.large, pci bus id: 0000:ff:1e.0, compute capability: 8.6
[ASRT] Compiles Model Successfully.
[ASRT Training] train epoch 1/50 .
/gemini/code/speech_model.py:120: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  self.trained_model.fit_generator(yielddatas, num_iterate, callbacks=call_back)
2023-02-13 09:18:27.142189: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-02-13 09:18:27.142384: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Aborted (core dumped)

What should I do? Anyone who has an idea can use the public environment image mentioned above to debug.

hgneng · 2023-02-14T01:42:28Z

Python version: 3.8.10
TensorFlow version: 2.8.4
CUDA version: cuda_11.2.r11.2/compiler.29618528_0

A related issue: https://github.com/tensorflow/tensorflow/issues/50735. I have tried CUDA_LAUNCH_BLOCKING=1, but with no luck.
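
For anyone reproducing this, here is a minimal sketch of how the debugging environment could be set up from Python. `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so an error tends to be reported closer to the operation that caused it, and `CUDA_VISIBLE_DEVICES=-1` hides the GPU so the same script can be checked on CPU. The helper names `debug_env` and `run_training` are hypothetical and not part of this repository; `train_speech_model.py` is the script from the log above.

```python
import os
import subprocess
import sys


def debug_env(base=None, use_gpu=True):
    """Return a copy of the environment with CUDA debugging variables set."""
    env = dict(os.environ if base is None else base)
    # Synchronous kernel launches: errors surface nearer the failing op.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if not use_gpu:
        # Hide all GPUs so TensorFlow falls back to the CPU.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def run_training(script="train_speech_model.py", use_gpu=True):
    """Run the training script in a child process and return its exit code."""
    result = subprocess.run([sys.executable, script], env=debug_env(use_gpu=use_gpu))
    return result.returncode
```

Calling `run_training(use_gpu=False)` on the cloud instance would show whether the crash is specific to the GPU path. The variables are set for a child process because they need to be in place before TensorFlow initializes CUDA.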

nl8590687 · 2023-02-14T03:54:00Z

What GPU and CPU hardware does this platform use? It looks like either the GPU or CPU memory is too small, or there is a hardware problem.

hgneng · 2023-02-15T00:41:09Z

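The hardware configuration is below; in theory it should be sufficient. Also, the crash happens right at the start of the run.

Instance type: B1.large
GPU: 1 GPU, 24 GB of memory per GPU
CPU: 8 cores, 24 GB of RAM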

nl8590687 · 2023-02-16T00:34:55Z

Can you choose the TensorFlow version yourself? Try configuring a different version.

hgneng · 2023-02-16T01:41:56Z

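I switched to a TensorFlow 2.8.0 image and got the same result. However, I found that the image I was using before was TensorFlow 2.10.1, yet the command below returns version 2.8.4 in both environments. I suspect I may be using it incorrectly...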

python3 -c 'import tensorflow as tf; print(tf.__version__)'
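
When the reported version does not match the image label, it can help to see which installed distribution Python actually resolves, without importing TensorFlow itself. The following is a minimal sketch using the standard-library `importlib.metadata` module (available from Python 3.8); the helper name `describe_package` is hypothetical, and the package names checked at the bottom are only examples.

```python
from importlib import metadata


def describe_package(name):
    """Return (version, install location) for a distribution, or None if absent."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    # locate_file("") resolves to the directory the distribution is installed in.
    return dist.version, str(dist.locate_file(""))


if __name__ == "__main__":
    for pkg in ("tensorflow", "tensorflow-gpu"):
        print(pkg, describe_package(pkg))
```

If the install location points to a user or pip site-packages directory rather than the one shipped with the image, a later `pip install` has probably shadowed the image's build.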

hgneng · 2023-02-17T02:09:21Z

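I figured out the cause of the problem above: the TensorFlow 2.8.4 version requirement is written in requirements.txt, so I need to change that file. I am a bit puzzled, though, about why the version requirements in requirements.txt are so strict: every entry is pinned to one exact version, instead of listing only the main packages and letting the rest be installed as dependencies. As it stands, changing the TensorFlow version causes other dependencies to fail, and I have to comment out several lines to get the install through.

After changing the TensorFlow version, running train still fails with CUDA_ERROR_ILLEGAL_ADDRESS. Checking the version also printed some new errors. I will ask the platform community; maybe the way I installed TensorFlow is wrong.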
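
As an illustration of the alternative described here, the sketch below relaxes exact `name==version` pins to bare package names while keeping a chosen few pinned. It is only a sketch under assumptions: the function name `relax_pins` is hypothetical, it handles only simple `==` lines (no extras, markers, or options), and the sample contents in the usage note are invented for the example and are not this repository's actual requirements.txt.

```python
def relax_pins(requirements_text, keep_pinned=()):
    """Turn `name==version` lines into bare names, except names in keep_pinned."""
    relaxed = []
    for raw in requirements_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, _version = line.partition("==")
        name = name.strip()
        if sep and name.lower() not in keep_pinned:
            # Drop the exact pin and let pip resolve a compatible version.
            relaxed.append(name)
        else:
            relaxed.append(line)
    return relaxed
```

For example, with the illustrative input `"tensorflow==2.8.4\nnumpy==1.22.0\nscipy"` and `keep_pinned=("tensorflow",)`, the TensorFlow pin is kept and the other entries become unpinned names.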

# python3 -c 'import tensorflow as tf; print(tf.__version__)'
2023-02-17 09:58:20.598049: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-17 09:58:20.701675: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-17 09:58:22.057285: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-02-17 09:58:23.898986: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479348: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479392: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2.10.1

hgneng · 2023-02-17T03:06:13Z

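I tried the TensorFlow 2.8.0, 2.9.3, and 2.10.1 builds provided by the images (not installed via pip), and all of them fail with CUDA_ERROR_ILLEGAL_ADDRESS. I am out of ideas for now.

One note: with TensorFlow provided by the image, I only installed matplotlib and scipy separately via pip and did not install requirements.txt. I get the impression that the full list in requirements.txt is not really necessary: only a few packages need to be installed, and the other dependencies are resolved automatically.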
