Help: cuDNN error #59

Closed
wujsy opened this issue Nov 21, 2018 · 6 comments

Comments

@wujsy

wujsy commented Nov 21, 2018

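Hi, I cloned the latest code and followed the guide to train on the ST-CMDS and THCHS-30 datasets, and the error below appeared. Is this a version mismatch or some other problem? Thanks.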
TensorFlow:1.12.0
cuda:9.0
cudnn:7.3.1
gpu:Tesla V100-PCIE-16GB

[*Info] Create Model Successful, Compiles Model Successful.
[running] train epoch 0 .
[message] epoch 0 . Have train datas 0+
Epoch 1/1
2018-11-21 15:18:48.028180: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-11-21 15:18:48.058536: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train_mspeech.py", line 47, in <module>
    ms.TrainModel(datapath, epoch = 50, batch_size = 64, save_step = 500)
  File "/data/wujiaxing/workspace/ASR/ASRT_SpeechRecognition/SpeechModel251.py", line 179, in TrainModel
    self._model.fit_generator(yielddatas, save_step)
  File "/usr/local/lib/python3.5/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/conv2d_1/convolution_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training/Adam/gradients/conv2d_1/convolution_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_1/kernel/read)]]
[[{{node ctc/scan/while/Fill/_267}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_419_ctc/scan/while/Fill", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

@nl8590687
Owner

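Judging from the error message, your CUDA and cuDNN environment is probably not configured correctly.
For details on setting up the runtime environment for the GPU version of TF, see this article: "Installing the GPU version of TensorFlow on Linux"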

wujsy closed this as completed Nov 21, 2018
@wujsy
Author

wujsy commented Nov 21, 2018

Thanks for the reply. I set config.gpu_options.allow_growth = True and changed the batch size back to 16, and the cuDNN error is gone. Now I get a different error: OSError: Unable to create file (unable to open file: name = 'model_speech/m251/speech_model251_e_0_step_500.model', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)
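
For readers hitting the same cuDNN error, here is a minimal sketch of how the allow_growth setting is usually applied with TensorFlow 1.x and standalone Keras, the versions shown in the traceback. The helper name enable_gpu_memory_growth is hypothetical, and this thread does not show where ASRT creates its session, so calling it before the model is built is an assumption.

```python
def enable_gpu_memory_growth():
    """Create a TF 1.x session that grows GPU memory on demand and
    register it with Keras. Hypothetical helper, not part of ASRT."""
    # Imported inside the function so this file can be loaded on a
    # machine without TensorFlow; requires TensorFlow 1.x and Keras.
    import tensorflow as tf
    from keras import backend as K

    config = tf.ConfigProto()
    # Allocate GPU memory as needed instead of reserving almost all
    # of it when the session starts.
    config.gpu_options.allow_growth = True

    session = tf.Session(config=config)
    K.set_session(session)  # Keras will now run in this session
    return session


if __name__ == "__main__":
    # Assumption: call once, before the model is constructed.
    enable_gpu_memory_growth()
```

By default TensorFlow 1.x reserves nearly all GPU memory at session start. Growing the allocation on demand may leave enough room for cuDNN to create its handle, which would be consistent with the result reported here, though the thread does not confirm the root cause.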

wujsy reopened this Nov 21, 2018
@nl8590687
Owner

When training the m251 model, you need to use mkdir to create a directory named m251 under model_speech/. After that, saving will work.
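
The same step can be done from Python before training starts. This is a small sketch using only the standard library; the function name ensure_model_dir is hypothetical, and the default paths are taken from the error message above.

```python
import os


def ensure_model_dir(base_dir="model_speech", model_name="m251"):
    """Create base_dir/model_name if it is missing and return its path.
    Hypothetical helper, not part of ASRT."""
    path = os.path.join(base_dir, model_name)
    # exist_ok=True makes the call safe to repeat (Python 3.2+).
    os.makedirs(path, exist_ok=True)
    return path


if __name__ == "__main__":
    # Run from the repository root so the relative path matches the
    # one in the OSError message.
    print(ensure_model_dir())
```

Unlike a plain mkdir, os.makedirs also creates model_speech/ itself if it does not exist yet.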

@wujsy
Author

wujsy commented Nov 21, 2018

@nl8590687 thanks, it works

@huangx06

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize
I ran into this error before because my TensorFlow version was too new. Downgrading from 1.12 to 1.11 fixed it.

@krishuang08