
cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered #1399

Closed
Z-TAO opened this issue Feb 21, 2017 · 5 comments

Z-TAO (Contributor) commented Feb 21, 2017

The following error is reported during multi-GPU training:

F0221 14:00:17.320631 17204 hl_cuda_device.cc:566] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
F0221 14:00:17.320631 17206 hl_cuda_device.cc:566] Check failed: cudaSuccess == cudaStat (0 vs. 77) Cuda Error: an illegal memory access was encountered
*** Check failure stack trace: ***
    @           0x9ea800  google::LogMessage::Fail()
    @           0x9ea800  google::LogMessage::Fail()
    @           0x9ea75c  google::LogMessage::SendToLog()
    @           0x9ea75c  google::LogMessage::SendToLog()
    @           0x9ea0e0  google::LogMessage::Flush()
    @           0x9ea0e0  google::LogMessage::Flush()
    @           0x9ed187  google::LogMessageFatal::~LogMessageFatal()
    @           0x9ed187  google::LogMessageFatal::~LogMessageFatal()
    @           0x9b4437  hl_stream_synchronize()
    @           0x9b4437  hl_stream_synchronize()
    @           0x9c1007  hl_matrix_csr_mul_dense()
    @           0x61c79b  paddle::TrainerThread::valueDispatchThread()
    @           0x7c6cc7  paddle::GpuMatrix::mul()
    @           0x7ca5af  paddle::GpuMatrix::mul()
    @           0x6fa234  paddle::FullyConnectedLayer::forward()
    @           0x6411a4  paddle::NeuralNetwork::forward()
    @     0x7f60239508a0  execute_native_thread_routine
    @           0x61d161  paddle::TrainerThread::forward()
    @           0x61e35c  paddle::TrainerThread::computeThread()
    @     0x7f60248371c3  start_thread
    @     0x7f60239508a0  execute_native_thread_routine
    @     0x7f60230c112d  __clone
    @     0x7f60248371c3  start_thread
    @              (nil)  (unknown)

The problem shows up as follows:

  1. With multi-GPU training, the error rarely occurs when a large batch size is specified, but occurs very easily with a small batch size.
  2. The problem does not occur on a single GPU.

The network is very simple: just a 600000 * 256 fc layer.

trainer_config:

unit_word = data_layer(name='unit_words', size=word_dict_len)
rec_word = data_layer(name='recword', size=word_dict_len)
labels = data_layer(name='label', size=2)

unit_word_embedding = fc_layer(input=unit_word,
        size=256,
        act=IdentityActivation(),
        bias_attr=False,
        param_attr=ParamAttr(name='_source_language_embedding',
                             initial_mean=0, initial_std=0.01, sparse_update=True))

rec_word_embedding = fc_layer(input=rec_word,
        size=256,
        act=IdentityActivation(),
        bias_attr=False,
        param_attr=ParamAttr(name='_source_language_embedding',
                             initial_mean=0, initial_std=0.01, sparse_update=True))

output_embedding = fc_layer(input=[unit_word_embedding, rec_word_embedding],
        size=256,
        act=TanhActivation(),
        bias_attr=True)

output_embedding2 = fc_layer(input=output_embedding,
        size=64,
        act=TanhActivation(),
        bias_attr=True)

final_output = fc_layer(input=output_embedding2,
        size=2,
        act=SoftmaxActivation(),
        bias_attr=True)

cost = cross_entropy(input=final_output, label=labels)

dataprovider:

def hook(settings, word_dict, **kwargs):
    settings.word_dict = word_dict
    settings.line_idx = 0
    # slots: a sparse id/value vector, a sparse binary vector, and an integer label (0/1)
    settings.slots = [
        sparse_vector(len(word_dict)),
        sparse_binary_vector(len(word_dict)),
        integer_value(2)
    ] 
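
The issue quotes only the init hook. For context, a process() generator matching these three slots could look roughly like the sketch below; the tab-separated file format and the whole function body are illustrative assumptions, not the reporter's code.

from paddle.trainer.PyDataProvider2 import *

# Hypothetical sketch only -- the actual process() is not shown in the issue.
# Assumes each input line is "unit_ids<TAB>rec_ids<TAB>label" with space-separated ids.
@provider(init_hook=hook)
def process(settings, file_name):
    with open(file_name) as f:
        for line in f:
            units, recs, label = line.strip().split('\t')
            # slot 0: sparse_vector expects (index, value) pairs
            unit_ids = [(int(i), 1.0) for i in units.split()]
            # slot 1: sparse_binary_vector expects plain indices
            rec_ids = [int(i) for i in recs.split()]
            # slot 2: integer_value(2) expects 0 or 1
            yield unit_ids, rec_ids, int(label)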
hedaoyuan (Contributor) commented Feb 21, 2017

A similar problem was fixed before; see issue #307 and PR #133.
Please check whether your build includes the change from PR #133.

hedaoyuan self-assigned this Feb 21, 2017
Z-TAO (Contributor, Author) commented Feb 22, 2017

That does not solve it. The code was built from the latest master and the problem still occurs.

Z-TAO (Contributor, Author) commented Feb 22, 2017

This problem seems mainly related to the batch size: the larger it is, the less likely the error is to appear, but it always shows up eventually.

gangliao added the Bug label Feb 23, 2017
jerryitp commented
This error is probably reported when one of the GPU cards ends up with no data. For example, with 4 cards, if the last batch works out to only 3 sequences, this happens. Try modifying the dataprovider; see the sketch below.
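
One way to do that, sketched under the assumption that BATCH_SIZE matches the trainer's batch size and with parse_sample() as a hypothetical helper, is to yield only full batches so the final, undersized batch never reaches the multi-GPU split:

from paddle.trainer.PyDataProvider2 import *

BATCH_SIZE = 1024  # assumption: must match the batch_size passed to the trainer

@provider(init_hook=hook)
def process(settings, file_name):
    buf = []
    with open(file_name) as f:
        for line in f:
            # parse_sample() is a hypothetical helper that turns one line into
            # the (sparse_vector, sparse_binary_vector, label) tuple used above
            buf.append(parse_sample(line))
            if len(buf) == BATCH_SIZE:
                for sample in buf:
                    yield sample
                buf = []
    # the trailing partial batch is dropped on purpose, so every batch handed
    # to the multi-GPU split contains BATCH_SIZE sequences

If dropping data is not acceptable, padding the last batch by repeating samples would be an alternative with the same effect of giving every card at least one sequence.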

hedaoyuan (Contributor) commented
@jerryitp The case where one GPU card receives no data and triggers this error was fixed in #1566. Please try again with the latest code.

pkuyym closed this as completed Aug 5, 2017