'CUDA error: an illegal memory access was encountered' in forward  #308

@gongwei-130

Hi, I'm running into the following error when attempting to train BERT with ds_train_bert_bsz64k_seq128_m.sh. I printed the shapes of all tensors in the batch and they look fine. I set train_micro_batch_size_per_gpu=8 and train_batch_size=64 because I have 8 GPUs.
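For reference, the batch-size settings above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the helper name `gradient_accumulation_steps` is hypothetical and not part of DeepSpeed, and it assumes the usual relation train_batch_size = train_micro_batch_size_per_gpu x gradient accumulation steps x number of GPUs.

```python
def gradient_accumulation_steps(train_batch_size, micro_batch_per_gpu, num_gpus):
    """Hypothetical helper: derive the accumulation steps implied by the config.

    Assumes train_batch_size = micro_batch_per_gpu * steps * num_gpus.
    """
    samples_per_step = micro_batch_per_gpu * num_gpus
    if train_batch_size % samples_per_step != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not a multiple of "
            f"micro_batch_per_gpu * num_gpus = {samples_per_step}"
        )
    return train_batch_size // samples_per_step


# Values from this report: 64 total, 8 per GPU, 8 GPUs.
steps = gradient_accumulation_steps(64, 8, 8)
print(steps)
```

With the values from this report the sizes are consistent and imply a single accumulation step, so the batch configuration itself does not look like the cause.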

This error occurs during the forward pass of the first training step.

08/07/2020 15:02:47 - INFO - turing.logger -   worker-0: begin epoch 1 current_sample_count 0 shard_length 1000 global_data_samples 0
  0%|                                                                                                                              | 0/1000 [00:00<?, ?it/s]
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py:1061: UserWarning: This overload of nonzero is deprecated:
        nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
        nonzero(Tensor input, *, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  (masked_lm_labels + 1).view(-1)).view(-1)
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
  0%|                                                                                                                              | 0/1000 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 598, in <module>
    main()
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 591, in main
    run(args, model, optimizer, start_epoch)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 555, in run
    train_tfrecords(args, index, model, optimizer, train_data)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 238, in train_tfrecords
    loss = model.network(batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_light.py", line 691, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 1056, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 977, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 594, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_cuda.py", line 520, in forward
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_cuda.py", line 196, in forward
    config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f74df4931e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f74df6e1f92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f74df4819cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7f74f041a322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7f74f041a3c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7f74f4fe109b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)
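
Because CUDA kernels are launched asynchronously, the Python frame shown in the traceback above may not be the call that actually faulted. One way to narrow this down is to rerun with synchronous kernel launches. The snippet below is a minimal sketch: it assumes it is placed at the very top of the training script, before anything initializes CUDA, and that the variable reaches every worker process (when using a launcher, exporting it in the shell may be needed instead).

```python
import os

# CUDA_LAUNCH_BLOCKING=1 makes each kernel launch synchronous, so an
# illegal memory access is reported at the call that caused it.
# This must be set before CUDA is initialized in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Import torch / deepspeed and start training only after this point.
print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

This slows training noticeably, so it is meant for a short debugging run rather than regular use.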
