BART-Large: RuntimeError: CUDA error: the launch timed out and was terminated #2311

Open
jogonba2 opened this issue Jul 9, 2020 · 6 comments

jogonba2 commented Jul 9, 2020

❓ Questions and Help

What is your question?

I am fine-tuning BART-Large on a translation task through the fairseq command line, as in Bart-Fairseq, but before the second epoch starts, the process is terminated with the following error:


terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f9582a79536 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f9582cbcfbe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f9582a69abd in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f95cebab619 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorc$
frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f95ceba0f6a in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f95ceb7fef2 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python$
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f95ce542506 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x871b9b (0x7f95ceb80b9b in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2405b0 (0x7f95ce54f5b0 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2417fe (0x7f95ce5507fe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #11: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #12: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #13: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #14: + 0xfec08 (0x562f9774fc08 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #15: + 0x1100f7 (0x562f977610f7 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #16: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #17: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #18: + 0x110a97 (0x562f97761a97 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #19: + 0x110b34 (0x562f97761b34 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #20: + 0x1e91b3 (0x562f9783a1b3 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2966 (0x562f97823d96 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6a0 (0x562f97821ad0 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x416 (0x562f97821846 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x562f977c9a27 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x14ce (0x562f978228fe in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #30: PyEval_EvalCodeEx + 0x44 (0x562f977683c4 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #31: PyEval_EvalCode + 0x1c (0x562f977683ec in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #32: + 0x22f874 (0x562f97880874 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #33: PyRun_StringFlags + 0x7d (0x562f9788baad in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x3f (0x562f9788bb0f in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #35: + 0x23ac0d (0x562f9788bc0d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #36: _Py_UnixMain + 0x3c (0x562f9788bf7c in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #37: __libc_start_main + 0xe7 (0x7f95de6d5b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: + 0x1e0122 (0x562f97831122 in /home/ml/users/jgonza38/anaconda3/bin/python)

Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
main(args, init_distributed=True)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
valid_losses = train(args, trainer, task, epoch_itr, max_update)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
log_output = trainer.train_step(samples)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 412, in train_step
logging_outputs, sample_size, ooms, ignore=is_dummy_batch,
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 685, in _aggregate_logging_outputs
logging_outputs, *extra_stats_to_sum, ignore=ignore
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 746, in _fast_stat_sync_sum
group=self.data_parallel_process_group
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 292, in all_reduce_dict
cpu_data = _all_reduce_dict(cpu_data)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 288, in _all_reduce_dict
buf = torch.stack(list(data.values())).to(device=device)
RuntimeError: CUDA error: the launch timed out and was terminated


I also used CUDA_LAUNCH_BLOCKING=1 to get a "more detailed error":
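For reference, the variable was simply prepended to the training command (a sketch; $ALL_FLAGS is a placeholder standing for the full flag list from the "Code" section below):

# same invocation as in the "Code" section below, with the debug variable prepended
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" $ALL_FLAGS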


Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
main(args, init_distributed=True)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
valid_losses = train(args, trainer, task, epoch_itr, max_update)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
log_output = trainer.train_step(samples)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 399, in train_step
raise e
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 377, in train_step
ignore_grad=is_dummy_batch,
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/tasks/fairseq_task.py", line 342, in train_step
optimizer.backward(loss)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/optim/fairseq_optimizer.py", line 81, in backward
loss.backward()
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914855613/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8


My interpretation of the traceback for the first error ("RuntimeError: CUDA error: the launch timed out and was terminated") is that _all_reduce_dict takes too long on the CPU. Could the error be related to this? Is there an option to increase the CUDA timeout?
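If this is the display watchdog (my guess, not something confirmed by the fairseq or PyTorch docs), a quick check is whether either GPU is also driving a display; the nvidia-smi query below is a hypothetical diagnostic and assumes those query fields are available on this driver version:

# list each GPU and whether it is currently driving a display
# (the launch-timeout watchdog is normally only active on display GPUs)
nvidia-smi --query-gpu=index,name,display_active --format=csv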

Code

CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-epoch $MAX_EPOCHS \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --lr-scheduler polynomial_decay \
    --lr $LR \
    --total-num-update $TOTAL_NUM_UPDATES \
    --warmup-updates $WARMUP_UPDATES \
    --restore-file $BART/model.pt \
    --save-dir $RESULTS_PATH \
    --task translation \
    --source-lang source \
    --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer \
    --reset-dataloader \
    --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-betas "(0.9, 0.999)" \
    --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --no-last-checkpoints \
    --find-unused-parameters;

What have you tried?

  1. Reducing --max-tokens, in case the error was caused by a GPU memory limitation.
  2. Using --distributed-no-spawn (see issue #826, "cpp:272, unhandled system error"); a sketch combining attempts 1 and 2 is shown after this list.
  3. Using different versions of PyTorch (1.4.0 and 1.5.1).
  4. Using different versions of fairseq (the master branch and the latest December release).
  5. The experiment runs fine on another machine with 2x GeForce RTX 2080 Ti and the same python/pytorch/fairseq environment.
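A rough sketch of how attempts 1 and 2 were combined on top of the command above ($REDUCED_MAX_TOKENS and $COMMON_FLAGS are placeholders, the latter standing for all the other flags from the "Code" section):

# attempts 1 and 2 combined; everything else unchanged
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-tokens $REDUCED_MAX_TOKENS \
    --distributed-no-spawn \
    $COMMON_FLAGS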

What's your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.5.1
  • OS (e.g., Linux): Ubuntu 18.04.4 LTS
  • How you installed fairseq (pip, source): pip install --editable ./
  • Build command you used (if compiling from source):
  • Python version: 3.7.3
  • CUDA/cuDNN version: CUDA 10.2, CUDNN 7.6.5
  • GPU models and configuration: 2x GeForce GTX TITAN X
  • Any other relevant information:

myleott commented Jul 14, 2020

Hmm, I haven't seen this error before. Can you try adding --ddp-backend=no_c10d using the latest master branch?
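That is, the same command as in the issue with one extra flag (a sketch; $ALL_FLAGS is a placeholder for the unchanged flag list):

# original training command with the alternative DDP backend appended
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" $ALL_FLAGS --ddp-backend=no_c10d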

jogonba2 (Author) commented

I also tried adding --ddp-backend=no_c10d with the master branch I used for my experiments (not the latest one), and it threw the same error. I will test it with the latest master branch once the GPUs are available and will edit this comment.

Thank you!


alsheabi commented Feb 7, 2021

@jogonba2 Any solution? I am facing a similar error :(


aichase commented Jun 10, 2021

Same error after the validation, any solution?


Jxu-Thu commented Jul 1, 2021

Same error after the validation, any solution?


nxcvbc commented Apr 1, 2022

Same error after the validation, any solution?
