BART-Large: RuntimeError: CUDA error: the launch timed out and was terminated #2311

Open
jogonba2 opened this issue Jul 9, 2020 · 6 comments

jogonba2 commented Jul 9, 2020

❓ Questions and Help

What is your question?

I am fine-tuning BART-Large on a translation task through the fairseq command line, as in Bart-Fairseq, but before the second epoch starts, the process is terminated with the following error:


terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f9582a79536 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x7ae (0x7f9582cbcfbe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f9582a69abd in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x1d9 (0x7f95cebab619 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorc$
frame #4: c10d::Reducer::~Reducer() + 0x23a (0x7f95ceba0f6a in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f95ceb7fef2 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python$
frame #6: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f95ce542506 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: + 0x871b9b (0x7f95ceb80b9b in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0x2405b0 (0x7f95ce54f5b0 in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x2417fe (0x7f95ce5507fe in /home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #11: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #12: + 0x10f998 (0x562f97760998 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #13: + 0x1ad971 (0x562f977fe971 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #14: + 0xfec08 (0x562f9774fc08 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #15: + 0x1100f7 (0x562f977610f7 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #16: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #17: + 0x11010d (0x562f9776110d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #18: + 0x110a97 (0x562f97761a97 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #19: + 0x110b34 (0x562f97761b34 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #20: + 0x1e91b3 (0x562f9783a1b3 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x2966 (0x562f97823d96 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x6a0 (0x562f97821ad0 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #24: _PyFunction_FastCallKeywords + 0xfb (0x562f977c979b in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x416 (0x562f97821846 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #26: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #27: _PyFunction_FastCallKeywords + 0x387 (0x562f977c9a27 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x14ce (0x562f978228fe in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #29: _PyEval_EvalCodeWithName + 0x2f9 (0x562f977674f9 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #30: PyEval_EvalCodeEx + 0x44 (0x562f977683c4 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #31: PyEval_EvalCode + 0x1c (0x562f977683ec in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #32: + 0x22f874 (0x562f97880874 in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #33: PyRun_StringFlags + 0x7d (0x562f9788baad in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #34: PyRun_SimpleStringFlags + 0x3f (0x562f9788bb0f in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #35: + 0x23ac0d (0x562f9788bc0d in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #36: _Py_UnixMain + 0x3c (0x562f9788bf7c in /home/ml/users/jgonza38/anaconda3/bin/python)
frame #37: __libc_start_main + 0xe7 (0x7f95de6d5b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: + 0x1e0122 (0x562f97831122 in /home/ml/users/jgonza38/anaconda3/bin/python)

Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
main(args, init_distributed=True)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
valid_losses = train(args, trainer, task, epoch_itr, max_update)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
log_output = trainer.train_step(samples)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 412, in train_step
logging_outputs, sample_size, ooms, ignore=is_dummy_batch,
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 685, in _aggregate_logging_outputs
logging_outputs, *extra_stats_to_sum, ignore=ignore
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 746, in _fast_stat_sync_sum
group=self.data_parallel_process_group
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 292, in all_reduce_dict
cpu_data = _all_reduce_dict(cpu_data)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/distributed_utils.py", line 288, in _all_reduce_dict
buf = torch.stack(list(data.values())).to(device=device)
RuntimeError: CUDA error: the launch timed out and was terminated


I also used CUDA_LAUNCH_BLOCKING=1 to get a "more detailed error":
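For reference, the variable was simply prepended to the training command (a sketch; $ALL_FLAGS is a placeholder standing for the full flag list from the "Code" section below):

# same invocation as in the "Code" section below, with the debug variable prepended
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" $ALL_FLAGS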


Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 324, in distributed_main
main(args, init_distributed=True)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 117, in main
valid_losses = train(args, trainer, task, epoch_itr, max_update)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq_cli/train.py", line 187, in train
log_output = trainer.train_step(samples)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/contextlib.py", line 74, in inner
return func(*args, **kwds)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 399, in train_step
raise e
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/trainer.py", line 377, in train_step
ignore_grad=is_dummy_batch,
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/tasks/fairseq_task.py", line 342, in train_step
optimizer.backward(loss)
File "/home/ml/users/jgonza38/reproduce_master_thesis/fairseq/fairseq/optim/fairseq_optimizer.py", line 81, in backward
loss.backward()
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ml/users/jgonza38/anaconda3/lib/python3.7/site-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914855613/work/torch/lib/c10d/ProcessGroupNCCL.cpp:32, unhandled cuda error, NCCL version 2.4.8


My interpretation of the traceback for the first error ("RuntimeError: CUDA error: the launch timed out and was terminated") is that _all_reduce_dict takes too long on the CPU. Could the error be related to this? Is there an option to increase the CUDA timeout?
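If this is the display watchdog (my guess, not something confirmed by the fairseq or PyTorch docs), a quick check is whether either GPU is also driving a display; the nvidia-smi query below is a hypothetical diagnostic and assumes those query fields are available on this driver version:

# list each GPU and whether it is currently driving a display
# (the launch-timeout watchdog is normally only active on display GPUs)
nvidia-smi --query-gpu=index,name,display_active --format=csv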

Code

CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-epoch $MAX_EPOCHS \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --lr-scheduler polynomial_decay \
    --lr $LR \
    --total-num-update $TOTAL_NUM_UPDATES \
    --warmup-updates $WARMUP_UPDATES \
    --restore-file $BART/model.pt \
    --save-dir $RESULTS_PATH \
    --task translation \
    --source-lang source \
    --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer \
    --reset-dataloader \
    --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --weight-decay 0.01 \
    --optimizer adam \
    --adam-betas "(0.9, 0.999)" \
    --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --no-last-checkpoints \
    --find-unused-parameters;

What have you tried?

  1. Reducing --max-tokens, in case the error was caused by a GPU memory limitation.
  2. Using --distributed-no-spawn (see issue #826, "cpp:272, unhandled system error"); a sketch combining attempts 1 and 2 is shown after this list.
  3. Using different versions of PyTorch (1.4.0 and 1.5.1).
  4. Using different versions of fairseq (the master branch and the latest December release).
  5. The experiment runs fine on another machine with 2x GeForce RTX 2080 Ti and the same python/pytorch/fairseq environment.
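A rough sketch of how attempts 1 and 2 were combined on top of the command above ($REDUCED_MAX_TOKENS and $COMMON_FLAGS are placeholders, the latter standing for all the other flags from the "Code" section):

# attempts 1 and 2 combined; everything else unchanged
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" \
    --max-tokens $REDUCED_MAX_TOKENS \
    --distributed-no-spawn \
    $COMMON_FLAGS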

What's your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.5.1
  • OS (e.g., Linux): Ubuntu 18.04.4 LTS
  • How you installed fairseq (pip, source): pip install --editable ./
  • Build command you used (if compiling from source):
  • Python version: 3.7.3
  • CUDA/cuDNN version: CUDA 10.2, CUDNN 7.6.5
  • GPU models and configuration: 2x GeForce GTX TITAN X
  • Any other relevant information:

myleott commented Jul 14, 2020

Hmm, I haven't seen this error before. Can you try adding --ddp-backend=no_c10d using the latest master branch?
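That is, the same command as in the issue with one extra flag (a sketch; $ALL_FLAGS is a placeholder for the unchanged flag list):

# original training command with the alternative DDP backend appended
CUDA_VISIBLE_DEVICES=0,1 python train.py "$FULL_TASK-bin" $ALL_FLAGS --ddp-backend=no_c10d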

jogonba2 (Author) commented

I also tried adding --ddp-backend=no_c10d with the master branch I used for my experiments (not the latest one), and it threw the same error. I will test it with the latest master branch once the GPUs are available and will edit this comment.

Thank you!


alsheabi commented Feb 7, 2021

@jogonba2 Any solution? I am facing a similar error :(


aichase commented Jun 10, 2021

Same error after the validation, any solution?


Jxu-Thu commented Jul 1, 2021

Same error after the validation, any solution?


nxcvbc commented Apr 1, 2022

Same error after the validation, any solution?
