PyTorch MPI-GLOO Runtime Error #4

@vibhatha

Description

Training Epoch   2/  5:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 839684 vs 529928
[vibhatha:91092] *** Process received signal ***
[vibhatha:91092] Signal: Aborted (6)
[vibhatha:91092] Signal code:  (-6)
[vibhatha:91092] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fc97c47c210]
[vibhatha:91092] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc97c47c18b]
[vibhatha:91092] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc97c45b859]
[vibhatha:91092] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e951)[0x7fc97864c951]
[vibhatha:91092] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa47c)[0x7fc97865847c]
[vibhatha:91092] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa4e7)[0x7fc9786584e7]
[vibhatha:91092] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa799)[0x7fc978658799]
[vibhatha:91092] [ 7] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair11prepareReadERNS1_2OpERNS_12NonOwningPtrINS1_13UnboundBufferEEER5iovec+0x1d9)[0x7fc94ec9c709]
[vibhatha:91092] [ 8] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair4readEv+0x62)[0x7fc94eca09d2]
[vibhatha:91092] [ 9] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair12handleEventsEi+0x1d8)[0x7fc94eca16b8]
[vibhatha:91092] [10] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Loop3runEv+0x4e0)[0x7fc94ec91d60]
[vibhatha:91092] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7fc978684d84]
[vibhatha:91092] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fc97c41c609]
[vibhatha:91092] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fc97c558293]
[vibhatha:91092] *** End of error message ***
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 1001196 vs 932524
[vibhatha:91093] *** Process received signal ***
[vibhatha:91093] Signal: Aborted (6)
[vibhatha:91093] Signal code:  (-6)
[vibhatha:91093] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f88b6c05210]
[vibhatha:91093] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f88b6c0518b]
[vibhatha:91093] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f88b6be4859]
[vibhatha:91093] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e951)[0x7f88b2dd5951]
[vibhatha:91093] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa47c)[0x7f88b2de147c]
[vibhatha:91093] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa4e7)[0x7f88b2de14e7]
[vibhatha:91093] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa799)[0x7f88b2de1799]
[vibhatha:91093] [ 7] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair11prepareReadERNS1_2OpERNS_12NonOwningPtrINS1_13UnboundBufferEEER5iovec+0x1d9)[0x7f8889425709]
[vibhatha:91093] [ 8] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair4readEv+0x62)[0x7f88894299d2]
[vibhatha:91093] [ 9] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair12handleEventsEi+0x1d8)[0x7f888942a6b8]
[vibhatha:91093] [10] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Loop3runEv+0x4e0)[0x7f888941ad60]
[vibhatha:91093] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7f88b2e0dd84]
[vibhatha:91093] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f88b6ba5609]
[vibhatha:91093] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f88b6ce1293]
[vibhatha:91093] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 6 leaked semlock objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 6 leaked folder objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/utils/datasets/drug_resp_dataset.py:241: UserWarning: Heterogeneous Cylon Table Detected!. Use Numpy operations with Caution.
  self.__drug_resp_array = self.__drug_resp_tb.to_numpy(zero_copy_only=False)
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
  File "unoMT_baseline_pytorch.py", line 90, in <module>
    main()
  File "unoMT_baseline_pytorch.py", line 86, in main
    run(params)
  File "unoMT_baseline_pytorch.py", line 80, in run
    modelUno.train()
  File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/unoMT_pytorch_model.py", line 465, in train
    train_drug_target(device=device,
  File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/networks/functions/drug_target_func.py", line 35, in train_drug_target
    F.nll_loss(input=out_target, target=target).backward()
  File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [127.0.1.1]:22958
/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/utils/datasets/drug_resp_dataset.py:241: UserWarning: Heterogeneous Cylon Table Detected!. Use Numpy operations with Caution.
  self.__drug_resp_array = self.__drug_resp_tb.to_numpy(zero_copy_only=False)
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
  warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
  File "unoMT_baseline_pytorch.py", line 90, in <module>
    main()
  File "unoMT_baseline_pytorch.py", line 86, in main
    run(params)
  File "unoMT_baseline_pytorch.py", line 80, in run
    modelUno.train()
  File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/unoMT_pytorch_model.py", line 465, in train
    train_drug_target(device=device,
  File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/networks/functions/drug_target_func.py", line 35, in train_drug_target
    F.nll_loss(input=out_target, target=target).backward()
  File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [127.0.0.1]:26339: Connection reset by peer
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node vibhatha exited on signal 6 (Aborted).
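For anyone hitting the same `EnforceNotMet`: gloo raises `op.preamble.length <= op.nbytes` when a peer's incoming message is larger than the receive buffer this rank posted, which usually means the ranks disagreed on the payload size of the tensor being reduced (here 839684 vs 529928 bytes on one rank, 1001196 vs 932524 on the other — consistent with uneven data partitioning across ranks). A pre-collective sanity check can surface this before gloo aborts. The sketch below is plain Python over per-rank byte counts; in a real run the sizes would be gathered with something like `torch.distributed.all_gather_object` before the allreduce, and the function name here is our own:

```python
def check_collective_sizes(sizes_by_rank):
    """Raise before a collective if ranks disagree on the payload size.

    sizes_by_rank: list of byte counts, one per rank (e.g. gathered
    across the process group before an allreduce).
    """
    expected = sizes_by_rank[0]
    # Collect every rank whose posted size differs from rank 0's.
    mismatched = {rank: n for rank, n in enumerate(sizes_by_rank) if n != expected}
    if mismatched:
        raise RuntimeError(
            f"collective size mismatch: rank 0 posts {expected} bytes, "
            f"but ranks {sorted(mismatched)} post {mismatched}"
        )
```

Running a check like this once per epoch is cheap and turns a hard abort inside gloo's TCP pair into an actionable Python exception naming the offending ranks.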
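Separately, the two `lr_scheduler` warnings in the log are unrelated to the crash but easy to silence: since PyTorch 1.1.0 the scheduler must be stepped *after* the optimizer, and the `epoch=` argument to `scheduler.step()` is deprecated. A minimal sketch of the corrected ordering (model and schedule here are illustrative, not UnoMT's):

```python
import torch

model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)

for _ in range(2):
    opt.zero_grad()
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    opt.step()    # optimizer first ...
    sched.step()  # ... then the scheduler, with no epoch= argument
```

Stepping in this order keeps the first learning-rate value of the schedule from being skipped, which is exactly what the warning cautions about.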
