Training Epoch 2/ 5:
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 839684 vs 529928
[vibhatha:91092] *** Process received signal ***
[vibhatha:91092] Signal: Aborted (6)
[vibhatha:91092] Signal code: (-6)
[vibhatha:91092] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7fc97c47c210]
[vibhatha:91092] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc97c47c18b]
[vibhatha:91092] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc97c45b859]
[vibhatha:91092] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e951)[0x7fc97864c951]
[vibhatha:91092] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa47c)[0x7fc97865847c]
[vibhatha:91092] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa4e7)[0x7fc9786584e7]
[vibhatha:91092] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa799)[0x7fc978658799]
[vibhatha:91092] [ 7] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair11prepareReadERNS1_2OpERNS_12NonOwningPtrINS1_13UnboundBufferEEER5iovec+0x1d9)[0x7fc94ec9c709]
[vibhatha:91092] [ 8] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair4readEv+0x62)[0x7fc94eca09d2]
[vibhatha:91092] [ 9] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair12handleEventsEi+0x1d8)[0x7fc94eca16b8]
[vibhatha:91092] [10] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Loop3runEv+0x4e0)[0x7fc94ec91d60]
[vibhatha:91092] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7fc978684d84]
[vibhatha:91092] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fc97c41c609]
[vibhatha:91092] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fc97c558293]
[vibhatha:91092] *** End of error message ***
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:490] op.preamble.length <= op.nbytes. 1001196 vs 932524
[vibhatha:91093] *** Process received signal ***
[vibhatha:91093] Signal: Aborted (6)
[vibhatha:91093] Signal code: (-6)
[vibhatha:91093] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f88b6c05210]
[vibhatha:91093] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f88b6c0518b]
[vibhatha:91093] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f88b6be4859]
[vibhatha:91093] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e951)[0x7f88b2dd5951]
[vibhatha:91093] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa47c)[0x7f88b2de147c]
[vibhatha:91093] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa4e7)[0x7f88b2de14e7]
[vibhatha:91093] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa799)[0x7f88b2de1799]
[vibhatha:91093] [ 7] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair11prepareReadERNS1_2OpERNS_12NonOwningPtrINS1_13UnboundBufferEEER5iovec+0x1d9)[0x7f8889425709]
[vibhatha:91093] [ 8] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair4readEv+0x62)[0x7f88894299d2]
[vibhatha:91093] [ 9] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Pair12handleEventsEi+0x1d8)[0x7f888942a6b8]
[vibhatha:91093] [10] /home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZN4gloo9transport3tcp4Loop3runEv+0x4e0)[0x7f888941ad60]
[vibhatha:91093] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7f88b2e0dd84]
[vibhatha:91093] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f88b6ba5609]
[vibhatha:91093] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f88b6ce1293]
[vibhatha:91093] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 6 leaked semlock objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/joblib/externals/loky/backend/resource_tracker.py:318: UserWarning: resource_tracker: There appear to be 6 leaked folder objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/utils/datasets/drug_resp_dataset.py:241: UserWarning: Heterogeneous Cylon Table Detected!. Use Numpy operations with Caution.
self.__drug_resp_array = self.__drug_resp_tb.to_numpy(zero_copy_only=False)
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
File "unoMT_baseline_pytorch.py", line 90, in <module>
main()
File "unoMT_baseline_pytorch.py", line 86, in main
run(params)
File "unoMT_baseline_pytorch.py", line 80, in run
modelUno.train()
File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/unoMT_pytorch_model.py", line 465, in train
train_drug_target(device=device,
File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/networks/functions/drug_target_func.py", line 35, in train_drug_target
F.nll_loss(input=out_target, target=target).backward()
File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [127.0.1.1]:22958
/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/utils/datasets/drug_resp_dataset.py:241: UserWarning: Heterogeneous Cylon Table Detected!. Use Numpy operations with Caution.
self.__drug_resp_array = self.__drug_resp_tb.to_numpy(zero_copy_only=False)
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:156: UserWarning: The epoch parameter in `scheduler.step()` was not necessary and is being deprecated where possible. Please use `scheduler.step()` to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case: https://github.com/pytorch/pytorch/issues/new/choose.
warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
Traceback (most recent call last):
File "unoMT_baseline_pytorch.py", line 90, in <module>
main()
File "unoMT_baseline_pytorch.py", line 86, in main
run(params)
File "unoMT_baseline_pytorch.py", line 80, in run
modelUno.train()
File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/unoMT_pytorch_model.py", line 465, in train
train_drug_target(device=device,
File "/home/vibhatha/sandbox/UNO/Benchmarks/Pilot1/UnoMT/networks/functions/drug_target_func.py", line 35, in train_drug_target
F.nll_loss(input=out_target, target=target).backward()
File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/vibhatha/venv/ENVAPPS/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:566] Read error [127.0.0.1]:26339: Connection reset by peer
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node vibhatha exited on signal 6 (Aborted).
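
One possible reading of the two `gloo::EnforceNotMet` failures above: the check `op.preamble.length <= op.nbytes` appears to fire when an incoming message is larger than the buffer the receiving rank prepared, which would happen if ranks entered the same collective with tensors of different sizes. This is an assumption, not a confirmed diagnosis. As a minimal debugging sketch under that assumption, each rank could record the byte size of every tensor it passes to a collective, and the records could then be compared offline. The helper name `find_size_mismatches` and the recorded values are hypothetical; the pairing of the two logged numbers with ranks 0 and 1 is for illustration only.

```python
from typing import Dict, List, Tuple


def find_size_mismatches(
    nbytes_by_rank: Dict[int, List[int]],
) -> List[Tuple[int, Dict[int, int]]]:
    """Return (step, {rank: nbytes}) for each step where ranks disagree.

    nbytes_by_rank maps a rank to the byte sizes it recorded, in call order,
    for the tensors it handed to collective operations (hypothetical input).
    """
    if not nbytes_by_rank:
        return []
    mismatches = []
    # Only compare the steps that every rank actually reached.
    n_steps = min(len(sizes) for sizes in nbytes_by_rank.values())
    for step in range(n_steps):
        sizes = {rank: seq[step] for rank, seq in nbytes_by_rank.items()}
        if len(set(sizes.values())) > 1:
            mismatches.append((step, sizes))
    return mismatches


# Hypothetical records: step 0 agrees, step 1 reuses the two sizes printed
# by the first enforce failure (839684 vs 529928). Which rank held which
# size is an assumption made for this example.
recorded = {0: [4096, 839684], 1: [4096, 529928]}
result = find_size_mismatches(recorded)
print(result)
```

If a mismatch shows up at some step, the per-rank batch or partition sizes feeding that step would be the first thing to check, for example whether the last batch of an epoch has a different length on different ranks.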