ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1416) of binary: #8

drilistbox · 2023-06-07T05:28:08Z

When I try to train the project with the command "bash tools/dist_train.sh /home/com14u07/changyongshu/projects/bev/Anchor3DLane/configs/openlane/anchor3dlane.py 1", the following error arises:

2023-06-07 13:25:33,779 - mmseg - INFO - Checkpoints will be saved to projects/bev/Anchor3DLane/output/openlane/anchor3dlane by HardDiskBackend.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1416) of binary: /workspace/miniconda3/envs/lane3d/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
)(*cmd_args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

          tools/train.py FAILED

=================================================
Root Cause:
[0]:
time: 2023-06-07_13:25:37
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 1416)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 1416"

Other Failures:
<NO_OTHER_FAILURES>
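
For readers unfamiliar with the negative exit code in the report above: a negative value means the child process was terminated by a signal with that number, which is why -11 appears together with "Signal 11 (SIGSEGV)". A minimal sketch of that mapping follows; the helper name `describe_exitcode` is hypothetical and is not part of PyTorch or this project, and the signal numbering assumes Linux.

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a child-process exit code into a readable description.

    Hypothetical helper for illustration only. A negative exit code
    means the process was killed by signal number -exitcode.
    """
    if exitcode < 0:
        signum = -exitcode
        return f"killed by signal {signum} ({signal.Signals(signum).name})"
    return f"exited with status {exitcode}"


print(describe_exitcode(-11))
```

Since the launcher only sees that the worker died from a segmentation fault, the traceback it prints describes the launcher itself, not the place in tools/train.py where the crash happened.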


spyflying · 2023-06-09T07:09:56Z

It seems that you can train normally for several iterations, but there was an error when saving the checkpoint. I feel that this error message is incomplete. Maybe you can check if there is a more specific error message earlier.
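
One possible way to get a more specific message for a SIGSEGV is Python's standard `faulthandler` module, which prints the Python-level traceback of each thread when the interpreter receives a fatal signal. The sketch below assumes it is placed near the top of the training entry point (for example tools/train.py); the function name `enable_crash_traceback` is hypothetical, and whether the resulting traceback points at the real cause depends on where the native crash occurs.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=None) -> bool:
    """Dump Python tracebacks of all threads on fatal signals such as SIGSEGV.

    Hypothetical helper for illustration. The stream must be a real file
    object with a file descriptor; the original stderr is used by default.
    """
    if stream is None:
        stream = sys.__stderr__
    if not faulthandler.is_enabled():
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Call this before the model and dataset are built, so that a later
# segmentation fault prints the Python frames that were active.
enable_crash_traceback()
```

The same effect should be available without editing any code by setting the environment variable `PYTHONFAULTHANDLER=1` in front of the `bash tools/dist_train.sh ...` command, since the variable is inherited by the worker processes. The last Python frame in the dump usually indicates which extension or operator was running when the process crashed.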
