ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1416) of binary: #8

Open
drilistbox opened this issue Jun 7, 2023 · 1 comment

drilistbox commented Jun 7, 2023

When I try to train the project with the command "bash tools/dist_train.sh /home/com14u07/changyongshu/projects/bev/Anchor3DLane/configs/openlane/anchor3dlane.py 1", the following error occurs:

2023-06-07 13:25:33,779 - mmseg - INFO - Checkpoints will be saved to projects/bev/Anchor3DLane/output/openlane/anchor3dlane by HardDiskBackend.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1416) of binary: /workspace/miniconda3/envs/lane3d/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
)(*cmd_args)
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


          tools/train.py FAILED              

=================================================
Root Cause:
[0]:
time: 2023-06-07_13:25:37
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 1416)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 1416"

Other Failures:
<NO_OTHER_FAILURES>
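
For readers unfamiliar with the `exitcode: -11` line in the report above: a negative exit code conventionally means the worker was killed by the signal with that number, which matches the "Signal 11 (SIGSEGV)" message. The following is a minimal sketch of that decoding using only the standard library; the helper name `describe_exit_code` is hypothetical and is not part of PyTorch or Anchor3DLane.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Return a readable description of a child-process exit code.

    Hypothetical helper for illustration only.
    """
    if exitcode < 0:
        # A negative code means the process was terminated by signal -exitcode.
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


print(describe_exit_code(-11))  # killed by SIGSEGV
```

A segmentation fault usually originates in native code (a compiled extension or a library binary) rather than in Python itself, which is why the Python traceback above only shows the launcher and not the failing line.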


@spyflying
Collaborator

It seems that training ran normally for several iterations and then failed while saving the checkpoint. The error message looks incomplete, though. Could you check whether a more specific error message appears earlier in the log?
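
One possible way to get a more specific message for a SIGSEGV is Python's built-in `faulthandler` module, which dumps the Python traceback of every thread when the process receives a fatal signal. The sketch below is only an illustration: the function name `enable_crash_tracebacks` and the log file name are hypothetical, it assumes the launcher sets the `LOCAL_RANK` environment variable for each worker, and it assumes the function is called near the top of `tools/train.py` before training starts.

```python
import faulthandler
import os


def enable_crash_tracebacks(log_dir="."):
    """Write Python tracebacks to a per-rank file on fatal signals.

    Hypothetical helper; LOCAL_RANK is assumed to be set by the launcher
    and falls back to "0" when it is missing.
    """
    rank = os.environ.get("LOCAL_RANK", "0")
    path = os.path.join(log_dir, f"faulthandler_rank{rank}.log")
    log_file = open(path, "w")
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the traceback
    # of all threads into log_file.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this object alive: faulthandler writes to the
    # underlying file descriptor, so the file has to stay open.
    return log_file
```

Calling `crash_log = enable_crash_tracebacks()` once at startup and keeping `crash_log` referenced for the lifetime of the process should be enough. Without editing any code, setting the environment variable `PYTHONFAULTHANDLER=1` before running the training command enables the same handler and writes to stderr.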
