It looks like training ran normally for several iterations and then failed while saving the checkpoint. The error message seems incomplete, though. Could you check whether a more specific error appears earlier in the log?
When I try to train the project with the command "bash tools/dist_train.sh /home/com14u07/changyongshu/projects/bev/Anchor3DLane/configs/openlane/anchor3dlane.py 1", the following error arises:
2023-06-07 13:25:33,779 - mmseg - INFO - Checkpoints will be saved to projects/bev/Anchor3DLane/output/openlane/anchor3dlane by HardDiskBackend.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 1416) of binary: /workspace/miniconda3/envs/lane3d/bin/python
Traceback (most recent call last):
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/run.py", line 692, in run
    )(*cmd_args)
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/miniconda3/envs/lane3d/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=================================================
Root Cause:
[0]:
time: 2023-06-07_13:25:37
rank: 0 (local_rank: 0)
exitcode: -11 (pid: 1416)
error_file: <N/A>
msg: "Signal 11 (SIGSEGV) received by PID 1416"
Other Failures:
<NO_OTHER_FAILURES>
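For anyone reading the report above: the negative exit code follows the usual subprocess convention, where a value of -N means the child process was terminated by signal N. A small sketch using the standard `signal` module (the function name `describe_exitcode` is just for illustration) shows how -11 maps to the SIGSEGV named in the `msg` field:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a short readable description."""
    if exitcode < 0:
        # A negative code means the child was killed by signal number -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


print(describe_exitcode(-11))
```

A segmentation fault means the crash happened in native code rather than in Python, which is why the launcher traceback contains no frames from the training script itself.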