The program stops by itself before finishing an epoch with multi-GPU training and prints many NCCL messages #224

Closed
LittlePotatoChip opened this issue Nov 29, 2021 · 11 comments

@LittlePotatoChip

[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800159 milliseconds before timing out.
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 132, in main
logger=logger,
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/apis/train.py", line 327, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
self.model, data_batch, train_mode=True, **kwargs
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 48, in forward
x = self.extract_feat(data)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 29, in extract_feat
x = self.neck(x)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/necks/rpn.py", line 153, in forward
x = self.blocks[i](x)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/utils/misc.py", line 93, in forward
input = module(input)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 36, in forward
torch.distributed.all_gather(mean_l, mean, process_group)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.
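
For context on the 1800000 ms figure above: that is the default 30-minute NCCL collective timeout. A minimal sketch of how one might raise it while debugging, assuming only the stock torch.distributed API (nothing specific to this repo); this only buys time, it does not fix whatever makes one rank stall inside SyncBatchNorm:

import datetime
import torch.distributed as dist

# Raise the collective timeout from the 30-minute default so a slow or
# briefly desynchronized rank does not trip the NCCL watchdog immediately.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)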

@LittlePotatoChip
Author

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800159 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 21024) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

@LittlePotatoChip
Author

INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_1/1/error.json
No Tensorflow
No Tensorflow
WORLD_SIZE:2
WORLD_SIZE:2
Traceback (most recent call last):
Traceback (most recent call last):
File "./tools/train.py", line 137, in
File "./tools/train.py", line 137, in
main()
main()
File "./tools/train.py", line 86, in main
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22884) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED.

@LittlePotatoChip
Author

2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_2/1/error.json
No Tensorflow
WORLD_SIZE:2
No Tensorflow
WORLD_SIZE:2
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23002) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED.

@LittlePotatoChip
Author

1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_3/1/error.json
No Tensorflow
WORLD_SIZE:2
No Tensorflow
WORLD_SIZE:2
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23112) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.005750894546508789 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "23112", "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "FAILED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "23113", "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "FAILED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [1], "role_rank": [1], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "SUCCEEDED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 3}}
/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

@LittlePotatoChip
Author


CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 23112 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in
main()
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


    ./tools/train.py FAILED        

=======================================
Root Cause:
[0]:
time: 2021-11-29_01:52:15
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 23112)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-11-29_01:52:15
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 23113)
error_file: <N/A>
msg: "Process failed with exitcode 1"


@tianweiy
Owner

well, I have no clue. It seems all these problems are related to apex.

You may consider
#202 (comment)

or try different apex versions.

@LittlePotatoChip
Author

> well, I have no clue. It seems all these problems are related to apex.
>
> You may consider #202 (comment)
>
> or try different apex versions.

I will try that on the next run; for now I'm using only one GPU with double the batch size.

@tianweiy
Owner

tianweiy commented Dec 3, 2021

Good, could you let me know if changing to torch syncbn fixes the problem? Thank you!

@LittlePotatoChip
Author

> Good, could you let me know if changing to torch syncbn fixes the problem? Thank you!

I tried different apex versions but it still failed. If I want to replace the apex usage with torch.nn.SyncBatchNorm, how can I replace code such as:

model = apex.parallel.convert_syncbn_model(model)

@tianweiy
Owner

tianweiy commented Dec 5, 2021

@LittlePotatoChip
Author

LittlePotatoChip commented Dec 5, 2021

In det3d/torchie/apis/train.py, I replaced:
model = apex.parallel.convert_syncbn_model(model)
with:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
It works, thanks.
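
For anyone hitting the same issue, a minimal sketch of that swap, assuming the usual pattern of converting before wrapping in torch DistributedDataParallel (the traceback above does go through torch/nn/parallel/distributed.py); local_rank is assumed to be set up elsewhere in train.py, and the surrounding code in det3d/torchie/apis/train.py may differ:

import torch
import torch.nn as nn

# was: model = apex.parallel.convert_syncbn_model(model)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# convert_sync_batchnorm swaps every BatchNorm*D layer for torch.nn.SyncBatchNorm,
# which synchronizes batch statistics over the default process group, so apex's
# optimized_sync_batchnorm (and its all_gather in the trace above) is no longer used.
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[local_rank], output_device=local_rank  # local_rank assumed
)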
