The program stops by itself before finishing an epoch with multi-GPU training and prints many NCCL messages #224

Closed
LittlePotatoChip opened this issue Nov 29, 2021 · 11 comments

@LittlePotatoChip

[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800159 milliseconds before timing out.
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 132, in main
logger=logger,
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/apis/train.py", line 327, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
self.model, data_batch, train_mode=True, **kwargs
File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 48, in forward
x = self.extract_feat(data)
File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 29, in extract_feat
x = self.neck(x)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/necks/rpn.py", line 153, in forward
x = self.blocks[i](x)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/workplace/CenterPoint/det3d/models/utils/misc.py", line 93, in forward
input = module(input)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 36, in forward
torch.distributed.all_gather(mean_l, mean, process_group)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.
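
For context on the 1800000 ms figure above: that is the default 30-minute NCCL collective timeout. A minimal sketch of how one might raise it while debugging, assuming only the stock torch.distributed API (nothing specific to this repo); this only buys time, it does not fix whatever makes one rank stall inside SyncBatchNorm:

import datetime
import torch.distributed as dist

# Raise the collective timeout from the 30-minute default so a slow or
# briefly desynchronized rank does not trip the NCCL watchdog immediately.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    timeout=datetime.timedelta(hours=2),
)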

@LittlePotatoChip
Author

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800159 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 21024) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

@LittlePotatoChip
Author

INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_1/1/error.json
No Tensorflow
No Tensorflow
WORLD_SIZE:2
WORLD_SIZE:2
Traceback (most recent call last):
Traceback (most recent call last):
File "./tools/train.py", line 137, in
File "./tools/train.py", line 137, in
main()
main()
File "./tools/train.py", line 86, in main
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22884) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED.

@LittlePotatoChip
Author

2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_2/1/error.json
No Tensorflow
WORLD_SIZE:2
No Tensorflow
WORLD_SIZE:2
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=6, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23002) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED.

@LittlePotatoChip
Author

1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[2, 2]
global_world_sizes=[2, 2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_ejkv2v7b/none_yx_zs2ai/attempt_3/1/error.json
No Tensorflow
WORLD_SIZE:2
No Tensorflow
WORLD_SIZE:2
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
Traceback (most recent call last):
File "./tools/train.py", line 137, in
main()
File "./tools/train.py", line 86, in main
torch.distributed.init_process_group(backend="nccl", init_method="env://")
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=2, worker_count=8, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23112) of binary: /home/ruidong/anaconda3/envs/centerpoint/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.005750894546508789 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "23112", "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "FAILED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [0], "role_rank": [0], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "23113", "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "FAILED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": "{"message": ""}", "metadata": "{"group_world_size": 1, "entry_point": "python", "local_rank": [1], "role_rank": [1], "role_world_size": [2]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "ubuntu-X299-UD4-Pro", "state": "SUCCEEDED", "total_run_time": 21726, "rdzv_backend": "static", "raw_error": null, "metadata": "{"group_world_size": 1, "entry_point": "python"}", "agent_restarts": 3}}
/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:

@LittlePotatoChip
Author


CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 23112 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in
main()
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


    ./tools/train.py FAILED        

=======================================
Root Cause:
[0]:
time: 2021-11-29_01:52:15
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 23112)
error_file: <N/A>
msg: "Process failed with exitcode 1"

Other Failures:
[1]:
time: 2021-11-29_01:52:15
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 23113)
error_file: <N/A>
msg: "Process failed with exitcode 1"


@tianweiy
Owner

well, I have no clue. It seems all these problems are related to apex.

You may consider
#202 (comment)

or try different apex versions.

@LittlePotatoChip
Author

> well, I have no clue. It seems all these problems are related to apex.
>
> You may consider #202 (comment)
>
> or try different apex versions.

I will try that on the next run; for now I'm using only one GPU with double the batch size.

@tianweiy
Owner

tianweiy commented Dec 3, 2021

Good, could you let me know if changing to torch syncbn fixes the problem? Thank you!

@LittlePotatoChip
Author

> Good, could you let me know if changing to torch syncbn fixes the problem? Thank you!

I tried different apex versions but it still failed. If I want to replace the apex usage with torch.nn.SyncBatchNorm, how can I replace code such as:

model = apex.parallel.convert_syncbn_model(model)

@tianweiy
Owner

tianweiy commented Dec 5, 2021

@LittlePotatoChip
Author

LittlePotatoChip commented Dec 5, 2021

In det3d/torchie/apis/train.py, I replaced:
model = apex.parallel.convert_syncbn_model(model)
with:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
It works, thanks.
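
For anyone hitting the same issue, a minimal sketch of that swap, assuming the usual pattern of converting before wrapping in torch DistributedDataParallel (the traceback above does go through torch/nn/parallel/distributed.py); local_rank is assumed to be set up elsewhere in train.py, and the surrounding code in det3d/torchie/apis/train.py may differ:

import torch
import torch.nn as nn

# was: model = apex.parallel.convert_syncbn_model(model)
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# convert_sync_batchnorm swaps every BatchNorm*D layer for torch.nn.SyncBatchNorm,
# which synchronizes batch statistics over the default process group, so apex's
# optimized_sync_batchnorm (and its all_gather in the trace above) is no longer used.
model = torch.nn.parallel.DistributedDataParallel(
    model.cuda(), device_ids=[local_rank], output_device=local_rank  # local_rank assumed
)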
