The program stops by itself in less than an epoch with multi-GPU and prints many NCCL messages #224
Comments
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
CHILD PROCESS FAILED WITH NO ERROR_FILE
  from torch.distributed.elastic.multiprocessing.errors import record
  @record
warnings.warn(_no_error_file_warning_msg(rank, failure))
=======================================
Well, I have no clue. It seems all these problems are related to apex. You may consider trying different apex versions.
I will try that in my next run. For now I'm using only one GPU with double the batch size.
Good. Could you let me know if changing to torch syncbn fixes the problem? Thank you!
I tried different apex versions but it still failed. If I want to replace the apex usage with torch.nn.SyncBatchNorm, how can I replace code such as:
torch syncbn also has a convert function (torch.nn.SyncBatchNorm.convert_sync_batchnorm).
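For illustration, a minimal sketch (not CenterPoint code) of how the torch-native conversion is used, with nn.SyncBatchNorm.convert_sync_batchnorm playing the role that apex.parallel.convert_syncbn_model plays in the apex setup; the toy model below is only an example:

```python
# Minimal sketch: swap a model's BatchNorm layers for torch-native SyncBatchNorm.
import torch.nn as nn

# Toy model standing in for the detector; any module with BatchNorm layers works.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),   # will be replaced by nn.SyncBatchNorm
    nn.ReLU(),
)

# torch's counterpart of apex.parallel.convert_syncbn_model(model);
# process_group=None means statistics are synchronized over the default (world) group.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model, process_group=None)
```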
In det3d.torchie.apis.train.py, I tried to replace it, and I got the following:
[E ProcessGroupNCCL.cpp:566] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800159 milliseconds before timing out.
Traceback (most recent call last):
  File "./tools/train.py", line 137, in <module>
    main()
  File "./tools/train.py", line 132, in main
    logger=logger,
  File "/home/ruidong/workplace/CenterPoint/det3d/torchie/apis/train.py", line 327, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 543, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 410, in train
    self.model, data_batch, train_mode=True, **kwargs
  File "/home/ruidong/workplace/CenterPoint/det3d/torchie/trainer/trainer.py", line 368, in batch_processor_inline
    losses = model(example, return_loss=True)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 48, in forward
    x = self.extract_feat(data)
  File "/home/ruidong/workplace/CenterPoint/det3d/models/detectors/point_pillars.py", line 29, in extract_feat
    x = self.neck(x)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ruidong/workplace/CenterPoint/det3d/models/necks/rpn.py", line 153, in forward
    x = self.blocks[i](x)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ruidong/workplace/CenterPoint/det3d/models/utils/misc.py", line 93, in forward
    input = module(input)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm.py", line 85, in forward
    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, channel_last, self.fuse_relu)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 36, in forward
    torch.distributed.all_gather(mean_l, mean, process_group)
  File "/home/ruidong/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1909, in all_gather
    work = group.allgather([tensor_list], [tensor])
RuntimeError: NCCL communicator was aborted on rank 0.
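The traceback shows that apex/parallel/optimized_sync_batchnorm.py is still on the call path when the all_gather times out, i.e. the apex sync BN is still being used. For reference, a hedged sketch of what a fully torch-native wrapping could look like; the function wrap_model_for_ddp and its arguments are hypothetical and not the actual det3d train.py code:

```python
# Hypothetical sketch (not det3d code): replace apex sync BN + apex DDP
# with the torch-native equivalents.
import torch
import torch.nn as nn


def wrap_model_for_ddp(model: nn.Module, local_rank: int) -> nn.Module:
    # Swap every BatchNorm layer for nn.SyncBatchNorm so batch statistics
    # are synchronized across ranks without apex.
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    # torch's DistributedDataParallel in place of apex.parallel.DistributedDataParallel.
    return torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[local_rank], output_device=local_rank
    )
```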