I am trying to replicate the experiment by running the pretraining script. Here is what I have done so far:

1. Downloaded the ILSVRC 2017 dataset from the ImageNet website and extracted it.
2. Ran the pretraining script after changing the dataset path in the file and setting `-n 2 -g 2`.

This setting gives me a timeout error when initializing the PyTorch distributed process group. Could you share which parameters you used while training?

Thank you
Error:
```
Traceback (most recent call last):
  File "imagenet_pretrain.py", line 424, in <module>
    main()
  File "imagenet_pretrain.py", line 421, in main
    mp.spawn(main_worker, nprocs=args.gpus, args=(args,))
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/shubhanshu/DSA2F/imagenet_pretrain.py", line 256, in main_worker
    rank=args.rank)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 258, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
```
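A note on what the error itself suggests: the barrier reports `world_size=4` but `worker_count=2`, i.e. `init_process_group` is waiting for four processes while only two ever join. Assuming the script follows the common PyTorch multi-node template where `-n` is the number of nodes and `-g` is GPUs per node (this flag semantics is an assumption, not confirmed from the script), `-n 2 -g 2` declares a world size of 2 × 2 = 4, but on a single machine only the two local GPU processes are spawned, so the barrier times out after 30 minutes. A minimal sketch of that arithmetic:

```python
# Sketch under an assumption: the script computes world_size = nodes * gpus,
# as in the standard PyTorch multi-node mp.spawn template. The attribute
# names `nodes`/`gpus` here are illustrative, not taken from the script.
from argparse import Namespace

def expected_world_size(args):
    # Total number of processes the store-based barrier waits for.
    return args.nodes * args.gpus

# -n 2 -g 2 run on one machine: the barrier expects 4 workers,
# but only the 2 locally spawned GPU processes register.
args = Namespace(nodes=2, gpus=2)
print(expected_world_size(args))   # 4, matching world_size=4 in the error

# Single-machine fix: declare one node so world_size equals the
# number of processes mp.spawn actually starts.
args_fixed = Namespace(nodes=1, gpus=2)
print(expected_world_size(args_fixed))   # 2
```

If this reading is right, running with `-n 1 -g 2` (or launching the second node so all four ranks join) should get past the barrier.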