Argument Values for Pretraining Script #11

Open
shubhanshu02 opened this issue Nov 14, 2022 · 0 comments

I am trying to replicate the experiment by running the pretraining script. This is what I have done so far:

  • Downloaded the ILSVRC 2017 dataset from the ImageNet website and extracted it.
  • Ran the pretraining script after changing the dataset path in the file and setting -n 2 -g 2.

With these settings I get a timeout error while the PyTorch distributed process group is being initialized. Could you share the parameters you used for training?

Thank you

Error:

Traceback (most recent call last):
  File "imagenet_pretrain.py", line 424, in <module>
    main()
  File "imagenet_pretrain.py", line 421, in main
    mp.spawn(main_worker, nprocs=args.gpus, args=(args,))
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/shubhanshu/DSA2F/imagenet_pretrain.py", line 256, in main_worker
    rank=args.rank)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/shubhanshu/.local/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 258, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
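
One possible reading of the last line, as a sketch only: the barrier expects world_size=4 workers, but only worker_count=2 ever joined. The arithmetic below assumes that -n is the number of nodes, -g is the number of GPUs per node, and that the script computes the world size as nodes * gpus. None of this is confirmed by the script itself; the function names are hypothetical and only illustrate the counts.

```python
def world_size(nodes: int, gpus_per_node: int) -> int:
    # Assumption: the script multiplies the two flags to get the world size.
    return nodes * gpus_per_node


def joined_workers(launched_nodes: int, gpus_per_node: int) -> int:
    # The traceback shows mp.spawn(main_worker, nprocs=args.gpus, ...),
    # i.e. one process per GPU on each machine where the script is started.
    return launched_nodes * gpus_per_node


def barrier_can_complete(nodes: int, gpus_per_node: int, launched_nodes: int) -> bool:
    # The store-based barrier only passes once every expected rank has joined.
    return joined_workers(launched_nodes, gpus_per_node) == world_size(nodes, gpus_per_node)


# -n 2 -g 2 started on a single machine: 4 ranks expected, 2 joined.
print(world_size(2, 2), joined_workers(1, 2), barrier_can_complete(2, 2, 1))
# -n 1 -g 2 started on a single machine: 2 ranks expected, 2 joined.
print(barrier_can_complete(1, 2, 1))
```

Under these assumptions, a single machine with two GPUs would correspond to -n 1 -g 2, while -n 2 would require the script to be started on a second machine as well.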