Issue with the world_size argument #18
Comments
Could you please send me the exact code and command you are running? I did not use distributed training or inference. I wonder if the error has to do with your specific setup. I will try to help (:
Hello! I tried all four inference commands given in the README. Here is an example: For context, I am running this code on an A40 GPU in a lab I am part of. What I noticed is that in utils/misc.py there is a statement that sets args.world_size to os.environ['SLURM_NTASKS']. I changed the environment variable SLURM_NTASKS to 1, and the code did run on a few samples, but then the inference code crashed with the following error:
Also, the predicted counts for the few samples the code ran on were far off (most of them are in the hundreds or thousands when the ground truth count is between 10 and 20). This is likely caused by me manually changing the value of SLURM_NTASKS, but I'm not sure how else to get the code to run. Thank you for your help!
Edit: removing the --crop argument stopped the code from crashing, but the predicted counts are still off. The MAE ends up being around 820.84. Thanks again for your help!
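To make the behaviour described above concrete, here is a minimal sketch, not the repository's actual code. It assumes the statement in utils/misc.py simply replaces the command-line value with SLURM_NTASKS whenever that variable is set; `resolve_world_size` and its parameters are hypothetical names used only for illustration.

```python
import os


def resolve_world_size(cli_world_size, env=None):
    """Hypothetical stand-in for the override described above.

    Assumption: when SLURM_NTASKS is set, it replaces the value passed
    with --world_size; otherwise the command-line value is kept.
    """
    env = os.environ if env is None else env
    if "SLURM_NTASKS" in env:
        return int(env["SLURM_NTASKS"])
    return cli_world_size


# Inside a Slurm allocation that requested 6 tasks, --world_size 1 is ignored.
print(resolve_world_size(1, {"SLURM_NTASKS": "6"}))  # 6
# With the variable set to 1, or not set at all, one process is expected.
print(resolve_world_size(1, {"SLURM_NTASKS": "1"}))  # 1
print(resolve_world_size(1, {}))  # 1
```

If the real code behaves like this, it would explain why passing --world_size 1 has no effect while SLURM_NTASKS is 6, and why changing the variable to 1 lets a single process start.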
Hello!
I've been trying to run inference with the CountGD model, but am struggling. I have 1 GPU, and even though the world_size default is supposed to be 1, I get the following output when I run main_inference without --world_size specified explicitly.
world size: 6, world rank: 0, local rank: 0, device_count: 1
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/6 clients joined.
I've also tried using --world_size to explicitly specify the world_size to be 1, but that doesn't work either. What am I doing wrong?
Thanks for your help!
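The log above reports a world size of 6 with a single visible device, which suggests the value may be coming from the environment rather than from --world_size. A small diagnostic like the following can show which scheduler or launcher variables are set before running inference. This is only a sketch: apart from SLURM_NTASKS, which is mentioned elsewhere in this thread, which variables the project actually reads is an assumption, and `distributed_env` is a hypothetical helper name.

```python
import os

# Variables that commonly influence distributed setup. Apart from
# SLURM_NTASKS, whether this project reads any of them is an assumption.
DIST_VARS = (
    "SLURM_NTASKS",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "WORLD_SIZE",
    "RANK",
    "LOCAL_RANK",
)


def distributed_env(env=None):
    """Return the subset of DIST_VARS that is set in the given environment."""
    env = os.environ if env is None else env
    return {name: env[name] for name in DIST_VARS if name in env}


if __name__ == "__main__":
    found = distributed_env()
    if not found:
        print("No distributed-related variables are set.")
    for name, value in found.items():
        print(f"{name}={value}")
```

Running this in the same shell or job as the inference command shows whether a value such as 6 is being inherited from the job allocation.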