
Issue with the world_size argument #18

Open
shubhramishra07 opened this issue Nov 12, 2024 · 2 comments

Comments

shubhramishra07 commented Nov 12, 2024

Hello!

I've been trying to run inference with the CountGD model, but I'm struggling. I have 1 GPU, and even though the default world_size is supposed to be 1, I get the following output when I run main_inference.py without specifying --world_size explicitly.

world size: 6, world rank: 0, local rank: 0, device_count: 1
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/6 clients joined.

I've also tried passing --world_size 1 explicitly, but that doesn't work either. What am I doing wrong?
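
From the message, it looks like torch.distributed is waiting for 6 ranks while only my single process ever joins. Here is a minimal sketch of what I assume is happening (not code from this repo, just to illustrate the timeout):

import torch.distributed as dist

# With world_size=6 the rendezvous store waits for 6 processes to join,
# so a single launched process blocks waiting for the remaining ranks
# and eventually raises DistStoreError ("1/6 clients joined").
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",
    world_size=6,  # six ranks expected
    rank=0,        # but only this one exists
)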

Thanks for your help!

@niki-amini-naieni
Owner

Could you please send me the exact code and command you are running? I did not use distributed training or inference. I wonder if the error has to do with your specific setup. I will try to help (:

shubhramishra07 commented Nov 12, 2024

Hello! I tried it with all four inference commands given in the README. Here is an example:
python -u main_inference.py --output_dir ./countgd_val -c config/cfg_fsc147_val.py --eval --datasets config/datasets_fsc147_val.json --pretrain_model_path checkpoints/checkpoint_fsc147_best.pth --options text_encoder_type=checkpoints/bert-base-uncased --crop --remove_bad_exemplar

For context, I am running this code on an A40 GPU in a lab I am part of. I noticed that in utils/misc.py there is a statement that overrides args.world_size with os.environ['SLURM_NTASKS']. I ran echo $SLURM_NTASKS, and it returned 6, which explains why world_size kept being set to 6 even when I explicitly tried to set it to 1. With SLURM_NTASKS set to 6, I was getting the following timeout error:
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/6 clients joined.
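
As far as I can tell, the relevant logic in utils/misc.py is roughly like the following (paraphrased from memory, the exact code and function name may differ):

import os

def init_distributed_mode(args):  # name approximate, see utils/misc.py
    if "SLURM_NTASKS" in os.environ:
        # Inside a SLURM allocation with 6 tasks this silently overrides
        # whatever --world_size was passed on the command line.
        args.world_size = int(os.environ["SLURM_NTASKS"])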

I changed the environment variable SLURM_NTASKS to 1, and the code ran on a few samples, but then the inference code crashed with the following error:

File "/scr/shubhra/cs468/CountGD/main_inference.py", line 530, in <module>
    main(args)
  File "/scr/shubhra/cs468/CountGD/main_inference.py", line 382, in main
    count_mae, test_stats, coco_evaluator = evaluate(
  File "/scr/shubhra/miniconda3/envs/cs468/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scr/shubhra/cs468/CountGD/engine_inference.py", line 954, in evaluate
    abs_errs += get_count_errs(
  File "/scr/shubhra/cs468/CountGD/engine_inference.py", line 460, in get_count_errs
    logits_cropped = torch.cat(logits_cropped)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
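
If it helps, torch.cat() raises exactly this error when given an empty list, so it looks like the --crop path is producing no logits for some samples. A guard like the following avoids the crash, though it probably just hides whatever is going wrong upstream (hypothetical snippet, not a proposed fix):

import torch

logits_cropped = []  # e.g. no crops were produced for this sample
if len(logits_cropped) > 0:
    logits = torch.cat(logits_cropped)
else:
    logits = torch.empty(0)  # or skip this sample entirely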

Also, the predicted counts for the few samples the code did run on were very off (most of them are in the hundreds or thousands, while the ground truth counts are between 10 and 20). This is likely caused by my manually changing the value of SLURM_NTASKS, but I'm not sure how else to get the code to run. Thank you for your help!

Edit: removing the --crop argument stopped the code from crashing, but the predicted counts are still off. The MAE ends up being around 820.84. Thanks again for your help!
