
Issue with the world_size argument #18

Open
shubhramishra07 opened this issue Nov 12, 2024 · 2 comments

Comments

shubhramishra07 commented Nov 12, 2024

Hello!

I've been trying to run inference with the CountGD model, but I'm struggling. I have 1 GPU, and even though the default world_size is supposed to be 1, I get the following output when I run main_inference.py without specifying --world_size explicitly.

world size: 6, world rank: 0, local rank: 0, device_count: 1
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/6 clients joined.

I've also tried passing --world_size 1 explicitly, but that doesn't work either. What am I doing wrong?
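
From the message, it looks like torch.distributed is waiting for 6 ranks while only my single process ever joins. Here is a minimal sketch of what I assume is happening (not code from this repo, just to illustrate the timeout):

import torch.distributed as dist

# With world_size=6 the rendezvous store waits for 6 processes to join,
# so a single launched process blocks waiting for the remaining ranks
# and eventually raises DistStoreError ("1/6 clients joined").
dist.init_process_group(
    backend="nccl",
    init_method="tcp://127.0.0.1:29500",
    world_size=6,  # six ranks expected
    rank=0,        # but only this one exists
)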

Thanks for your help!

@niki-amini-naieni
Owner

Could you please send me the exact code and command you are running? I did not use distributed training or inference. I wonder if the error has to do with your specific setup. I will try to help (:

shubhramishra07 commented Nov 12, 2024

Hello! I tried it with all four inference commands given in the README. Here is an example:
python -u main_inference.py --output_dir ./countgd_val -c config/cfg_fsc147_val.py --eval --datasets config/datasets_fsc147_val.json --pretrain_model_path checkpoints/checkpoint_fsc147_best.pth --options text_encoder_type=checkpoints/bert-base-uncased --crop --remove_bad_exemplar

For context, I am running this code on an A40 GPU in a lab I am part of. I noticed that in utils/misc.py there is a statement that overrides args.world_size with os.environ['SLURM_NTASKS']. I ran echo $SLURM_NTASKS, and it returned 6, which explains why world_size kept being set to 6 even when I explicitly tried to set it to 1. With SLURM_NTASKS set to 6, I was getting the following timeout error:
torch.distributed.DistStoreError: Timed out after 601 seconds waiting for clients. 1/6 clients joined.
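
As far as I can tell, the relevant logic in utils/misc.py is roughly like the following (paraphrased from memory, the exact code and function name may differ):

import os

def init_distributed_mode(args):  # name approximate, see utils/misc.py
    if "SLURM_NTASKS" in os.environ:
        # Inside a SLURM allocation with 6 tasks this silently overrides
        # whatever --world_size was passed on the command line.
        args.world_size = int(os.environ["SLURM_NTASKS"])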

I changed the environment variable SLURM_NTASKS to 1, and the code ran on a few samples, but then the inference code crashed with the following error:

File "/scr/shubhra/cs468/CountGD/main_inference.py", line 530, in <module>
    main(args)
  File "/scr/shubhra/cs468/CountGD/main_inference.py", line 382, in main
    count_mae, test_stats, coco_evaluator = evaluate(
  File "/scr/shubhra/miniconda3/envs/cs468/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scr/shubhra/cs468/CountGD/engine_inference.py", line 954, in evaluate
    abs_errs += get_count_errs(
  File "/scr/shubhra/cs468/CountGD/engine_inference.py", line 460, in get_count_errs
    logits_cropped = torch.cat(logits_cropped)
RuntimeError: torch.cat(): expected a non-empty list of Tensors
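
If it helps, torch.cat() raises exactly this error when given an empty list, so it looks like the --crop path is producing no logits for some samples. A guard like the following avoids the crash, though it probably just hides whatever is going wrong upstream (hypothetical snippet, not a proposed fix):

import torch

logits_cropped = []  # e.g. no crops were produced for this sample
if len(logits_cropped) > 0:
    logits = torch.cat(logits_cropped)
else:
    logits = torch.empty(0)  # or skip this sample entirely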

Also, the predicted counts for the few samples the code did run on were very off (most of them are in the hundreds or thousands, while the ground truth counts are between 10 and 20). This is likely caused by my manually changing the value of SLURM_NTASKS, but I'm not sure how else to get the code to run. Thank you for your help!

Edit: removing the --crop argument stopped the code from crashing, but the predicted counts are still off. The MAE ends up being around 820.84. Thanks again for your help!
