Frequently Asked Questions #109

Open
rentainhe opened this issue Oct 19, 2022 · 12 comments
Labels
question Further information is requested

Comments

@rentainhe
Collaborator

rentainhe commented Oct 19, 2022

We keep this issue open to collect frequently asked questions and their solutions from the users.

Feel free to leave your comment here if you find any frequent issues and have ways to help others to solve them.

Notes

  • If you run into convergence problems when training with fewer GPUs, try a larger batch size (for example 8 or 16) by setting dataloader.train.total_batch_size, as discussed in this issue: Convergence problem on coco with less gpus. #219
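
As a rough illustration of the note above, the sketch below works out how many images each GPU processes per iteration for a given total batch size. The helper name per_gpu_batch_size is hypothetical and not part of detrex; it assumes the total batch is split evenly across GPUs, so the total must be divisible by the GPU count.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Return the number of images each GPU sees per iteration.

    Hypothetical helper for illustration only; assumes the total batch
    is divided evenly across GPUs.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError(
            f"total_batch_size={total_batch_size} is not divisible by num_gpus={num_gpus}"
        )
    return total_batch_size // num_gpus


# With 2 GPUs, a total batch size of 16 means 8 images per GPU.
print(per_gpu_batch_size(16, 2))
```

In a config this would correspond to setting dataloader.train.total_batch_size = 16, or to passing dataloader.train.total_batch_size=16 as a command-line override.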

FAQs

1. ImportError: Cannot import 'detrex._C', therefore 'MultiScaleDeformableAttention' is not available.

detrex needs the CUDA toolkit to build the MultiScaleDeformableAttention operator. If CUDA is installed correctly, you usually do not need to set the CUDA_HOME environment variable, because the default CUDA path is /usr/local/cuda. If you find that your CUDA_HOME is None, you can fix it as follows:

  • If CUDA is already installed in your environment, set the environment variable (here we take cuda-11.3 as an example):
export CUDA_HOME=/path/to/cuda-11.3/
  • If CUDA is not installed in your environment, install it by following the CUDA Toolkit Installation guide, then set the environment variable CUDA_HOME.
  • After setting CUDA_HOME, rebuild detrex by running pip install -e .

You can also refer to these issues for more details: #98, #85
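
Before rebuilding, it can help to check what a build would find. The following is a minimal, stdlib-only sketch; the function guess_cuda_home and its lookup order (CUDA_HOME, then nvcc on PATH, then /usr/local/cuda) are assumptions made for illustration, not detrex's actual detection code.

```python
import os
import shutil
from typing import Mapping, Optional


def guess_cuda_home(
    env: Mapping[str, str] = os.environ,
    default: str = "/usr/local/cuda",
) -> Optional[str]:
    """Guess the CUDA install directory, or return None if nothing is found.

    Illustrative lookup order (an assumption, not detrex's code):
    1. the CUDA_HOME environment variable,
    2. the directory two levels above an nvcc found on PATH,
    3. the conventional default location, if it exists.
    """
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        return cuda_home

    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        # nvcc normally lives in <cuda_home>/bin/nvcc
        return os.path.dirname(os.path.dirname(nvcc))

    return default if os.path.isdir(default) else None


if __name__ == "__main__":
    home = guess_cuda_home()
    if home is None:
        print("No CUDA toolkit found; install it and set CUDA_HOME.")
    else:
        print(f"CUDA toolkit appears to be at {home}")
```

If this reports no toolkit, or a different version than expected, export CUDA_HOME explicitly as shown above before running pip install -e . again.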

2. How to not filter empty annotations during training.

There are three ways to keep images with empty annotations during training.

  1. Modify the config in configs/common/data/coco_detr.py as follows:
dataloader.train = L(build_detection_train_loader)(
    dataset=L(get_detection_dataset_dicts)(names="coco_2017_train", filter_empty=False),
    ...,
)
  2. Modify a project config such as dino_r50_4scale_24ep.py:
# your config.py
from detrex.config import get_config

dataloader = get_config("common/data/coco_detr.py").dataloader

# modify dataloader config
# do not filter empty annotations during training
dataloader.train.dataset.filter_empty = False
  3. Override the config from the command line when launching training:
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8 dataloader.train.dataset.filter_empty=False

You can also refer to these issues for more details: #78 (comment)
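
To make the effect of this flag concrete, here is a simplified stand-in for what filtering empty annotations does to a dataset. The function drop_images_without_objects is hypothetical and written only for illustration; it assumes the detectron2 dataset-dict layout, where each record carries an "annotations" list whose entries may have an "iscrowd" flag.

```python
from typing import Any, Dict, List


def drop_images_without_objects(
    dataset_dicts: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Keep only records with at least one non-crowd annotation.

    Simplified illustration of the filtering that filter_empty=True
    enables; with filter_empty=False this step is skipped.
    """

    def has_object(record: Dict[str, Any]) -> bool:
        return any(
            ann.get("iscrowd", 0) == 0 for ann in record.get("annotations", [])
        )

    return [record for record in dataset_dicts if has_object(record)]


records = [
    {"file_name": "a.jpg", "annotations": [{"bbox": [0, 0, 10, 10], "iscrowd": 0}]},
    {"file_name": "b.jpg", "annotations": []},  # background-only image
]
kept = drop_images_without_objects(records)
print([record["file_name"] for record in kept])
```

With filtering disabled, background-only images such as b.jpg stay in the training set, which matters if your dataset intentionally contains negative samples.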

3. RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54980 (errno: 98 - Address already in use).

This means that a process you started earlier did not exit cleanly and is still holding the port. There are two solutions:

  1. Kill the earlier process completely.
  2. Change the port by setting --dist-url:
python tools/train_net.py \
    --config-file path/to/config.py \
    --num-gpus 8 \
    --dist-url tcp://127.0.0.1:12345
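
If you would rather not guess a port number, one common approach is to ask the operating system for a free one. This is a minimal sketch using only the Python standard library; find_free_port is a hypothetical helper name, and the port could in principle be taken by another process between this check and the training launch.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


port = find_free_port()
# Build the flag to pass to tools/train_net.py
print(f"--dist-url tcp://127.0.0.1:{port}")
```
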
4. DINO CPU inference

Please refer to PR #157 for more details.

5. Training coco-like custom dataset

Please refer to PR #186 for more details.

@hg6185

hg6185 commented Jul 25, 2023

Hello,
I'm trying to install detrex on an HPC cluster with an NVIDIA V100. I managed to set CUDA_HOME to path/CUDA/11.8.0.

When I run pip install -e . again, I get the following warning and error:

warning: nvcc warning : incompatible redefinition for option 'std', the last value of this option was used (I think this relates to one argument -std=c++17)

error:
/.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(73): error: identifier "_castu32_f32" is undefined

/.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(89): error: identifier "_castf32_u32" is undefined

2 errors detected in the compilation of "/.../detrex/detrex/layers/csrc/DCNv3/dcnv3_cuda.cu".
error: command '.../software/CUDA/11.8.0/bin/nvcc' failed with exit code 2

Did you ever encounter this and do you know a fix?
My gcc is 11.3 and supports c++17
Thanks in advance

@rentainhe
Collaborator Author

rentainhe commented Jul 25, 2023

Hello @hg6185

It seems the dcn_v3 operator cannot be built in this environment. You can try one of these two options:

  • Search the InternImage repo for related issues to see whether others have hit the same problem.
  • Remove this operator if you do not need to benchmark your model on the InternImage backbone, then recompile detrex.

This is InternImage's official repo: https://github.com/OpenGVLab/InternImage

It seems they already provide a Python package for this operator: https://github.com/OpenGVLab/InternImage/releases/tag/whl_files

We will update detrex soon to remove the compilation step for this operator.

@hg6185

hg6185 commented Jul 25, 2023

Thanks for the quick reply @rentainhe!
Unfortunately, that's not the cause. I removed and reinstalled everything, including detectron2, which now fails to install with the same error.
It seems to be a problem with C++ includes in PyTorch.

@rentainhe
Collaborator Author

rentainhe commented Jul 26, 2023

I'm sorry to hear that. I suggest trying a lower PyTorch version to see whether that works around the issue. @hg6185

@hg6185

hg6185 commented Jul 26, 2023

Hi again @rentainhe ,
I found the problem. The GCC version was incompatible with CUDA. Note that you need a GCC version lower than 10.
In my case, everything works fine with CUDA 11.3.1 and GCC 9.4.0. Thanks again for the quick support!
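
A quick way to check this before building is to read the major version of the gcc on your PATH. The sketch below is illustrative only: the helper names are hypothetical, and the default threshold of 10 is simply the value reported in this comment for CUDA 11.3.1, not a general rule for every CUDA release.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_gcc_major(version_text: str) -> Optional[int]:
    """Extract the major version from `gcc -dumpversion` output, e.g. "9.4.0" -> 9."""
    match = re.match(r"\s*(\d+)", version_text)
    return int(match.group(1)) if match else None


def gcc_is_older_than(max_major_exclusive: int = 10) -> Optional[bool]:
    """Return True/False for the gcc on PATH, or None if it cannot be determined."""
    if shutil.which("gcc") is None:
        return None
    result = subprocess.run(
        ["gcc", "-dumpversion"], capture_output=True, text=True, check=False
    )
    major = parse_gcc_major(result.stdout)
    return None if major is None else major < max_major_exclusive


print("gcc older than 10:", gcc_is_older_than())
```
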

@rentainhe
Collaborator Author

Would you like to add this case to our FAQs here: #109 (comment)

@hg6185

hg6185 commented Jul 27, 2023

Hi @rentainhe ,

I can add this, but what do you mean? :D
Do you want me to write a comment that makes a little summary, so you can delete the rest?

@rentainhe
Collaborator Author

Yes, I was wondering whether it's better to add it somewhere or just keep our conversation here to help others who run into the same problem.

@hg6185

hg6185 commented Aug 1, 2023

hi @rentainhe
a summary of what fixed issue 1 for me: the 'latest' Detectron2 release requires a gcc version lower than 10.0.0. I am working on an HPC cluster where I can load different CUDA and GCC versions, which is practical in this case.

To build Detectron2 and Detrex, I used a miniconda env with CUDA 11.3.1 and gcc 9.4.0. I use Python 3.8, and PyTorch can be installed with this command (I post it here because it is an older version that you would otherwise have to search for):
conda install pytorch torchvision torchaudio pytorch-cuda=11.3 -c pytorch -c nvidia

Don't forget the NVIDIA toolkit that matches your version.
Note that some libraries, such as matplotlib, had to be downgraded to match the older gcc and Python versions.
In general, you will probably run into some issues along the way, but I managed to find a solution to all of them.

For instance, if you get an error with pycocotools, uninstall it with pip and install it with conda instead (from conda-forge).

@rentainhe
Collaborator Author

Thank you so much for summarizing this! It's really useful!
