Why does it take so long to start? #27

Closed
pengzhiliang opened this issue Oct 12, 2019 · 9 comments

Comments

@pengzhiliang

❓ Questions and Help

Hello~
When I start training RetinaNet with the default settings, the preparation phase is very slow!
The console output is as follows:

[10/12 14:51:43 detectron2]: Full config saved to output/detectron2/DEBUG/config.yaml
[10/12 14:51:43 d2.utils.env]: Using a generated random seed 43796016
[10/12 15:03:06 d2.engine.defaults]: Model:

From 14:51:43 to 15:03:06, training does not start.
Could you tell me why it takes so long?
Thank you very much!

@ppwwyyxx
Contributor

Please include details following the issue template

@pengzhiliang
Author

@ppwwyyxx OK.

I did not modify the config file, and just ran the following command:

DIR=output/detectron2/coco/Retinanet
CUDA_VISIBLE_DEVICES=4,5,6,7 python tools/train_net.py --num-gpus 4 --dist-url auto \
                            --config-file configs/COCO-Detection/retinanet_R_50_FPN_1x.yaml \
                            SOLVER.IMS_PER_BATCH 8 SOLVER.BASE_LR 0.005 \
                            MODEL.WEIGHTS models/R-50.pkl \
                            OUTPUT_DIR $DIR

Then, I didn't get any error, but it took a very long time to start.
The main output in the PyCharm console is as follows:

[10/12 14:51:43 detectron2]: Full config saved to output/detectron2/DEBUG/config.yaml
[10/12 14:51:43 d2.utils.env]: Using a generated random seed 43796016
[10/12 15:03:06 d2.engine.defaults]: Model:
RetinaNet(
  (backbone): FPN(
    (fpn_lateral3): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
    ........

Strangely, from 14:51:43 to 15:03:06, training did not start.

And my environment info:

---------------------  --------------------------------------------------
Python                 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Detectron2 Compiler    GCC 5.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0,1,2,3            GeForce RTX 2080 Ti
Pillow                 6.2.0
cv2                    4.1.1
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

In summary, no error occurred, but the preparation phase took a very long time!

Thank you!

@ppwwyyxx
Contributor

Your version of PyTorch is not built with pre-compiled code for your GPU architecture, so the CUDA kernels have to be JIT-compiled at startup. In that case everything will run very slowly the first time.

To resolve this you need to find a different build of PyTorch or build it yourself.
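
A quick way to check for this mismatch (a minimal sketch, not from this thread; it only relies on the standard torch.cuda.get_device_capability and torch.__config__.show calls):

# Compare the GPU's compute capability with the architectures PyTorch was built for.
# An RTX 2080 Ti reports (7, 5), while the NVCC flags in the environment dump above
# only include sm_35/sm_50, so the kernels must be JIT-compiled on first use.
python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.__config__.show())"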

@pengzhiliang
Author

OK, Thanks a lot!

@ppwwyyxx
Contributor

ppwwyyxx commented Oct 12, 2019

@soumith we've seen two reports about this issue. It seems like the PyTorch 1.3 + CUDA 10.1 package on PyPI is built with GPU code for architectures up to 7.5, while the package on conda only has GPU code up to 5.0.

To users: using pip install rather than conda install should help.
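
For reference, a rough sketch of that workaround (an assumption based on this comment, not an official instruction; the PyPI wheels of the time shipped kernels up to sm_75):

# Remove the conda-installed package, then install the PyPI wheel instead
conda remove pytorch
pip install torch torchvision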

@chenjoya

Sorry, I've hit the same problem here; it takes a very long time to start ... (PyTorch 1.3 + CUDA 10.1)

@soumith
Member

soumith commented Oct 12, 2019

looking at this issue with hi-pri and tracking it in pytorch/pytorch#27807

@soumith
Member

soumith commented Oct 12, 2019

This issue is now fixed with newly updated binaries.
Uninstalling and reinstalling PyTorch from Anaconda will fix it.
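
A sketch of that reinstall (the channel and cudatoolkit pin are assumptions based on the PyTorch 1.3 install instructions of that time, not taken from this thread):

# Reinstall the rebuilt conda binaries from the pytorch channel
conda uninstall pytorch torchvision
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch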

@chenjoya

Thank you!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 14, 2021
ppwwyyxx added a commit that referenced this issue Jan 2, 2022
Summary:
Resolves #27. Work in progress.
Pull Request resolved: fairinternal/detectron2#51

Reviewed By: rbgirshick

Differential Revision: D13544596

Pulled By: ppwwyyxx

fbshipit-source-id: 0d7a8fa2ecadb47d88502714a191642ba6e17531