Error during training #25
Comments
This has been addressed in recent commits.

Hi @theodupuis,
I'll look into a better solution for this one.
The temporary one is to set move_metrics_to_cpu=False, but I'm not really happy with that. If you encounter any memory leaks, set it to True and downgrade lightning for now.
Best,
Michael

Hi,
Thank you for your help!
Best regards
Théo
Hi,
One last question, if I may: now that the training is running, I reached epoch 3 overnight (500 images, 512x512x100). Is it normal that it takes so much time?
Hi,
the training time of nnDetection should be roughly equal for most data sets (there are some exceptions): 2 days with the mixed-precision 3D speedup and 4 days without. Your time sounds quite slow though. Generally speaking, there could be two reasons:
1. PyTorch < 1.9 did not provide a training speedup for mixed-precision 3D convs in its pip-installable version, and it was necessary to build it from source. I haven't tested PyTorch 1.9 yet. (The Docker build of nnDetection also provides the speedup.)
2. There is a bottleneck in your configuration / setup. This can be identified as follows:
Check the GPU util: it should be high most of the time. If it isn't, there is either a CPU or an IO bottleneck. If it is high, the cause is the missing PyTorch speedup.
Check the CPU util: if the CPU util is high (and the GPU util isn't), more CPU threads are needed for augmentation (this can be adjusted via det_num_threads and depends on your CPU).
If GPU and CPU util are both low, it is an IO bottleneck, and it is quite hard to do anything about this (a typical SSD with ~500 MB/s read speed ran fine for my experiments).
Best,
Michael

Hi,
Thank you for all your answers. One last question and then I'll stop bothering you: if I understand the algorithm correctly, you train several models with different parameters to choose the so-called "empirical parameters", right? Hence the 5 days of training. If that is true, how many models are created during the training phase?
Best regards
Théo
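As a side note on the utilisation checks in Michael's reply above, the decision rule can be written down as a small function. This is only a sketch: the function name `diagnose_bottleneck` and the 80% cut-off for "high" are assumptions made for illustration (the thread gives no numeric threshold), and the utilisation values are expected to be averages you read off yourself, for example from nvidia-smi and top.

```python
def diagnose_bottleneck(gpu_util: float, cpu_util: float, high: float = 80.0) -> str:
    """Map average GPU/CPU utilisation (in percent) to the likely cause of slow training.

    The 80% default for "high" is an assumed value, not one taken from nnDetection.
    """
    for value in (gpu_util, cpu_util, high):
        if not 0.0 <= value <= 100.0:
            raise ValueError("utilisation values must be percentages in [0, 100]")

    if gpu_util >= high:
        # GPU is busy most of the time: no CPU/IO bottleneck, so the slowdown
        # points to the missing mixed-precision 3D conv speedup in PyTorch.
        return "missing mixed-precision speedup"
    if cpu_util >= high:
        # GPU is waiting while the CPU is saturated: augmentation needs more
        # threads (det_num_threads in nnDetection).
        return "cpu bottleneck"
    # Both are idle: data is not arriving fast enough from disk.
    return "io bottleneck"


if __name__ == "__main__":
    print(diagnose_bottleneck(gpu_util=35.0, cpu_util=95.0))
```

The order of the checks mirrors the reply: GPU utilisation is inspected first, and CPU utilisation only matters once the GPU is known to be underused.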
Hi @theodupuis, during the training process, only a single model is trained. The empirical parameters refer to several postprocessing parameters (e.g. the IoU threshold for NMS and the IoU threshold for Weighted Box Clustering), which do not require additional models (it is not a classical AutoML approach where models are trained several times). Those parameters are optimized by empirically trying them on the validation data.
Best,
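To make "empirically trying them on the validation data" concrete, here is a minimal sketch of sweeping one postprocessing parameter, the NMS IoU threshold, over the fixed predictions of a single trained model. Everything in it is an assumption for illustration: it uses 2D boxes for brevity, F1 at a matching IoU of 0.5 as a stand-in metric, and hypothetical helper names (`iou`, `nms`, `f1_score`, `sweep_nms_threshold`). It is not nnDetection's actual sweep code, metric, or parameter grid.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thr: float) -> List[int]:
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box by more than iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


def f1_score(pred: Sequence[Box], gt: Sequence[Box], match_iou: float = 0.5) -> float:
    """F1 with greedy one-to-one matching of predictions to ground-truth boxes."""
    unmatched = list(gt)
    tp = 0
    for p in pred:
        hit = next((g for g in unmatched if iou(p, g) >= match_iou), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp, fn = len(pred) - tp, len(unmatched)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def sweep_nms_threshold(boxes, scores, gt, candidates) -> Tuple[float, float]:
    """Score every candidate threshold on the same predictions; no retraining involved."""
    results = [
        (f1_score([boxes[i] for i in nms(boxes, scores, t)], gt), t)
        for t in candidates
    ]
    best_f1, best_t = max(results, key=lambda r: r[0])  # first best candidate wins ties
    return best_t, best_f1


if __name__ == "__main__":
    val_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    val_scores = [0.9, 0.8, 0.7]
    val_gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(sweep_nms_threshold(val_boxes, val_scores, val_gt, [0.1, 0.5, 0.9]))
```

The point of the sketch is the cost structure: the model's predictions are computed once, and each candidate value only reruns the cheap postprocessing and evaluation steps.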
Bug : During the training phase
File "/anaconda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 161, in scale
assert outputs.is_cuda or outputs.device.type == 'xla'
AssertionError
Exception ignored in: <function tqdm.__del__ at 0x7f9ba338de50>
Traceback (most recent call last):
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1145, in __del__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1299, in close
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1492, in display
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1148, in __str__
File "/anaconda/lib/python3.8/site-packages/tqdm/std.py", line 1450, in format_dict
TypeError: cannot unpack non-iterable NoneType object
Environment
Please provide some information about the used environment.
Environment set up from source, not Docker
Cmd : nndet_train 1000 --sweep
It seems the issue is related to the fact that the TensorMetric is not moved to the CUDA device. This is the same issue as the one addressed in Lightning-AI/pytorch-lightning#2274.