NaN losses during training! #86
Comments
Did you try testing? Did you get the same number? |
It doesn't go through the testing phase! After all the losses become NaN, it finishes like this:
|
No, I mean did you try testing with the pre-trained model I released? |
Yes, and it worked and showed the detected bounding boxes correctly |
So you can get the same number, 78.7? |
I'm getting these results for pascal_voc2007 trainval with vgg16 |
Hmm, this is right.. it may be the case that the 980 is not big enough to support GPU NMS and a 256 batch size during training; you may need some way to get around that |
Do you think disabling GPU NMS will help? How do I do that? |
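For anyone wanting to try this concretely: in configs descended from py-faster-rcnn, the NMS backend is usually selected by a single flag. A minimal sketch, assuming this repo keeps the USE_GPU_NMS name from that lineage (check lib/model/config.py for the actual flag):

    # Minimal sketch: force the CPU NMS path instead of the CUDA kernel.
    # Assumes a py-faster-rcnn-style config with a USE_GPU_NMS flag.
    from model.config import cfg

    cfg.USE_GPU_NMS = False  # nms() wrappers should then dispatch to the CPU implementation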
@endernewton The person in issue #8 also has the same problem, and she's using a K40! |
@amirhfarzaneh I guess she figured it out later and the error was not NaN in training |
@endernewton Could you please share your log files for training? Especially for the voc_2007_trainval dataset with the vgg16 architecture? I think this will be useful to others too. This way we can compare some statistics while we're training, like what the loss numbers should look like. Thank you in advance! |
@amirhfarzaneh the original one is lost. Let me see if I can retrain to get a similar log. |
I have put up a log file at http://gs11655.sp.cs.cmu.edu/xinleic/tf-faster-rcnn/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-14_19-26-27 |
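If it helps with the comparison, the total-loss curve can be pulled out of a training log with a short script. This is only a sketch; the regex assumes log lines roughly of the form "iter: N / M, total loss: X", which may not match this repo's logger exactly:

    # Sketch: extract (iteration, total loss) pairs from a training log file.
    # The line format matched below is an assumption; adjust the regex as needed.
    import re
    import sys

    pattern = re.compile(r'iter:\s*(\d+)\s*/\s*\d+.*?total loss:\s*([-+0-9.eEnaN]+)')

    with open(sys.argv[1]) as log:
        for line in log:
            match = pattern.search(line)
            if match:
                print(match.group(1), match.group(2))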
@endernewton the link to the log file you posted appears to be broken. I just ran the res101 model with gpu_nms and with cpu_nms. gpu_nms gave me NaNs during training; cpu_nms gave me the expected results. I am using one Titan Xp (compute capability 6.1) and configured the setup.py with 'sm_61', following the README. Is this expected behavior? Perhaps the OP would get expected results if they used the cpu_nms... |
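For others comparing notes: the compute-capability setting is the nvcc '-arch' option in lib/setup.py. A rough, approximate fragment (not the literal file contents); sm_52 targets Maxwell cards like the 980/980Ti, while sm_61 targets Pascal cards like the GTX 1080 and Titan Xp:

    # lib/setup.py (approximate fragment): the nvcc architecture flag must
    # match the GPU's compute capability, e.g. sm_52 for Maxwell (GTX 980/980Ti)
    # or sm_61 for Pascal (GTX 1080 / Titan Xp).
    extra_compile_args = {
        'gcc':  ['-Wno-unused-function'],
        'nvcc': ['-arch=sm_61',            # change this to match your card
                 '--ptxas-options=-v',
                 '-c',
                 '--compiler-options', "'-fPIC'"],
    }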
Wow, nice! On my side I am actually using -arch=sm_52 for both Pascal and non-Pascal GPUs, just another data point for making it work.
The web server is not stable for some reason. I can move that log to Google Drive later.
|
@endernewton I re-ran the res101 model with gpu_nms and configured the setup.py with 'sm_52'. No NaNs, but I only got 0.65 mAP. I am going to re-run and see what the variance is. |
No, 0.65 is too low.. hmm, then this problem is still hidden. Did you do testing with the provided models? What mAP did you get?
|
@dancsalo Maybe it's because you have the Xp. The code needs some modifications to work on more recent GPUs, I guess. I haven't got access to such GPUs yet, so I cannot help much. |
It seems like the NaN problem occurs only on some GPUs. I have a GTX 980Ti and NaN happens. I have tested the code on a Quadro M4000 and a GTX 1080, and NaNs don't appear and the training goes as it should! This is my log file on a 1080Ti: https://drive.google.com/file/d/0Bz-CTQRw0GZCeTNrcjZ0OFVXRWs/view?usp=sharing |
@amirhfarzaneh Hello, my GPU is a Tesla K40c. I also hit the NaN problem; do you know how to fix it? |
Hi, can anyone tell me why running tf_train_faster_rcnn.sh gives exactly the same effect and result as tf_test_faster_rcnn.sh? It means the train shell script didn't work at all. Thanks much |
I had this error, and the fix for me was in my XML annotation files: some were empty, and some bboxes had negative values. After eliminating them, the error disappeared. |
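Following up on this: a quick way to find such files is to scan the VOC-style Annotations directory for empty XMLs and for boxes with non-positive or inverted coordinates. A minimal sketch (the path and the exact validity rules are assumptions; VOC coordinates are 1-based):

    # Sketch: flag empty or malformed VOC-style XML annotation files.
    import os
    import xml.etree.ElementTree as ET

    ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'  # adjust to your layout

    for name in sorted(os.listdir(ann_dir)):
        path = os.path.join(ann_dir, name)
        if os.path.getsize(path) == 0:
            print('EMPTY FILE:', name)
            continue
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as err:
            print('PARSE ERROR:', name, err)
            continue
        objects = root.findall('object')
        if not objects:
            print('NO OBJECTS:', name)
        for obj in objects:
            box = obj.find('bndbox')
            xmin, ymin, xmax, ymax = (float(box.find(tag).text)
                                      for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
            if xmin < 1 or ymin < 1 or xmax <= xmin or ymax <= ymin:
                print('BAD BOX:', name, (xmin, ymin, xmax, ymax))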
I had this error, too. My pictures are 1280*960; is that too big? Does it matter? Will your Python code resize them? |
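On the resizing question: pipelines in this family rescale each image so that the shorter side matches TRAIN.SCALES (600 by default) with the longer side capped at TRAIN.MAX_SIZE (1000 by default), so a 1280*960 input is shrunk before training. A small sketch of that scaling rule (the default values are taken from the py-faster-rcnn-style config and are an assumption about this repo's settings):

    # Sketch of the short-side / long-side scaling rule; a 1280x960 image
    # would be resized to 800x600 under the assumed defaults.
    target_size, max_size = 600, 1000        # assumed TRAIN.SCALES[0], TRAIN.MAX_SIZE
    width, height = 1280, 960

    im_scale = float(target_size) / min(width, height)
    if round(im_scale * max(width, height)) > max_size:
        im_scale = float(max_size) / max(width, height)

    print('scale %.3f -> %dx%d' % (im_scale, round(width * im_scale), round(height * im_scale)))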
Hello, did you fix it? |
|
I'm following the exact same instructions for training, but during training with the command
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16
those errors show up, and from there the losses become NaN! I have changed nothing in the files!
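One practical debugging step, independent of the GPU NMS question, is to make the training loop fail fast and report which loss term goes bad first. A hedged sketch of such a check (the loss names below are hypothetical and would need to match whatever the training step actually returns):

    # Sketch: abort as soon as any loss term becomes non-finite so the
    # offending iteration and term are easy to identify.
    import numpy as np

    def check_losses(losses, iteration):
        """losses: dict mapping loss name -> float from the current train step."""
        for name, value in losses.items():
            if not np.isfinite(value):
                raise RuntimeError('iter %d: loss %s is %r' % (iteration, name, value))

    # Hypothetical usage inside the training loop:
    # check_losses({'rpn_cls': rpn_loss_cls, 'rpn_box': rpn_loss_box,
    #               'cls': loss_cls, 'box': loss_box, 'total': total_loss}, it)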