NaN losses during training! #86

Closed
amirhfarzaneh opened this issue May 12, 2017 · 26 comments

@amirhfarzaneh

I'm following the exact same instructions for training, but the losses become NaN when I train with this command:
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16

+ set -e
+ export PYTHONUNBUFFERED=True
+ PYTHONUNBUFFERED=True
+ GPU_ID=0
+ DATASET=pascal_voc
+ NET=vgg16
+ array=($@)
+ len=3
+ EXTRA_ARGS=
+ EXTRA_ARGS_SLUG=
+ case ${DATASET} in
+ TRAIN_IMDB=voc_2007_trainval
+ TEST_IMDB=voc_2007_test
+ STEPSIZE=50000
+ ITERS=70000
+ ANCHORS='[8,16,32]'
+ RATIOS='[0.5,1,2]'
++ date +%Y-%m-%d_%H-%M-%S
+ LOG=experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ exec
++ tee -a experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ echo Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
Logging output to experiments/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-11_18-12-08
+ set +x
+ '[' '!' -f output/vgg16/voc_2007_trainval/default/vgg16_faster_rcnn_iter_70000.ckpt.index ']'
+ [[ ! -z '' ]]
+ CUDA_VISIBLE_DEVICES=0
+ time python ./tools/trainval_net.py --weight data/imagenet_weights/vgg16.ckpt --imdb voc_2007_trainval --imdbval voc_2007_test --iters 70000 --cfg experiments/cfgs/vgg16.yml --net vgg16 --set ANCHOR_SCALES '[8,16,32]' ANCHOR_RATIOS '[0.5,1,2]' TRAIN.STEPSIZE 50000
Called with args:
Namespace(cfg_file='experiments/cfgs/vgg16.yml', imdb_name='voc_2007_trainval', imdbval_name='voc_2007_test', max_iters=70000, net='vgg16', set_cfgs=['ANCHOR_SCALES', '[8,16,32]', 'ANCHOR_RATIOS', '[0.5,1,2]', 'TRAIN.STEPSIZE', '50000'], tag=None, weight='data/imagenet_weights/vgg16.ckpt')
Using config:
{'ANCHOR_RATIOS': [0.5, 1, 2],
 'ANCHOR_SCALES': [8, 16, 32],
 'DATA_DIR': '/home/amirhf/Projects/tf-faster-rcnn/data',
 'DEDUP_BOXES': 0.0625,
 'EPS': 1e-14,
 'EXP_DIR': 'vgg16',
 'GPU_ID': 0,
 'MATLAB': 'matlab',
 'PIXEL_MEANS': array([[[ 102.9801,  115.9465,  122.7717]]]),
 'POOLING_MODE': 'crop',
 'POOLING_SIZE': 7,
 'RESNET': {'BN_TRAIN': False, 'FIXED_BLOCKS': 1, 'MAX_POOL': False},
 'RNG_SEED': 3,
 'ROOT_DIR': '/home/amirhf/Projects/tf-faster-rcnn',
 'TEST': {'BBOX_REG': True,
          'HAS_RPN': True,
          'MAX_SIZE': 1000,
          'MODE': 'nms',
          'NMS': 0.3,
          'PROPOSAL_METHOD': 'gt',
          'RPN_NMS_THRESH': 0.7,
          'RPN_POST_NMS_TOP_N': 300,
          'RPN_PRE_NMS_TOP_N': 6000,
          'RPN_TOP_N': 5000,
          'SCALES': [600],
          'SVM': False},
 'TRAIN': {'ASPECT_GROUPING': False,
           'BATCH_SIZE': 256,
           'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
           'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
           'BBOX_NORMALIZE_TARGETS': True,
           'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': True,
           'BBOX_REG': True,
           'BBOX_THRESH': 0.5,
           'BG_THRESH_HI': 0.5,
           'BG_THRESH_LO': 0.0,
           'BIAS_DECAY': False,
           'DISPLAY': 20,
           'DOUBLE_BIAS': True,
           'FG_FRACTION': 0.25,
           'FG_THRESH': 0.5,
           'GAMMA': 0.1,
           'HAS_RPN': True,
           'IMS_PER_BATCH': 1,
           'LEARNING_RATE': 0.001,
           'MAX_SIZE': 1000,
           'MOMENTUM': 0.9,
           'PROPOSAL_METHOD': 'gt',
           'RPN_BATCHSIZE': 256,
           'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
           'RPN_CLOBBER_POSITIVES': False,
           'RPN_FG_FRACTION': 0.5,
           'RPN_NEGATIVE_OVERLAP': 0.3,
           'RPN_NMS_THRESH': 0.7,
           'RPN_POSITIVE_OVERLAP': 0.7,
           'RPN_POSITIVE_WEIGHT': -1.0,
           'RPN_POST_NMS_TOP_N': 2000,
           'RPN_PRE_NMS_TOP_N': 12000,
           'SCALES': [600],
           'SNAPSHOT_ITERS': 5000,
           'SNAPSHOT_KEPT': 3,
           'SNAPSHOT_PREFIX': 'vgg16_faster_rcnn',
           'STEPSIZE': 50000,
           'SUMMARY_INTERVAL': 180,
           'TRUNCATED': False,
           'USE_ALL_GT': True,
           'USE_FLIPPED': True,
           'USE_GT': False,
           'WEIGHT_DECAY': 0.0005},
 'USE_GPU_NMS': True}
Loaded dataset `voc_2007_trainval` for training
Set proposal method: gt
Appending horizontally-flipped training examples...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
done
Preparing training data...
done
10022 roidb entries
Output will be saved to `/home/amirhf/Projects/tf-faster-rcnn/output/vgg16/voc_2007_trainval/default`
TensorFlow summaries will be saved to `/home/amirhf/Projects/tf-faster-rcnn/tensorboard/vgg16/voc_2007_trainval/default`
Loaded dataset `voc_2007_test` for training
Set proposal method: gt
Preparing training data...
wrote gt roidb to /home/amirhf/Projects/tf-faster-rcnn/data/cache/voc_2007_test_gt_roidb.pkl
done
4952 validation roidb entries
Filtered 0 roidb entries: 10022 -> 10022
Filtered 0 roidb entries: 4952 -> 4952
2017-05-11 18:12:37.107319: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107344: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.107350: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-11 18:12:37.404484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties: 
name: GeForce GTX 980 Ti
major: 5 minor: 2 memoryClockRate (GHz) 1.291
pciBusID 0000:01:00.0
Total memory: 5.93GiB
Free memory: 5.27GiB
2017-05-11 18:12:37.404517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0 
2017-05-11 18:12:37.404523: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y 
2017-05-11 18:12:37.404537: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Solving...
/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Loading initial model weights from data/imagenet_weights/vgg16.ckpt
Varibles restored: vgg_16/conv1/conv1_1/biases:0
Varibles restored: vgg_16/conv1/conv1_2/weights:0
Varibles restored: vgg_16/conv1/conv1_2/biases:0
Varibles restored: vgg_16/conv2/conv2_1/weights:0
Varibles restored: vgg_16/conv2/conv2_1/biases:0
Varibles restored: vgg_16/conv2/conv2_2/weights:0
Varibles restored: vgg_16/conv2/conv2_2/biases:0
Varibles restored: vgg_16/conv3/conv3_1/weights:0
Varibles restored: vgg_16/conv3/conv3_1/biases:0
Varibles restored: vgg_16/conv3/conv3_2/weights:0
Varibles restored: vgg_16/conv3/conv3_2/biases:0
Varibles restored: vgg_16/conv3/conv3_3/weights:0
Varibles restored: vgg_16/conv3/conv3_3/biases:0
Varibles restored: vgg_16/conv4/conv4_1/weights:0
Varibles restored: vgg_16/conv4/conv4_1/biases:0
Varibles restored: vgg_16/conv4/conv4_2/weights:0
Varibles restored: vgg_16/conv4/conv4_2/biases:0
Varibles restored: vgg_16/conv4/conv4_3/weights:0
Varibles restored: vgg_16/conv4/conv4_3/biases:0
Varibles restored: vgg_16/conv5/conv5_1/weights:0
Varibles restored: vgg_16/conv5/conv5_1/biases:0
Varibles restored: vgg_16/conv5/conv5_2/weights:0
Varibles restored: vgg_16/conv5/conv5_2/biases:0
Varibles restored: vgg_16/conv5/conv5_3/weights:0
Varibles restored: vgg_16/conv5/conv5_3/biases:0
Varibles restored: vgg_16/fc6/biases:0
Varibles restored: vgg_16/fc7/biases:0
Loaded.
Fix VGG16 layers..
iter: 20 / 70000, total loss: 1.780578
 >>> rpn_loss_cls: 0.331266
 >>> rpn_loss_box: 0.058807
 >>> loss_cls: 0.851354
 >>> loss_box: 0.539151
 >>> lr: 0.001000
speed: 0.908s / iter
iter: 40 / 70000, total loss: 0.701749
 >>> rpn_loss_cls: 0.551406
 >>> rpn_loss_box: 0.128653
 >>> loss_cls: 0.021690
 >>> loss_box: 0.000000
 >>> lr: 0.001000
.
.  [REMOVED LINES TO MAKE THE POST SHORTER]
.
.
iter: 3380 / 70000, total loss: 0.616202
 >>> rpn_loss_cls: 0.100265
 >>> rpn_loss_box: 0.145635
 >>> loss_cls: 0.185931
 >>> loss_box: 0.184371
 >>> lr: 0.001000
speed: 0.433s / iter
iter: 3400 / 70000, total loss: 1.312786
 >>> rpn_loss_cls: 0.295694
 >>> rpn_loss_box: 0.017820
 >>> loss_cls: 0.452280
 >>> loss_box: 0.546992
 >>> lr: 0.001000
speed: 0.432s / iter
iter: 3420 / 70000, total loss: 0.642559
 >>> rpn_loss_cls: 0.132440
 >>> rpn_loss_box: 0.039820
 >>> loss_cls: 0.293447
 >>> loss_box: 0.176852
 >>> lr: 0.001000
speed: 0.431s / iter
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:56: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:58: RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:60: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/bbox_transform.py:62: RuntimeWarning: invalid value encountered in add
  pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
iter: 3440 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000

These warnings appear:

RuntimeWarning: invalid value encountered in subtract
  pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w

and from that point on, all the losses become NaN! I have not changed anything in the files!
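
For context, NumPy raises this particular RuntimeWarning when an arithmetic operation yields NaN from inputs that were not NaN, for example infinity minus infinity. Here is a minimal sketch, assuming the predicted centre and width had both already overflowed to infinity. That is an assumption for illustration; the log above does not show the actual values.

```python
import warnings
import numpy as np

# Hypothetical values: assume the predicted centre and width have overflowed.
pred_ctr_x = np.array([np.inf])
pred_w = np.array([np.inf])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same arithmetic as the line quoted in the warning: inf - 0.5 * inf is NaN.
    x1 = pred_ctr_x - 0.5 * pred_w

print(x1)  # [nan]
for w in caught:
    print(w.category.__name__, w.message)
```

If this is what happens, the box deltas have diverged before the warning is printed, so the warning would be a symptom rather than the root cause.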

@endernewton
Owner

Did you try testing? Did you get the same number?

@amirhfarzaneh
Author

It never reaches the testing phase! After all the losses become NaN, it finishes like this:

iter: 3760 / 70000, total loss: nan
 >>> rpn_loss_cls: nan
 >>> rpn_loss_box: nan
 >>> loss_cls: nan
 >>> loss_box: nan
 >>> lr: 0.001000
speed: 0.430s / iter
2017-05-11 18:39:44.836200: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.836202: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.837950: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838161: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838203: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838346: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838614: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838676: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838770: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.838976: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
2017-05-11 18:39:44.918997: W tensorflow/core/framework/op_kernel.cc:1152] Invalid argument: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
Traceback (most recent call last):
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 381, in train_net
    sw.train_model(sess, max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 270, in train_model
    self.net.train_step_with_summary(sess, blobs, train_op)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 387, in train_step_with_summary
    feed_dict=feed_dict)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 982, in _run
    feed_dict_string, options, run_metadata)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1032, in _do_run
    target_list, options, run_metadata)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1052, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
	 [[Node: gradients/loss_default/mul_grad/Shape/_313 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1614_gradients/loss_default/mul_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op u'TRAIN/vgg_16/conv5/conv5_2/biases', defined at:
  File "./tools/trainval_net.py", line 136, in <module>
    max_iters=args.max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 381, in train_net
    sw.train_model(sess, max_iters)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/model/train_val.py", line 105, in train_model
    anchor_ratios=cfg.ANCHOR_RATIOS)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 332, in create_architecture
    self._add_train_summary(var)
  File "/home/amirhf/Projects/tf-faster-rcnn/tools/../lib/nets/network.py", line 71, in _add_train_summary
    tf.summary.histogram('TRAIN/' + var.op.name, var)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/summary/summary.py", line 209, in histogram
    tag=scope.rstrip('/'), values=values, name=scope)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 139, in _histogram_summary
    name=name)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/amirhf/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): Nan in summary histogram for: TRAIN/vgg_16/conv5/conv5_2/biases
	 [[Node: TRAIN/vgg_16/conv5/conv5_2/biases = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](TRAIN/vgg_16/conv5/conv5_2/biases/tag, vgg_16/conv5/conv5_2/biases/read/_257)]]
	 [[Node: gradients/loss_default/mul_grad/Shape/_313 = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1614_gradients/loss_default/mul_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Command exited with non-zero status 1
1316.02user 363.43system 27:37.30elapsed 101%CPU (0avgtext+0avgdata 3572892maxresident)k
202408inputs+33336outputs (13major+5238801minor)pagefaults 0swaps

@endernewton
Owner

No, I mean did you try testing with the pre-trained model I released?

@amirhfarzaneh
Author

Yes, and it worked and showed the detected bounding boxes correctly

@endernewton
Owner

So can you reproduce the same number (78.7)?

@amirhfarzaneh
Author

amirhfarzaneh commented May 12, 2017

I'm getting these results for pascal_voc2007 trainval with vgg16
AP for aeroplane = 0.6895
AP for bicycle = 0.7835
AP for bird = 0.6753
AP for boat = 0.5338
AP for bottle = 0.5864
AP for bus = 0.7863
AP for car = 0.8411
AP for cat = 0.8395
AP for chair = 0.4778
AP for cow = 0.8139
AP for diningtable = 0.6685
AP for dog = 0.8073
AP for horse = 0.8407
AP for motorbike = 0.7558
AP for person = 0.7715
AP for pottedplant = 0.4624
AP for sheep = 0.7073
AP for sofa = 0.6700
AP for train = 0.7418
AP for tvmonitor = 0.7315
Mean AP = 0.7092

@endernewton
Owner

Hmm, this is right. It may be that the 980 is not big enough to support GPU NMS and a batch size of 256 during training; you may need to find some way around that.

@amirhfarzaneh
Author

Do you think disabling GPU NMS will help? How do I do that?
There are two batch sizes, if I'm not mistaken! Should I change those to 128? What are the names of the variables for the batch sizes?
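
For reference, the config dump earlier in this thread lists `USE_GPU_NMS`, `TRAIN.BATCH_SIZE` and `TRAIN.RPN_BATCHSIZE`, and the training command passes overrides as `--set KEY VALUE` pairs. Below is a minimal sketch of how such pairs could be applied to a nested config. `apply_overrides` is a hypothetical helper written for illustration, not the repository's code, and 128 is only the value floated above.

```python
import ast

def apply_overrides(cfg, pairs):
    """Apply a flat [key, value, key, value, ...] list to a nested dict.

    Hypothetical helper for illustration only. Dotted keys such as
    'TRAIN.BATCH_SIZE' address nested dictionaries.
    """
    if len(pairs) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(pairs[0::2], pairs[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(key)
        # Turn strings such as '128', 'False' or '[8,16,32]' into Python values.
        node[leaf] = ast.literal_eval(raw)
    return cfg

# Subset of the config printed in the training log above.
cfg = {"USE_GPU_NMS": True, "TRAIN": {"BATCH_SIZE": 256, "RPN_BATCHSIZE": 256}}

apply_overrides(cfg, ["USE_GPU_NMS", "False",
                      "TRAIN.BATCH_SIZE", "128",
                      "TRAIN.RPN_BATCHSIZE", "128"])
print(cfg)
```

Whether these particular overrides avoid the NaNs is not established anywhere in this thread.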

@amirhfarzaneh
Author

@endernewton The person in issue #8 has the same problem, and she's using a K40!

@endernewton
Owner

@amirhfarzaneh I guess she figured it out later, and her error was not NaN during training.

@amirhfarzaneh
Author

amirhfarzaneh commented May 13, 2017

@endernewton Could you please share your training log files, especially for the voc_2007_trainval dataset with the vgg16 architecture? I think this would be useful to others too. That way we can compare statistics while training, such as what the loss values should look like! Thank you in advance.

@endernewton
Owner

@amirhfarzaneh the original one is lost. Let me see if I can retrain to get a similar log.

@endernewton
Owner

@dancsalo

@endernewton the link to the log file you posted appears to be broken.

I just ran the res101 model with gpu_nms and with cpu_nms. gpu_nms gave me NaNs during training; cpu_nms gave me the expected results. I am using one Titan Xp (compute capability 6.1) and configured setup.py with 'sm_61', following the README. Is this expected behavior?

Perhaps the OP would get expected results if they used the cpu_nms...
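
For readers unfamiliar with the step being swapped here, non-maximum suppression can be written in plain NumPy. The sketch below is a generic greedy NMS, not necessarily identical to the repository's cpu_nms. The function name `nms_numpy` and the pixel-inclusive `+ 1` box convention are assumptions.

```python
import numpy as np

def nms_numpy(dets, thresh):
    """Greedy NMS. dets is an (N, 5) array of [x1, y1, x2, y2, score]."""
    x1, y1, x2, y2, scores = (dets[:, i] for i in range(5))
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current box with every remaining box.
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        # Keep only boxes that do not overlap the current one too much.
        order = rest[iou <= thresh]
    return keep

dets = np.array([[0, 0, 10, 10, 0.9],
                 [1, 1, 11, 11, 0.8],
                 [50, 50, 60, 60, 0.7]], dtype=float)
print(nms_numpy(dets, 0.3))  # [0, 2]
```

Comparing the indices kept by a CPU version like this against the GPU kernel on the same proposals is one way to check whether the compiled kernel misbehaves on a given card.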

@endernewton
Owner

endernewton commented May 21, 2017 via email

@dancsalo

@endernewton I re-ran the res101 model with gpu_nms and configured the setup.py with 'sm_52'. No NaNs, but I only got 0.65 mAP. I am going to re-run and see what the variance is.

@endernewton
Owner

endernewton commented May 22, 2017 via email

@endernewton
Owner

@dancsalo Maybe it's because you have the Xp. I guess the code needs some modifications to work on more recent GPUs. I don't have access to such GPUs yet, so I cannot help much.

@amirhfarzaneh
Author

amirhfarzaneh commented May 22, 2017

It seems the NaN problem occurs only on some GPUs. On my GTX 980 Ti the NaNs happen. I have also tested the code on a Quadro M4000 and a GTX 1080, and there the NaNs don't appear and training goes as it should! This is my log file on a 1080Ti: https://drive.google.com/file/d/0Bz-CTQRw0GZCeTNrcjZ0OFVXRWs/view?usp=sharing

@zdm123

zdm123 commented Sep 28, 2017

@amirhfarzaneh Hello, my GPU is a Tesla K40c. I also hit the NaN problem. Do you know how to fix it?

@aaa135511

Hi, can anyone tell me why running tf_train_faster_rcnn.sh has the same effect and result as running tf_test_faster_rcnn.sh? It means the training script didn't work at all. Thanks very much.

@nassarofficial

I had this error, and the only thing that fixed it was cleaning up my XML annotation files: some were empty, and some bounding boxes had negative values. After removing those, the error disappeared.
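
A check along these lines could be scripted. The sketch below assumes Pascal VOC style XML (`object` / `bndbox` / `xmin` ... `ymax`). `annotation_problems` is a hypothetical helper, not part of the repository.

```python
import xml.etree.ElementTree as ET

def annotation_problems(xml_text):
    """Return a list of problems found in one Pascal VOC style annotation."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return ["empty or unparseable file"]

    objects = root.findall("object")
    if not objects:
        return ["no <object> entries"]

    problems = []
    for obj in objects:
        box = obj.find("bndbox")
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (AttributeError, TypeError, ValueError):
            problems.append("missing or non-numeric bndbox")
            continue
        if min(xmin, ymin, xmax, ymax) < 0:
            problems.append("negative coordinate")
        if xmax <= xmin or ymax <= ymin:
            problems.append("box with non-positive width or height")
    return problems

good = ("<annotation><object><bndbox><xmin>10</xmin><ymin>20</ymin>"
        "<xmax>110</xmax><ymax>220</ymax></bndbox></object></annotation>")
bad = good.replace("<xmin>10</xmin>", "<xmin>-5</xmin>")

print(annotation_problems(good))  # []
print(annotation_problems(bad))   # ['negative coordinate']
print(annotation_problems(""))    # ['empty or unparseable file']
```

To scan a dataset, read each annotation file and pass its text to the function; any image with a non-empty result is a candidate for fixing or removal.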

@guojiapeng00

I had this error too, and today I fixed it!
I found that lots of boxes extended outside my pictures.
For example, my pictures are 600*600, but there was a box (550,550,650,650).
When I deleted those pictures from trainval.txt, it worked!
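
The filtering described above can be sketched as follows. The image names, the dictionaries and `box_inside_image` are hypothetical. Whether the upper bound should be `<= width` or `< width` depends on the coordinate convention of the annotations, so treat that choice as an assumption.

```python
def box_inside_image(box, width, height):
    """True if (x1, y1, x2, y2) is non-degenerate and lies within the image."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height

# Hypothetical data: image name -> (width, height) and image name -> boxes.
sizes = {"img_a": (600, 600), "img_b": (600, 600)}
boxes = {
    "img_a": [(10, 10, 200, 200)],
    "img_b": [(550, 550, 650, 650)],  # the example from the comment above
}

# Keep only images whose boxes all fit; these names would stay in trainval.txt.
keep = [name for name in sorted(boxes)
        if all(box_inside_image(b, *sizes[name]) for b in boxes[name])]
print(keep)  # ['img_a']
```

Clipping the offending boxes to the image bounds instead of dropping the whole image is another option, at the cost of slightly altered ground truth.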

@henbucuoshanghai

My pictures are 1280*960. Is that too big? Does it matter? Will your Python code resize them?

@mengce97

mengce97 commented Jul 1, 2019

Hello, did you fix it?
I hit the same error and have tried all day without success.
If you know why it happens, please tell me; I would really appreciate it!

@nassarofficial

> Hello, did you fix it?
> I hit the same error and have tried all day without success.
> If you know why it happens, please tell me; I would really appreciate it!

I had this error, and the only thing that fixed it was cleaning up my XML annotation files: some were empty, and some bounding boxes had negative values. After removing those, the error disappeared.
