NaN losses during training! #86
Comments
Did you try testing? Did you get the same number? |
It doesn't go through the testing phase! After all the losses become NaN, it finishes like this:
|
No, I mean did you try testing with the pre-trained model I released? |
Yes, and it worked and showed the detected bounding boxes correctly |
So you can get the same number, 78.7? |
I'm getting these results for pascal_voc2007 trainval with vgg16 |
Hmm, this is right.. it may be the case that the 980 is not big enough to support GPU NMS and a 256 batch size during training; you may need some way to get around that |
Do you think disabling GPU NMS will help? How do I do that? |
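For anyone wanting to try this concretely: in configs descended from py-faster-rcnn, the NMS backend is usually selected by a single flag. A minimal sketch, assuming this repo keeps the USE_GPU_NMS name from that lineage (check lib/model/config.py for the actual flag):

    # Minimal sketch: force the CPU NMS path instead of the CUDA kernel.
    # Assumes a py-faster-rcnn-style config with a USE_GPU_NMS flag.
    from model.config import cfg

    cfg.USE_GPU_NMS = False  # nms() wrappers should then dispatch to the CPU implementation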
@endernewton The person in issue #8 also has the same problem, and she's using a K40! |
@amirhfarzaneh I guess she figured it out later and the error was not NaN in training |
@endernewton Could you please share your log files for training? Especially for the voc_2007_trainval dataset with the vgg16 architecture? I think this will be useful to others too. This way we can compare some statistics while we're training, like what the loss numbers should look like. Thank you in advance! |
@amirhfarzaneh the original one is lost. Let me see if I can retrain to get a similar log. |
I have put up a log file at http://gs11655.sp.cs.cmu.edu/xinleic/tf-faster-rcnn/logs/vgg16_voc_2007_trainval__vgg16.txt.2017-05-14_19-26-27 |
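If it helps with the comparison, the total-loss curve can be pulled out of a training log with a short script. This is only a sketch; the regex assumes log lines roughly of the form "iter: N / M, total loss: X", which may not match this repo's logger exactly:

    # Sketch: extract (iteration, total loss) pairs from a training log file.
    # The line format matched below is an assumption; adjust the regex as needed.
    import re
    import sys

    pattern = re.compile(r'iter:\s*(\d+)\s*/\s*\d+.*?total loss:\s*([-+0-9.eEnaN]+)')

    with open(sys.argv[1]) as log:
        for line in log:
            match = pattern.search(line)
            if match:
                print(match.group(1), match.group(2))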
@endernewton the link to the log file you posted appears to be broken. I just ran the res101 model with gpu_nms and with cpu_nms. gpu_nms gave me NaNs during training; cpu_nms gave me the expected results. I am using one Titan Xp (compute capability 6.1) and configured the setup.py with 'sm_61', following the README. Is this expected behavior? Perhaps the OP would get expected results if they used the cpu_nms... |
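For others comparing notes: the compute-capability setting is the nvcc '-arch' option in lib/setup.py. A rough, approximate fragment (not the literal file contents); sm_52 targets Maxwell cards like the 980/980Ti, while sm_61 targets Pascal cards like the GTX 1080 and Titan Xp:

    # lib/setup.py (approximate fragment): the nvcc architecture flag must
    # match the GPU's compute capability, e.g. sm_52 for Maxwell (GTX 980/980Ti)
    # or sm_61 for Pascal (GTX 1080 / Titan Xp).
    extra_compile_args = {
        'gcc':  ['-Wno-unused-function'],
        'nvcc': ['-arch=sm_61',            # change this to match your card
                 '--ptxas-options=-v',
                 '-c',
                 '--compiler-options', "'-fPIC'"],
    }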
Wow, nice! On my side I am actually using -arch=sm_52 for both Pascal and non-Pascal GPUs, just another data point for making it work.
The web server is not stable for some reason. I can move that log to Google Drive later.
|
@endernewton I re-ran the res101 model with gpu_nms and configured the setup.py with 'sm_52'. No NaNs, but I only got 0.65 mAP. I am going to re-run and see what the variance is. |
No, 0.65 is too low.. hmm, then this problem is still hidden. Did you do testing with the provided models? What mAP did you get?
|
@dancsalo Maybe it's because you have the Xp. The code needs some modifications to work on more recent GPUs, I guess. I haven't got access to such GPUs yet, so I cannot help much. |
It seems like the NaN problem occurs only on some GPUs. I have a GTX 980Ti and NaN happens. I have tested the code on a Quadro M4000 and a GTX 1080, and NaNs don't appear and the training goes as it should! This is my log file on a 1080Ti: https://drive.google.com/file/d/0Bz-CTQRw0GZCeTNrcjZ0OFVXRWs/view?usp=sharing |
@amirhfarzaneh Hello, my GPU is a Tesla K40c. I also hit the NaN problem; do you know how to fix it? |
Hi, can anyone tell me why running tf_train_faster_rcnn.sh gives exactly the same effect and result as tf_test_faster_rcnn.sh? It means the train shell script didn't work at all. Thanks much |
I had this error, and the fix for me was in my XML annotation files: some were empty, and some bboxes had negative values. After eliminating them, the error disappeared. |
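Following up on this: a quick way to find such files is to scan the VOC-style Annotations directory for empty XMLs and for boxes with non-positive or inverted coordinates. A minimal sketch (the path and the exact validity rules are assumptions; VOC coordinates are 1-based):

    # Sketch: flag empty or malformed VOC-style XML annotation files.
    import os
    import xml.etree.ElementTree as ET

    ann_dir = 'data/VOCdevkit2007/VOC2007/Annotations'  # adjust to your layout

    for name in sorted(os.listdir(ann_dir)):
        path = os.path.join(ann_dir, name)
        if os.path.getsize(path) == 0:
            print('EMPTY FILE:', name)
            continue
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError as err:
            print('PARSE ERROR:', name, err)
            continue
        objects = root.findall('object')
        if not objects:
            print('NO OBJECTS:', name)
        for obj in objects:
            box = obj.find('bndbox')
            xmin, ymin, xmax, ymax = (float(box.find(tag).text)
                                      for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
            if xmin < 1 or ymin < 1 or xmax <= xmin or ymax <= ymin:
                print('BAD BOX:', name, (xmin, ymin, xmax, ymax))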
I had this error, too. My pictures are 1280*960; is that too big? Does it matter? Will your Python code resize them? |
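On the resizing question: pipelines in this family rescale each image so that the shorter side matches TRAIN.SCALES (600 by default) with the longer side capped at TRAIN.MAX_SIZE (1000 by default), so a 1280*960 input is shrunk before training. A small sketch of that scaling rule (the default values are taken from the py-faster-rcnn-style config and are an assumption about this repo's settings):

    # Sketch of the short-side / long-side scaling rule; a 1280x960 image
    # would be resized to 800x600 under the assumed defaults.
    target_size, max_size = 600, 1000        # assumed TRAIN.SCALES[0], TRAIN.MAX_SIZE
    width, height = 1280, 960

    im_scale = float(target_size) / min(width, height)
    if round(im_scale * max(width, height)) > max_size:
        im_scale = float(max_size) / max(width, height)

    print('scale %.3f -> %dx%d' % (im_scale, round(width * im_scale), round(height * im_scale)))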
Hello, did you fix it? |
|
I'm following the exact same instructions for training, but during training with the command
./experiments/scripts/train_faster_rcnn.sh 0 pascal_voc vgg16
those errors show up, and from there the losses become NaN! I have changed nothing in the files!
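One practical debugging step, independent of the GPU NMS question, is to make the training loop fail fast and report which loss term goes bad first. A hedged sketch of such a check (the loss names below are hypothetical and would need to match whatever the training step actually returns):

    # Sketch: abort as soon as any loss term becomes non-finite so the
    # offending iteration and term are easy to identify.
    import numpy as np

    def check_losses(losses, iteration):
        """losses: dict mapping loss name -> float from the current train step."""
        for name, value in losses.items():
            if not np.isfinite(value):
                raise RuntimeError('iter %d: loss %s is %r' % (iteration, name, value))

    # Hypothetical usage inside the training loop:
    # check_losses({'rpn_cls': rpn_loss_cls, 'rpn_box': rpn_loss_box,
    #               'cls': loss_cls, 'box': loss_box, 'total': total_loss}, it)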