
Running train.py keeps hitting OOM #32

Open
BngThea opened this issue Feb 22, 2020 · 11 comments

Comments

@BngThea

BngThea commented Feb 22, 2020

After the GPU is initialized:
2020-02-22 10:32:07.920229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10023 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5)
it prints the warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Then, once training starts, GPU memory overflows:
2020-02-22 10:56:41.258201: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at tile_ops.cc:220 : Resource exhausted: OOM when allocating tensor with shape[512,7,7,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I searched for related issues online; the hint points to tf.gather, but the solutions I found are all tied to specific code. Do you know what is going on here? Thanks!
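
For context on the warning itself, here is a minimal stand-alone TF 1.x sketch (not this project's code) of how a tf.gather produces a sparse IndexedSlices gradient that TensorFlow may later densify, which is where that UserWarning comes from:

```python
# Stand-alone illustration, written against the TF 1.14 graph-mode API.
import tensorflow as tf

params = tf.Variable(tf.random_normal([100000, 256]))  # e.g. a large weight table
indices = tf.constant([3, 17, 42])
gathered = tf.gather(params, indices)
loss = tf.reduce_sum(gathered)

# The gradient w.r.t. `params` is a sparse IndexedSlices (only the gathered rows),
# not a dense tensor.
grad = tf.gradients(loss, params)[0]
print(type(grad))  # tensorflow.python.framework.ops.IndexedSlices

# When a downstream op needs a dense tensor, TF densifies the IndexedSlices, and
# if it cannot determine the dense shape statically it emits the UserWarning
# quoted above. The conversion itself is usually harmless; the OOM in this issue
# is a ~200 MB intermediate tensor ([512, 7, 7, 2048] in float32) that no longer
# fits alongside everything else on the 11 GB card.
dense_grad = tf.convert_to_tensor(grad)
```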

@yizt
Owner

yizt commented Feb 22, 2020

@BngThea Reduce these two parameters: IMAGES_PER_GPU and IMAGE_MAX_DIM.
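
For concreteness, a minimal sketch of the kind of change meant here; IMAGES_PER_GPU and IMAGE_MAX_DIM are the names used in this thread, but the config class below is only a stand-in, so edit the attributes wherever they live in your checkout:

```python
# Stand-in for the project's training config (the real project keeps
# IMAGES_PER_GPU / IMAGE_MAX_DIM in its own config module; edit them there).
class TrainConfig:
    IMAGES_PER_GPU = 2    # per-GPU batch size; halving it roughly halves activation memory
    IMAGE_MAX_DIM = 720   # longer image side after resizing; smaller means smaller feature maps

# Values reported later in this thread to fit on an 11 GB RTX 2080 Ti:
TrainConfig.IMAGES_PER_GPU = 1
TrainConfig.IMAGE_MAX_DIM = 500
```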

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Thanks. I changed IMAGES_PER_GPU from 2 to 1 and IMAGE_MAX_DIM from 720 to 500, and it runs now.

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Hi, I just made the changes above, but now the loss explodes during training. I restarted several times and it happens every time:
40/1252 [..............................] - ETA: 23:06 - loss: 245879418.9902 - rpn_bbox_loss: 0.6706 - rpn_class_loss: 0.5414 - rcnn_bbox_loss: 0.8370 - rcnn_class_loss: 1.3189 - regular_loss: 52.1087 - gt_num: 2.9813 - positive_anchor_num: 12.7000 - negative_anchor_num: 67.3000 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6063 - roi_num: 1969.6062 - positive_roi_num: 20.3312 - negativ
41/1252 [..............................] - ETA: 22:45 - loss: 249477870.6246 - rpn_bbox_loss: 0.6715 - rpn_class_loss: 0.5355 - rcnn_bbox_loss: 0.8344 - rcnn_class_loss: 1.3020 - regular_loss: 52.8713 - gt_num: 2.9390 - positive_anchor_num: 12.4817 - negative_anchor_num: 67.5183 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6082 - roi_num: 1970.3475 - positive_roi_num: 20.1280 - negativ
42/1252 [>.............................] - ETA: 22:25 - loss: 252904967.4192 - rpn_bbox_loss: 0.6670 - rpn_class_loss: 0.5286 - rcnn_bbox_loss: 0.8322 - rcnn_class_loss: 1.2902 - regular_loss: 53.5976 - gt_num: 2.9226 - positive_anchor_num: 12.5238 - negative_anchor_num: 67.4762 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6104 - roi_num: 1971.0536 - positive_roi_num: 20.2143 - negativ
43/1252 [>.............................] - ETA: 22:06 - loss: 256172664.3630 - rpn_bbox_loss: 0.6636 - rpn_class_loss: 0.5222 - rcnn_bbox_loss: 0.8288 - rcnn_class_loss: 1.2761 - regular_loss: 54.2901 - gt_num: 2.8953 - positive_anchor_num: 12.3605 - negative_anchor_num: 67.6395 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6125 - roi_num: 1971.1279 - positive_roi_num: 20.1919 - negativ
44/1252 [>.............................] - ETA: 21:47 - loss: 259291829.6274 - rpn_bbox_loss: 0.6674 - rpn_class_loss: 0.5176 - rcnn_bbox_loss: 0.8258 - rcnn_class_loss: 1.2660 - regular_loss: 54.9511 - gt_num: 2.9375 - positive_anchor_num: 12.2898 - negative_anchor_num: 67.7102 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6101 - roi_num: 1970.9773 - positive_roi_num: 20.2557 - negativ
45/1252 [>.............................] - ETA: 21:30 - loss: 262272365.3246 - rpn_bbox_loss: 0.6655 - rpn_class_loss: 0.5145 - rcnn_bbox_loss: 0.8242 - rcnn_class_loss: 1.2521 - regular_loss: 55.5828 - gt_num: 2.9667 - positive_anchor_num: 12.3722 - negative_anchor_num: 67.6278 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6090 - roi_num: 1971.1111 - positive_roi_num: 20.2444 - negativ

@yizt
Owner

yizt commented Feb 22, 2020 via email

@yizt
Owner

yizt commented Feb 22, 2020

@BngThea Please update to the latest code and try again.

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Hi, after updating I ran 5 tests: in two of them the loss grew a bit more slowly but still ended up increasing, and the other three showed no improvement, some even blew up faster.

@yizt
Owner

yizt commented Feb 23, 2020

@BngThea I also set IMAGES_PER_GPU to 1 and IMAGE_MAX_DIM to 500, and the loss does not explode:
2020-02-23 08:46:09.669647: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-23 08:46:10.244852: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
1872/22136 [=>............................] - ETA: 1:19:18 - loss: 4.4972 - rpn_bbox_loss: 0.7863 - rpn_class_loss: 0.2431 - rcnn_bbox_loss: 0.6923 - rcnn_class_loss: 0.6140 - regular_loss: 13.5618 - gt_num: 2.5067 - positive_anchor_num: 6.9930 - negative_anchor_num: 73.0070 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.5722 - roi_num: 1426.8147 - positive_roi_num: 15.3840 - negative_roi_num: 10

Also, IMAGES_PER_GPU=3 with IMAGE_MAX_DIM=720 runs fine on an RTX 2080 Ti, and an RTX 2080 Ti is what I am using as well.

@BngThea
Author

BngThea commented Feb 23, 2020

@yizt Are you still on TF 1.9? I am using 1.14 because my CUDA version is 10.1. My other demos all run under 2.x or 1.14 and I would rather not change the CUDA version. Could that be a factor? My Keras version is 2.2.5.

@yizt
Owner

yizt commented Feb 23, 2020

@BngThea My TF version is also 1.14, and CUDA is V10.0.130; the project now uses the Keras bundled with TF (tf.keras).
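
To compare environments quickly, a small sanity-check snippet (plain TF 1.x calls, nothing project-specific):

```python
# Environment sanity check using standard TF 1.x APIs.
import tensorflow as tf

print(tf.__version__)                  # e.g. 1.14.0
print(tf.keras.__version__)            # version of the Keras bundled with TF (what the project now uses)
print(tf.test.is_built_with_cuda())    # True if this TF build was compiled against CUDA
print(tf.test.is_gpu_available())      # True if a GPU is actually usable at runtime
```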

@BngThea
Author

BngThea commented Feb 25, 2020

@yizt Very strange: on the same hardware the loss explodes under Ubuntu 18.04, yet it trains normally under Windows 10.

A few more questions:
1. I trained with ResNet for 80 epochs; the loss is around 0.3, but the mAP is very low. Roughly what loss value do you end up with?
2. I have my own dataset, already converted to VOC2007 format; each image contains exactly one ground-truth box and the size is fixed at 378*427. How should I adjust the config to train on it? I handled the fixed size by modifying the corresponding function, but the results from your model with default settings differ a lot from the TensorFlow Faster R-CNN (https://github.com/smallcorgi/Faster-RCNN_TF). How should the other parameters in config be adjusted, and how should I choose the number of clusters for the function that generates the anchor ground truth? Thanks!
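
On the cluster-count question, for context: one common, generic way to pick anchor sizes for a new dataset is to k-means the ground-truth box dimensions parsed from the VOC XML annotations and use the cluster centers as anchor sizes, so the cluster count k is simply the number of anchor sizes you want. A minimal sketch under those assumptions, not this project's actual anchor/GT code; the path and k are placeholders:

```python
# Minimal sketch: estimate anchor (width, height) sizes for a VOC2007-format
# dataset by k-means clustering the ground-truth boxes. Generic recipe only,
# NOT this project's anchor/GT code.
import glob
import xml.etree.ElementTree as ET

import numpy as np
from sklearn.cluster import KMeans

def load_gt_sizes(annotation_dir):
    """Collect (width, height) of every ground-truth box in the VOC XML files."""
    sizes = []
    for xml_path in glob.glob(annotation_dir + "/*.xml"):
        root = ET.parse(xml_path).getroot()
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            w = float(box.find("xmax").text) - float(box.find("xmin").text)
            h = float(box.find("ymax").text) - float(box.find("ymin").text)
            sizes.append((w, h))
    return np.array(sizes)

if __name__ == "__main__":
    sizes = load_gt_sizes("VOCdevkit/VOC2007/Annotations")  # placeholder path
    k = 5                                                   # number of anchor sizes to try
    centers = KMeans(n_clusters=k, random_state=0).fit(sizes).cluster_centers_
    # Sort by area so the anchor sizes go from small to large.
    print(centers[np.argsort(centers[:, 0] * centers[:, 1])])
```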

@wanghangege

May I ask: training stops halfway through; is it running evaluation at that point?
