
Running train.py keeps hitting OOM #32

Open
BngThea opened this issue Feb 22, 2020 · 11 comments

Comments

@BngThea

BngThea commented Feb 22, 2020

After the GPU is initialized:
2020-02-22 10:32:07.920229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10023 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5)
it prints the warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
Then, once training starts, GPU memory overflows:
2020-02-22 10:56:41.258201: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at tile_ops.cc:220 : Resource exhausted: OOM when allocating tensor with shape[512,7,7,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I searched for related issues online; the hint points to tf.gather, but the solutions I found are all tied to specific code. Do you know what is going on here? Thanks!
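
For context on the warning itself, here is a minimal stand-alone TF 1.x sketch (not this project's code) of how a tf.gather produces a sparse IndexedSlices gradient that TensorFlow may later densify, which is where that UserWarning comes from:

```python
# Stand-alone illustration, written against the TF 1.14 graph-mode API.
import tensorflow as tf

params = tf.Variable(tf.random_normal([100000, 256]))  # e.g. a large weight table
indices = tf.constant([3, 17, 42])
gathered = tf.gather(params, indices)
loss = tf.reduce_sum(gathered)

# The gradient w.r.t. `params` is a sparse IndexedSlices (only the gathered rows),
# not a dense tensor.
grad = tf.gradients(loss, params)[0]
print(type(grad))  # tensorflow.python.framework.ops.IndexedSlices

# When a downstream op needs a dense tensor, TF densifies the IndexedSlices, and
# if it cannot determine the dense shape statically it emits the UserWarning
# quoted above. The conversion itself is usually harmless; the OOM in this issue
# is a ~200 MB intermediate tensor ([512, 7, 7, 2048] in float32) that no longer
# fits alongside everything else on the 11 GB card.
dense_grad = tf.convert_to_tensor(grad)
```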

@yizt
Owner

yizt commented Feb 22, 2020

@BngThea Reduce these two parameters: IMAGES_PER_GPU and IMAGE_MAX_DIM.
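
For concreteness, a minimal sketch of the kind of change meant here; IMAGES_PER_GPU and IMAGE_MAX_DIM are the names used in this thread, but the config class below is only a stand-in, so edit the attributes wherever they live in your checkout:

```python
# Stand-in for the project's training config (the real project keeps
# IMAGES_PER_GPU / IMAGE_MAX_DIM in its own config module; edit them there).
class TrainConfig:
    IMAGES_PER_GPU = 2    # per-GPU batch size; halving it roughly halves activation memory
    IMAGE_MAX_DIM = 720   # longer image side after resizing; smaller means smaller feature maps

# Values reported later in this thread to fit on an 11 GB RTX 2080 Ti:
TrainConfig.IMAGES_PER_GPU = 1
TrainConfig.IMAGE_MAX_DIM = 500
```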

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Thanks. I changed IMAGES_PER_GPU from 2 to 1 and IMAGE_MAX_DIM from 720 to 500, and it runs now.

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Hi, I just made the changes above, but now the loss explodes during training. I restarted several times and it happens every time:
40/1252 [..............................] - ETA: 23:06 - loss: 245879418.9902 - rpn_bbox_loss: 0.6706 - rpn_class_loss: 0.5414 - rcnn_bbox_loss: 0.8370 - rcnn_class_loss: 1.3189 - regular_loss: 52.1087 - gt_num: 2.9813 - positive_anchor_num: 12.7000 - negative_anchor_num: 67.3000 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6063 - roi_num: 1969.6062 - positive_roi_num: 20.3312 - negativ
41/1252 [..............................] - ETA: 22:45 - loss: 249477870.6246 - rpn_bbox_loss: 0.6715 - rpn_class_loss: 0.5355 - rcnn_bbox_loss: 0.8344 - rcnn_class_loss: 1.3020 - regular_loss: 52.8713 - gt_num: 2.9390 - positive_anchor_num: 12.4817 - negative_anchor_num: 67.5183 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6082 - roi_num: 1970.3475 - positive_roi_num: 20.1280 - negativ
42/1252 [>.............................] - ETA: 22:25 - loss: 252904967.4192 - rpn_bbox_loss: 0.6670 - rpn_class_loss: 0.5286 - rcnn_bbox_loss: 0.8322 - rcnn_class_loss: 1.2902 - regular_loss: 53.5976 - gt_num: 2.9226 - positive_anchor_num: 12.5238 - negative_anchor_num: 67.4762 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6104 - roi_num: 1971.0536 - positive_roi_num: 20.2143 - negativ
43/1252 [>.............................] - ETA: 22:06 - loss: 256172664.3630 - rpn_bbox_loss: 0.6636 - rpn_class_loss: 0.5222 - rcnn_bbox_loss: 0.8288 - rcnn_class_loss: 1.2761 - regular_loss: 54.2901 - gt_num: 2.8953 - positive_anchor_num: 12.3605 - negative_anchor_num: 67.6395 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6125 - roi_num: 1971.1279 - positive_roi_num: 20.1919 - negativ
44/1252 [>.............................] - ETA: 21:47 - loss: 259291829.6274 - rpn_bbox_loss: 0.6674 - rpn_class_loss: 0.5176 - rcnn_bbox_loss: 0.8258 - rcnn_class_loss: 1.2660 - regular_loss: 54.9511 - gt_num: 2.9375 - positive_anchor_num: 12.2898 - negative_anchor_num: 67.7102 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6101 - roi_num: 1970.9773 - positive_roi_num: 20.2557 - negativ
45/1252 [>.............................] - ETA: 21:30 - loss: 262272365.3246 - rpn_bbox_loss: 0.6655 - rpn_class_loss: 0.5145 - rcnn_bbox_loss: 0.8242 - rcnn_class_loss: 1.2521 - regular_loss: 55.5828 - gt_num: 2.9667 - positive_anchor_num: 12.3722 - negative_anchor_num: 67.6278 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6090 - roi_num: 1971.1111 - positive_roi_num: 20.2444 - negativ

@yizt
Owner

yizt commented Feb 22, 2020 via email

@yizt
Owner

yizt commented Feb 22, 2020

@BngThea Please update to the latest code and try again.

@BngThea
Author

BngThea commented Feb 22, 2020

@yizt Hi, after updating I ran 5 tests: in two of them the loss grew a bit more slowly but still ended up increasing, and the other three showed no improvement, some even blew up faster.

@yizt
Owner

yizt commented Feb 23, 2020

@BngThea I also set IMAGES_PER_GPU to 1 and IMAGE_MAX_DIM to 500, and the loss does not explode:
2020-02-23 08:46:09.669647: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-23 08:46:10.244852: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
1872/22136 [=>............................] - ETA: 1:19:18 - loss: 4.4972 - rpn_bbox_loss: 0.7863 - rpn_class_loss: 0.2431 - rcnn_bbox_loss: 0.6923 - rcnn_class_loss: 0.6140 - regular_loss: 13.5618 - gt_num: 2.5067 - positive_anchor_num: 6.9930 - negative_anchor_num: 73.0070 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.5722 - roi_num: 1426.8147 - positive_roi_num: 15.3840 - negative_roi_num: 10

Also, IMAGES_PER_GPU=3 with IMAGE_MAX_DIM=720 runs fine on an RTX 2080 Ti, and an RTX 2080 Ti is what I am using as well.

@BngThea
Author

BngThea commented Feb 23, 2020

@yizt Are you still on TF 1.9? I am using 1.14 because my CUDA version is 10.1. My other demos all run under 2.x or 1.14 and I would rather not change the CUDA version. Could that be a factor? My Keras version is 2.2.5.

@yizt
Owner

yizt commented Feb 23, 2020

@BngThea My TF version is also 1.14, and CUDA is V10.0.130; the project now uses the Keras bundled with TF (tf.keras).
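
To compare environments quickly, a small sanity-check snippet (plain TF 1.x calls, nothing project-specific):

```python
# Environment sanity check using standard TF 1.x APIs.
import tensorflow as tf

print(tf.__version__)                  # e.g. 1.14.0
print(tf.keras.__version__)            # version of the Keras bundled with TF (what the project now uses)
print(tf.test.is_built_with_cuda())    # True if this TF build was compiled against CUDA
print(tf.test.is_gpu_available())      # True if a GPU is actually usable at runtime
```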

@BngThea
Author

BngThea commented Feb 25, 2020

@yizt Very strange: on the same hardware the loss explodes under Ubuntu 18.04, yet it trains normally under Windows 10.

A few more questions:
1. I trained with ResNet for 80 epochs; the loss is around 0.3, but the mAP is very low. Roughly what loss value do you end up with?
2. I have my own dataset, already converted to VOC2007 format; each image contains exactly one ground-truth box and the size is fixed at 378*427. How should I adjust the config to train on it? I handled the fixed size by modifying the corresponding function, but the results from your model with default settings differ a lot from the TensorFlow Faster R-CNN (https://github.com/smallcorgi/Faster-RCNN_TF). How should the other parameters in config be adjusted, and how should I choose the number of clusters for the function that generates the anchor ground truth? Thanks!
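
On the cluster-count question, for context: one common, generic way to pick anchor sizes for a new dataset is to k-means the ground-truth box dimensions parsed from the VOC XML annotations and use the cluster centers as anchor sizes, so the cluster count k is simply the number of anchor sizes you want. A minimal sketch under those assumptions, not this project's actual anchor/GT code; the path and k are placeholders:

```python
# Minimal sketch: estimate anchor (width, height) sizes for a VOC2007-format
# dataset by k-means clustering the ground-truth boxes. Generic recipe only,
# NOT this project's anchor/GT code.
import glob
import xml.etree.ElementTree as ET

import numpy as np
from sklearn.cluster import KMeans

def load_gt_sizes(annotation_dir):
    """Collect (width, height) of every ground-truth box in the VOC XML files."""
    sizes = []
    for xml_path in glob.glob(annotation_dir + "/*.xml"):
        root = ET.parse(xml_path).getroot()
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            w = float(box.find("xmax").text) - float(box.find("xmin").text)
            h = float(box.find("ymax").text) - float(box.find("ymin").text)
            sizes.append((w, h))
    return np.array(sizes)

if __name__ == "__main__":
    sizes = load_gt_sizes("VOCdevkit/VOC2007/Annotations")  # placeholder path
    k = 5                                                   # number of anchor sizes to try
    centers = KMeans(n_clusters=k, random_state=0).fit(sizes).cluster_centers_
    # Sort by area so the anchor sizes go from small to large.
    print(centers[np.argsort(centers[:, 0] * centers[:, 1])])
```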

@wanghangege

May I ask: training stops halfway through; is it running evaluation at that point?
