Running train.py keeps hitting OOM #32
Comments
@BngThea Reduce these two parameters: IMAGES_PER_GPU and IMAGE_MAX_DIM
@yizt Thanks. I changed IMAGES_PER_GPU from 2 to 1 and IMAGE_MAX_DIM from 720 to 500, and it runs now.
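For reference, a minimal sketch of what this kind of change looks like, assuming a Mask R-CNN style configuration class. Only the two attribute names and the values 1 and 500 come from this thread; the class name and the BATCH_SIZE line are illustrative, not the project's actual code.

```python
# Illustrative config sketch -- only IMAGES_PER_GPU and IMAGE_MAX_DIM (and the
# values 1 and 500) come from this thread; the class itself is hypothetical.
class LowMemoryConfig:
    IMAGES_PER_GPU = 1    # was 2: fewer images per step means smaller activation tensors
    IMAGE_MAX_DIM = 500   # was 720: inputs are typically resized/padded up to this size
    BATCH_SIZE = IMAGES_PER_GPU  # assuming a single GPU


print(LowMemoryConfig.IMAGES_PER_GPU, LowMemoryConfig.IMAGE_MAX_DIM)
```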
@yizt Hello, I just made the changes above, but the loss explodes during training. I restarted several times and it happens every time.
I'll test it this afternoon and get back to you.
The training log from the run where the loss explodes (quoted from the report above):
40/1252 [..............................] - ETA: 23:06 - loss: 245879418.9902 - rpn_bbox_loss: 0.6706 - rpn_class_loss: 0.5414 - rcnn_bbox_loss: 0.8370 - rcnn_class_loss: 1.3189 - regular_loss: 52.1087 - gt_num: 2.9813 - positive_anchor_num: 12.7000 - negative_anchor_num: 67.3000 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6063 - roi_num: 1969.6062 - positive_roi_num: 20.3312 - negativ
41/1252 [..............................] - ETA: 22:45 - loss: 249477870.6246 - rpn_bbox_loss: 0.6715 - rpn_class_loss: 0.5355 - rcnn_bbox_loss: 0.8344 - rcnn_class_loss: 1.3020 - regular_loss: 52.8713 - gt_num: 2.9390 - positive_anchor_num: 12.4817 - negative_anchor_num: 67.5183 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6082 - roi_num: 1970.3475 - positive_roi_num: 20.1280 - negativ
42/1252 [>.............................] - ETA: 22:25 - loss: 252904967.4192 - rpn_bbox_loss: 0.6670 - rpn_class_loss: 0.5286 - rcnn_bbox_loss: 0.8322 - rcnn_class_loss: 1.2902 - regular_loss: 53.5976 - gt_num: 2.9226 - positive_anchor_num: 12.5238 - negative_anchor_num: 67.4762 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6104 - roi_num: 1971.0536 - positive_roi_num: 20.2143 - negativ
43/1252 [>.............................] - ETA: 22:06 - loss: 256172664.3630 - rpn_bbox_loss: 0.6636 - rpn_class_loss: 0.5222 - rcnn_bbox_loss: 0.8288 - rcnn_class_loss: 1.2761 - regular_loss: 54.2901 - gt_num: 2.8953 - positive_anchor_num: 12.3605 - negative_anchor_num: 67.6395 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6125 - roi_num: 1971.1279 - positive_roi_num: 20.1919 - negativ
44/1252 [>.............................] - ETA: 21:47 - loss: 259291829.6274 - rpn_bbox_loss: 0.6674 - rpn_class_loss: 0.5176 - rcnn_bbox_loss: 0.8258 - rcnn_class_loss: 1.2660 - regular_loss: 54.9511 - gt_num: 2.9375 - positive_anchor_num: 12.2898 - negative_anchor_num: 67.7102 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6101 - roi_num: 1970.9773 - positive_roi_num: 20.2557 - negativ
45/1252 [>.............................] - ETA: 21:30 - loss: 262272365.3246 - rpn_bbox_loss: 0.6655 - rpn_class_loss: 0.5145 - rcnn_bbox_loss: 0.8242 - rcnn_class_loss: 1.2521 - regular_loss: 55.5828 - gt_num: 2.9667 - positive_anchor_num: 12.3722 - negative_anchor_num: 67.6278 - rpn_miss_gt_num: 0.0000e+00 - rpn_gt_min_max_iou: 0.6090 - roi_num: 1971.1111 - positive_roi_num: 20.2444 - negativ
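Note that the component losses (rpn_* and rcnn_*) stay near 1 while the total climbs into the hundreds of millions, so whatever is diverging is not visible in the per-head terms shown here. A generic guard, not part of this project, is a Keras callback that aborts the run as soon as the reported total loss passes a threshold, so a diverging run fails fast instead of burning GPU hours:

```python
import tensorflow as tf


class StopOnDivergence(tf.keras.callbacks.Callback):
    """Generic guard (not project code): stop training once the reported
    total loss exceeds a threshold."""

    def __init__(self, max_loss=1e4):
        super(StopOnDivergence, self).__init__()
        self.max_loss = max_loss

    def on_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and loss > self.max_loss:
            print('\nLoss %.3e exceeded %.3e at batch %d; stopping.' %
                  (loss, self.max_loss, batch))
            self.model.stop_training = True

# Usage (hypothetical): add an instance to the callbacks list passed to model.fit / fit_generator.
```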
@BngThea Please pull the latest code and try again.
@yizt Hello, I ran five tests after updating. In two of them the loss climbed a bit more slowly but still ended up increasing; the other three showed no improvement and sometimes blew up even faster.
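A generic mitigation worth trying when the loss keeps climbing like this (not a fix confirmed by this project) is gradient clipping; the tf.keras optimizers in TF 1.14 accept a clipnorm argument:

```python
import tensorflow as tf

# Sketch only: the learning rate and momentum here are placeholders, not the
# project's actual training settings; clipnorm caps the gradient norm so a
# single bad batch cannot blow up the weights.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, clipnorm=5.0)
# model.compile(optimizer=optimizer, loss=...)  # plug into the existing compile call
```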
@BngThea I set IMAGES_PER_GPU to 1 and IMAGE_MAX_DIM to 500 as well, and the loss does not explode for me. In addition ...
@yizt Are you still on TF 1.9? I'm using 1.14 now because my CUDA version is 10.1.
@BngThea My TF version is also 1.14 and CUDA is V10.0.130; the project now uses the Keras bundled with TF (tf.keras).
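A quick, generic way to confirm which TensorFlow build, bundled Keras, and GPU are actually being picked up (TF 1.x API):

```python
import tensorflow as tf

print("TensorFlow:", tf.__version__)               # 1.14.x in this discussion
print("Bundled Keras:", tf.keras.__version__)      # the tf.keras mentioned above
print("GPU visible:", tf.test.is_gpu_available())  # TF 1.x check; creates a session
```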
@yizt That's odd. On Ubuntu 18.04 with the same hardware the loss explodes, but on Windows 10 it trains normally. I also have a few other questions:
When training stops partway through an epoch, is it running evaluation at that point?
After the GPU is initialized:
2020-02-22 10:32:07.920229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10023 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5)
it prints the warning:
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
and then, once training starts, GPU memory overflows:
2020-02-22 10:56:41.258201: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at tile_ops.cc:220 : Resource exhausted: OOM when allocating tensor with shape[512,7,7,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I searched for related issues online and the warning points to tf.gather, but the suggested fixes are all tied to specific code. Do you know what is going on here? Thanks!
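Two generic observations, neither specific to this project: the IndexedSlices warning is what TensorFlow emits when it backpropagates through tf.gather and densifies the sparse gradient, and by itself it is usually harmless; the OOM tensor shape [512,7,7,2048] looks like RoI-pooled backbone features (512 proposals, 7x7 pooling, 2048 channels), so lowering the number of training RoIs per image, if the config exposes such a setting, is another memory lever. On TF 1.x you can also ask TensorFlow to allocate GPU memory on demand instead of reserving it all up front; this does not add memory, but it makes real usage visible in nvidia-smi and avoids contention with other processes on the same GPU:

```python
import tensorflow as tf

# TF 1.x session setup (generic, not project code): grow GPU memory on demand
# rather than grabbing the whole card at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
tf.keras.backend.set_session(sess)  # must run before the model is built
```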