RAM usage continues to grow and the training process stopped without error !!! #103
Someone reported the same issue in #91. We have no idea what happened yet, as we cannot reproduce this bug. Could you set num_workers to 0 and try again?
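(For anyone trying this suggestion, here is a minimal sketch of passing num_workers=0 to a standard PyTorch DataLoader; the dummy dataset below is just a placeholder, not YOLOX's actual dataset class.)

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Stand-in for the real YOLOX dataset, used only to show the flag."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.zeros(3, 320, 320), idx

# num_workers=0 keeps all data loading in the main process, which rules
# out worker-process memory growth as the source of the leak.
loader = DataLoader(DummyDataset(), batch_size=32, shuffle=True, num_workers=0)
```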
What's the version of your torch? And could you provide more information about your environment so we can reproduce this problem? Thx~
I ran into the same problem: when training on a custom COCO-format dataset, RAM usage keeps growing until it overflows. Environment: Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0). The problem occurs with both single-GPU and 4-GPU training.
Thx too~ By the way:
I am also training on a custom COCO-format dataset, but it doesn't seem to be related to the dataset itself; #91 uses VOC.
I am training on custom VOC-format data and hit the same problem; training stopped at 204/300.
OK, we'll keep this issue open and try to reproduce it first.
Thx for your advice; we are still trying to find the oldest working version of each package.
Thank you for providing such a great project; looking forward to the good news.
@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?
32 GB RAM, batch size 32, yolox-l model. RAM usage was about 14 GB at the start of training and then slowly increased.
@ruinmessi 128GB
We have reproduced this problem and are trying to fix it now. |
Hey, guys! We haven't fixed the memory leak itself yet, but we have changed the multiprocessing backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. So we recommend pulling the latest update, reinstalling yolox, and retrying your experiment.
Thx. |
Do you have a specific setup in which this problem doesn't occur? |
Sorry, guys... we found some errors in the above update and have to revert to the original spawn version. We are currently rewriting the whole dataloader to work around this bug.
@ruinmessi Locally, I reworked https://gist.github.com/mprostock/2850f3cd465155689052f0fa3a177a50 and saw that the original version "leaks" memory while my numpy array-based one does not. However, when I run it in YOLOX, I still get the memory leak. So maybe something downstream like the
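(For context on the linked gist: it compares a Dataset backed by a large Python list, whose memory pages get duplicated into each forked worker via copy-on-write because reading the elements touches their reference counts, against one backed by a single numpy array. A minimal sketch of the two variants, with made-up sizes rather than anything from YOLOX:)

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ListBackedDataset(Dataset):
    """Holds a large Python list; in forked workers, reading elements
    updates their refcounts, so copy-on-write slowly duplicates the list
    into every worker and RSS appears to "leak"."""
    def __init__(self, n=100_000):
        self.items = [np.random.rand(16).astype(np.float32) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return torch.from_numpy(self.items[idx])


class ArrayBackedDataset(Dataset):
    """Holds one contiguous numpy array; reading rows does not touch
    per-object refcounts, so the pages stay shared between workers."""
    def __init__(self, n=100_000):
        self.items = np.random.rand(n, 16).astype(np.float32)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return torch.from_numpy(self.items[idx])
```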
First of all, thank you for your awesome work. Sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.
@yonomitt @VantastyChen @JinYAnGHe Finally... we fixed the bug, and the training memory curve currently looks like this (yolox-tiny with batch size 128 on 8 GPUs):
Well, we found two major issues. One is in our COCO dataset: we have to load all annotations during initialization. The other is the loss logging: we have to detach the loss and log its scalar value instead of logging the tensor itself.
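(To illustrate the second point, a minimal sketch, not the actual YOLOX trainer code: if raw loss tensors are kept around for logging, each one retains its computation graph, so host memory grows every iteration; detaching and taking the scalar value avoids that.)

```python
import torch

loss_history = []

def log_loss(loss: torch.Tensor) -> None:
    # BAD: keeping the loss tensor itself also keeps its entire autograd
    # graph alive, so memory grows a little every iteration:
    #   loss_history.append(loss)

    # GOOD: store only the detached Python float for logging.
    loss_history.append(loss.detach().item())

# example usage with a toy loss
x = torch.randn(4, requires_grad=True)
log_loss((x ** 2).sum())
```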
Thanks for your awesome work!
Thanks! |
Could you kindly tell me how you implemented this part? |
@GOATmessi7 I am facing similar memory leak issue. Could you provide your solution here? Thanks |
During training, RAM usage continues to grow. Finally, the training process stopped. Is it a bug?
2021-07-23 14:46:56 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2000/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:19
2021-07-23 14:47:04 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2050/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.0, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:11
2021-07-23 14:47:13 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2100/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:27:02
2021-07-23 14:47:21 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2150/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.9, l1_loss: 0.0, conf_loss: 1.0, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:26:54
2021-07-23 14:47:30 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2200/3905, mem: 1730Mb, iter_time: 0.177s, data_time: 0.002s, total_loss: 2.3, iou_loss: 1.3, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.4, lr: 9.954e-03, size: 320, ETA: 16:26:51
2021-07-23 14:47:38 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2250/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.953e-03, size: 320, ETA: 16:26:43
2021-07-23 14:47:47 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2300/3905, mem: 1730Mb, iter_time: 0.168s, data_time: 0.001s, total_loss: 2.4, iou_loss: 1.5, l1_loss: 0.0, conf_loss: 0.5, cls_loss: 0.4, lr: 9.953e-03, size: 320, ETA: 16:26:36
------------------------stopped here--------------------
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 28% 50C P2 109W / 250W | 2050MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 49C P2 103W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:41:00.0 Off | N/A |
| 25% 48C P2 117W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:42:00.0 Off | N/A |
| 28% 50C P2 113W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:44:00.0 Off | N/A |
| 16% 27C P8 21W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:45:00.0 Off | N/A |
| 28% 50C P2 110W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:46:00.0 Off | N/A |
| 24% 47C P2 95W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:47:00.0 Off | N/A |
| 26% 49C P2 99W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
Is it that RAM usage keeps growing until it overflows, causing the program to stop?