
RAM usage continues to grow and the training process stops without error!!! #103

Closed
JinYAnGHe opened this issue Jul 23, 2021 · 25 comments

Comments

@JinYAnGHe

During training, RAM usage continues to grow. Finally, the training process stopped. Is this a bug?

2021-07-23 14:46:56 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2000/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:19
2021-07-23 14:47:04 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2050/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.0, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:11
2021-07-23 14:47:13 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2100/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:27:02
2021-07-23 14:47:21 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2150/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.9, l1_loss: 0.0, conf_loss: 1.0, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:26:54
2021-07-23 14:47:30 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2200/3905, mem: 1730Mb, iter_time: 0.177s, data_time: 0.002s, total_loss: 2.3, iou_loss: 1.3, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.4, lr: 9.954e-03, size: 320, ETA: 16:26:51
2021-07-23 14:47:38 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2250/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.953e-03, size: 320, ETA: 16:26:43
2021-07-23 14:47:47 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2300/3905, mem: 1730Mb, iter_time: 0.168s, data_time: 0.001s, total_loss: 2.4, iou_loss: 1.5, l1_loss: 0.0, conf_loss: 0.5, cls_loss: 0.4, lr: 9.953e-03, size: 320, ETA: 16:26:36
------------------------stopped here--------------------

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 28% 50C P2 109W / 250W | 2050MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 49C P2 103W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:41:00.0 Off | N/A |
| 25% 48C P2 117W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:42:00.0 Off | N/A |
| 28% 50C P2 113W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:44:00.0 Off | N/A |
| 16% 27C P8 21W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:45:00.0 Off | N/A |
| 28% 50C P2 110W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:46:00.0 Off | N/A |
| 24% 47C P2 95W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:47:00.0 Off | N/A |
| 26% 49C P2 99W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

Is it that the RAM keeps growing and then overflows, causing the program to stop?

@GOATmessi8
Member

#91 Someone else hit the same issue. We have no idea what is happening yet, as we cannot reproduce this bug. Could you set num_workers to 0 and try it again?
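In case it helps, here is a minimal sketch of disabling the DataLoader workers in a custom experiment file. It assumes the `data_num_workers` attribute exposed by YOLOX's Exp base class; treat the exact module path and attribute name as assumptions and adapt to your setup.

```python
# Minimal sketch: disable DataLoader worker processes in a custom Exp.
# Assumes YOLOX's Exp base class exposes `data_num_workers`, which is
# forwarded to the PyTorch DataLoader as `num_workers`.
from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        # 0 means data loading runs in the main process, which rules out
        # worker-process memory growth as the cause.
        self.data_num_workers = 0
```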

@JinYAnGHe
Copy link
Author

#91 Someone else hit the same issue. We have no idea what is happening yet, as we cannot reproduce this bug. Could you set num_workers to 0 and try it again?

I have tried that. The result is the same as #91: the RAM still continues to grow, just more slowly than with num_workers = 4.

@GOATmessi8
Copy link
Member

What version of torch are you using? And could you provide more information about your environment so we can reproduce this problem? Thx~

@VantastyChen

I ran into the same problem: when training a custom COCO-format dataset, memory usage keeps growing until it overflows. My environment is Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0), and the problem occurs with both single-GPU and 4-GPU training.

@JinYAnGHe
Author

THx too~
My ENV:
Ubuntu 16.04
CUDA 10.2
RTX 2080Ti
python 3.8
torch 1.8.0
torchvision 0.9.0

By the way:
Perhaps you could provide a more compatible requirements.txt, listing the version of each package, the environments it is compatible with, and so on?

@JinYAnGHe
Author

I ran into the same problem: when training a custom COCO-format dataset, memory usage keeps growing until it overflows. My environment is Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0), and the problem occurs with both single-GPU and 4-GPU training.

I am also training a custom COCO-format dataset, but it doesn't seem to be related to the dataset; #91 uses VOC.

@Hiwyl

Hiwyl commented Jul 23, 2021

I trained on my own VOC-format data and hit the same problem; training stopped at epoch 204/300.

@GOATmessi8
Member

OK, we'll keep this issue open and try to reproduce it first.

THx too~
My ENV:
Ubuntu 16.04
CUDA 10.2
RTX 2080Ti
python 3.8
torch 1.8.0
torchvision 0.9.0

By the way:
Perhaps you could provide a more compatible requirements.txt, listing the version of each package, the environments it is compatible with, and so on?

Thx for your advice; we are still working out the oldest working version of each package.

@JinYAnGHe
Author

OK, we'll keep this issue open and try to reproduce it first.

THx too~
My ENV:
Ubuntu 16.04
CUDA 10.2
RTX 2080Ti
python 3.8
torch 1.8.0
torchvision 0.9.0
By the way:
Perhaps you could provide a more compatible requirements.txt, listing the version of each package, the environments it is compatible with, and so on?

Thx for your advice; we are still working out the oldest working version of each package.

Thank you for providing such a great project. Looking forward to the good news.

@GOATmessi8
Member

@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?

@VantastyChen

@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?

32 GB of RAM, batch size 32, the YOLOX-L network. It uses about 14 GB at the start of training and then slowly increases.

@JinYAnGHe
Author

@ruinmessi 128GB

@GOATmessi8
Member

We have reproduced this problem and are trying to fix it now.

@GOATmessi8
Member

Hey, guys! We have not actually fixed the memory leak yet, but we changed the multiprocessing backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. So we recommend that you pull the latest update, reinstall yolox, and then retry your experiment.
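For anyone wondering what that switch means in practice, here is a rough, generic sketch of the two launch styles. It is not YOLOX's actual launcher code; `train.py` and the `--local-rank` flag are placeholders.

```python
# Rough, generic sketch of the two launch styles; not YOLOX's actual launcher.
import subprocess
import sys

import torch.multiprocessing as mp


def train_worker(rank: int, num_gpus: int) -> None:
    """Hypothetical per-GPU training entry point."""
    print(f"worker {rank}/{num_gpus} started")


def launch_with_spawn(num_gpus: int) -> None:
    # One child process per GPU, created by torch.multiprocessing and
    # running `train_worker` directly.
    mp.spawn(train_worker, args=(num_gpus,), nprocs=num_gpus, join=True)


def launch_with_subprocess(num_gpus: int, script: str = "train.py") -> None:
    # One fully independent command-line process per GPU; the parent only
    # waits for them. The script path and --local-rank flag are placeholders.
    procs = [
        subprocess.Popen([sys.executable, script, "--local-rank", str(rank)])
        for rank in range(num_gpus)
    ]
    for proc in procs:
        proc.wait()
```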

@VantastyChen

Hey, guys! We have not actually fixed the memory leak yet, but we changed the multiprocessing backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. So we recommend that you pull the latest update, reinstall yolox, and then retry your experiment.

Thx.

@shachargluska
Contributor

Do you have a specific setup in which this problem doesn't occur?
I use the nvcr.io/nvidia/pytorch:21.06-py3 docker image and am able to reproduce the problem.
torch version: 1.9.0a0+c3d40fd
cuda: 11.3, V11.3.109
cudnn: 8.2.1

@GOATmessi8
Member

Sorry, guys... we found some errors in the above update and had to revert to the original spawn version. We are currently rewriting the whole dataloader to work around this bug.

@yonomitt

@ruinmessi locally, I reworked the COCODataset to create numpy arrays of all the info the COCO class provides upfront. When I tested this version of the COCODataset by itself using this notebook:

https://gist.github.com/mprostock/2850f3cd465155689052f0fa3a177a50

I see that the original version "leaks" memory and that my numpy array-based one does not. However, when I run it in YOLOX, I still get the memory leak. So maybe something downstream, like MosaicDetection, is also contributing to this problem?
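For anyone trying the same workaround: the growth with multi-worker DataLoaders is usually the well-known copy-on-write/refcount effect that appears when the dataset holds a large tree of Python lists and dicts, and pre-converting that data to numpy arrays avoids it. Below is a minimal sketch of the idea, not yonomitt's actual code; the class and field names are illustrative.

```python
# Minimal sketch of the "pre-convert annotations to numpy" idea.
# Not the actual YOLOX COCODataset; names here are illustrative.
import numpy as np
from pycocotools.coco import COCO
from torch.utils.data import Dataset


class ArrayBackedCocoDataset(Dataset):
    def __init__(self, ann_file: str):
        coco = COCO(ann_file)
        self.img_ids = np.array(sorted(coco.getImgIds()), dtype=np.int64)

        boxes, labels = [], []
        for img_id in self.img_ids:
            anns = coco.loadAnns(coco.getAnnIds(imgIds=[int(img_id)]))
            boxes.append(
                np.array([a["bbox"] for a in anns], dtype=np.float32).reshape(-1, 4)
            )
            labels.append(np.array([a["category_id"] for a in anns], dtype=np.int64))
        # Keep per-image annotations as compact numpy arrays instead of the
        # COCO dict-of-dicts, so DataLoader workers touch far fewer Python
        # objects (each touched refcount forces a copy-on-write page copy).
        self.boxes = boxes
        self.labels = labels

    def __len__(self):
        return len(self.img_ids)

    def __getitem__(self, idx):
        return self.boxes[idx], self.labels[idx]
```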

@F0xZz
Contributor

F0xZz commented Jul 28, 2021

First, thank you for your awesome work. And sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.

@GOATmessi8
Member

@yonomitt @VantastyChen @JinYAnGHe Finally... we fixed the bug, and the training memory curve now looks like this (yolox-tiny with batch size 128 on 8 GPUs):
[screenshot: training memory-usage curve]

@GOATmessi8
Member

First, thank you for your awesome work. And sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.

Well, we found two major issues. One is in our COCO dataset: we must load all of the annotations during initialization. The other is in the loss logging: we have to detach the loss and log its value instead of keeping the tensor itself.
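For reference, the loss-logging part follows a common PyTorch pattern, sketched below as a generic training step rather than the actual YOLOX trainer: store a detached Python float for logging, not the loss tensor, so each iteration's autograd graph is not kept alive.

```python
# Generic illustration of the loss-logging fix; not the actual YOLOX trainer.
loss_history = []


def train_step(model, optimizer, criterion, inputs, targets):
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Leaky pattern: loss_history.append(loss) keeps every iteration's
    # autograd graph (and its intermediate tensors) alive via the stored tensor.
    # Fixed pattern: store a detached scalar instead.
    loss_history.append(loss.detach().item())
    return loss_history[-1]
```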

@F0xZz
Contributor

F0xZz commented Jul 28, 2021

First, thank you for your awesome work. And sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.

Well, we found two major issues. One is in our COCO dataset: we must load all of the annotations during initialization. The other is in the loss logging: we have to detach the loss and log its value instead of keeping the tensor itself.

Thanks for your awesome work!

@VantastyChen

First, thank you for your awesome work. And sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.

Well, we found two major issues. One is in our COCO dataset: we must load all of the annotations during initialization. The other is in the loss logging: we have to detach the loss and log its value instead of keeping the tensor itself.

Thanks!

@FatmaMazen2021

First, thank you for your awesome work. And sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.

Well, we found two major issues. One is in our COCO dataset: we must load all of the annotations during initialization. The other is in the loss logging: we have to detach the loss and log its value instead of keeping the tensor itself.

Could you kindly tell me how you implemented this part? I am stuck on this problem.

@PurvangL

@GOATmessi7 I am facing a similar memory leak issue. Could you provide your solution here? Thanks
