RAM usage continues to grow and the training process stopped without error !!! #103
Someone reported the same issue in #91. We have no idea what happened yet, as we cannot reproduce this bug. Could you set num_workers to 0 and try again?
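(For anyone trying this suggestion, here is a minimal sketch of passing num_workers=0 to a standard PyTorch DataLoader; the dummy dataset below is just a placeholder, not YOLOX's actual dataset class.)

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    """Stand-in for the real YOLOX dataset, used only to show the flag."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.zeros(3, 320, 320), idx

# num_workers=0 keeps all data loading in the main process, which rules
# out worker-process memory growth as the source of the leak.
loader = DataLoader(DummyDataset(), batch_size=32, shuffle=True, num_workers=0)
```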
What's the version of your torch? And could you provide more information about your environment so we can reproduce this problem? Thx~
I ran into the same problem: when training on a custom COCO-format dataset, RAM usage keeps growing until it overflows. Environment: Ubuntu 18.04 with Anaconda, PyTorch 1.8.1 (py3.7_cuda10.2_cudnn7.6.5_0). The problem occurs with both single-GPU and 4-GPU training.
Thx too~ By the way:
I am also training on a custom COCO-format dataset, but it doesn't seem to be related to the dataset itself; #91 uses VOC.
I am training on custom VOC-format data and hit the same problem; training stopped at 204/300.
OK, we'll keep this issue open and try to reproduce it first.
Thx for your advice; we are still trying to find the oldest working version of each package.
Thank you for providing such a great project; looking forward to the good news.
@JinYAnGHe @Hiwyl @VantastyChen How much RAM do you have for training?
32 GB RAM, batch size 32, yolox-l model. RAM usage was about 14 GB at the start of training and then slowly increased.
@ruinmessi 128GB
We have reproduced this problem and are trying to fix it now. |
Hey, guys! We haven't fixed the memory leak itself yet, but we have changed the multiprocessing backend from spawn to subprocess. In our tests, memory still increases during training, but the training process no longer crashes. So we recommend pulling the latest update, reinstalling yolox, and retrying your experiment.
Thx. |
Do you have a specific setup in which this problem doesn't occur? |
Sorry, guys... we found some errors in the above update and have to revert to the original spawn version. We are currently rewriting the whole dataloader to work around this bug.
@ruinmessi Locally, I reworked https://gist.github.com/mprostock/2850f3cd465155689052f0fa3a177a50 and saw that the original version "leaks" memory while my numpy array-based one does not. However, when I run it in YOLOX, I still get the memory leak. So maybe something downstream like the
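(For context on the linked gist: it compares a Dataset backed by a large Python list, whose memory pages get duplicated into each forked worker via copy-on-write because reading the elements touches their reference counts, against one backed by a single numpy array. A minimal sketch of the two variants, with made-up sizes rather than anything from YOLOX:)

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ListBackedDataset(Dataset):
    """Holds a large Python list; in forked workers, reading elements
    updates their refcounts, so copy-on-write slowly duplicates the list
    into every worker and RSS appears to "leak"."""
    def __init__(self, n=100_000):
        self.items = [np.random.rand(16).astype(np.float32) for _ in range(n)]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return torch.from_numpy(self.items[idx])


class ArrayBackedDataset(Dataset):
    """Holds one contiguous numpy array; reading rows does not touch
    per-object refcounts, so the pages stay shared between workers."""
    def __init__(self, n=100_000):
        self.items = np.random.rand(n, 16).astype(np.float32)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return torch.from_numpy(self.items[idx])
```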
First of all, thank you for your awesome work. Sorry to bother the author, but I would like to know which part of the code caused the problem, since you said you found the reason for it.
@yonomitt @VantastyChen @JinYAnGHe Finally... we fixed the bug, and the training memory curve currently looks like this (yolox-tiny with batch size 128 on 8 GPUs):
Well, we found two major issues. One is in our COCO dataset: we have to load all annotations during initialization. The other is the loss logging: we have to detach the loss and log its scalar value instead of logging the tensor itself.
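(To illustrate the second point, a minimal sketch, not the actual YOLOX trainer code: if raw loss tensors are kept around for logging, each one retains its computation graph, so host memory grows every iteration; detaching and taking the scalar value avoids that.)

```python
import torch

loss_history = []

def log_loss(loss: torch.Tensor) -> None:
    # BAD: keeping the loss tensor itself also keeps its entire autograd
    # graph alive, so memory grows a little every iteration:
    #   loss_history.append(loss)

    # GOOD: store only the detached Python float for logging.
    loss_history.append(loss.detach().item())

# example usage with a toy loss
x = torch.randn(4, requires_grad=True)
log_loss((x ** 2).sum())
```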
Thanks for your awesome work!
Thanks! |
Could you kindly tell me how you implemented this part? |
@GOATmessi7 I am facing similar memory leak issue. Could you provide your solution here? Thanks |
During training, RAM usage continues to grow. Finally, the training process stopped. Is it a bug?
2021-07-23 14:46:56 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2000/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 1.2, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:19
2021-07-23 14:47:04 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2050/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.0, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.955e-03, size: 320, ETA: 16:27:11
2021-07-23 14:47:13 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2100/3905, mem: 1730Mb, iter_time: 0.165s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.8, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:27:02
2021-07-23 14:47:21 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2150/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 3.3, iou_loss: 1.9, l1_loss: 0.0, conf_loss: 1.0, cls_loss: 0.5, lr: 9.954e-03, size: 320, ETA: 16:26:54
2021-07-23 14:47:30 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2200/3905, mem: 1730Mb, iter_time: 0.177s, data_time: 0.002s, total_loss: 2.3, iou_loss: 1.3, l1_loss: 0.0, conf_loss: 0.6, cls_loss: 0.4, lr: 9.954e-03, size: 320, ETA: 16:26:51
2021-07-23 14:47:38 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2250/3905, mem: 1730Mb, iter_time: 0.166s, data_time: 0.001s, total_loss: 2.9, iou_loss: 1.7, l1_loss: 0.0, conf_loss: 0.7, cls_loss: 0.5, lr: 9.953e-03, size: 320, ETA: 16:26:43
2021-07-23 14:47:47 | INFO | yolox.core.trainer:237 - epoch: 9/100, iter: 2300/3905, mem: 1730Mb, iter_time: 0.168s, data_time: 0.001s, total_loss: 2.4, iou_loss: 1.5, l1_loss: 0.0, conf_loss: 0.5, cls_loss: 0.4, lr: 9.953e-03, size: 320, ETA: 16:26:36
------------------------stopped here--------------------
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:3D:00.0 Off | N/A |
| 28% 50C P2 109W / 250W | 2050MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:3E:00.0 Off | N/A |
| 27% 49C P2 103W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... On | 00000000:41:00.0 Off | N/A |
| 25% 48C P2 117W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... On | 00000000:42:00.0 Off | N/A |
| 28% 50C P2 113W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... On | 00000000:44:00.0 Off | N/A |
| 16% 27C P8 21W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... On | 00000000:45:00.0 Off | N/A |
| 28% 50C P2 110W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce RTX 208... On | 00000000:46:00.0 Off | N/A |
| 24% 47C P2 95W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 7 GeForce RTX 208... On | 00000000:47:00.0 Off | N/A |
| 26% 49C P2 99W / 250W | 2086MiB / 11019MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
Is it that RAM usage keeps growing until it overflows, causing the program to stop?