
Training stopped after 31 epochs without printing any error message #91

Closed
deep-practice opened this issue Jul 23, 2021 · 11 comments

@deep-practice

Ubuntu 18.04
CUDA 10.1
PyTorch 1.7.1

@GOATmessi8 (Member)

We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume.
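As a rough illustration, resuming might look like the sketch below. This is only a sketch: it assumes the standard tools/train.py entry point with its -f/-d/-b/-c/--resume options, the experiment file and checkpoint path are placeholders, and NCCL_DEBUG=INFO is set purely to surface NCCL setup problems in the log.

```python
# Rough sketch (assumptions: tools/train.py exists with -f/-d/-b/-c/--resume
# options; the experiment file and checkpoint path below are placeholders).
import os
import subprocess

env = dict(os.environ, NCCL_DEBUG="INFO")  # make NCCL print its setup/communication details
subprocess.run(
    [
        "python", "tools/train.py",
        "-f", "exps/example/yolox_voc/yolox_voc_s.py",      # placeholder experiment file
        "-d", "1",                                          # number of GPUs
        "-b", "16",                                         # total batch size
        "--resume",                                         # resume from a saved checkpoint
        "-c", "YOLOX_outputs/yolox_voc_s/latest_ckpt.pth",  # placeholder checkpoint path
    ],
    env=env,
    check=True,
)
```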

@lidehuihxjz

> We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume.

I also hit this problem when training on the VOC dataset: memory usage keeps growing, and the process exits once memory is exhausted. Is there a memory leak?

@GOATmessi8 (Member)

@lidehuihxjz You can try reducing num_workers.
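For example, in a custom experiment file the worker count could be lowered like this. This is a sketch only; it assumes the Exp base class exposes a data_num_workers field that ends up as the PyTorch DataLoader's num_workers.

```python
# Sketch of lowering the dataloader worker count in a custom experiment file.
# Assumption: the Exp base class exposes `data_num_workers`, which is handed
# to the DataLoader as num_workers. Fewer workers means fewer dataset copies
# held by dataloader subprocesses, at the cost of slower data loading.
from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.data_num_workers = 2  # try 2, or 0 to take worker processes out of the picture
```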

@lidehuihxjz

> @lidehuihxjz You can try reducing num_workers.

Hi, I lowered num_workers from 4 to 2, but memory usage still keeps growing.
I'm training yolox_nano on VOC2007; apart from switching the dataloader-related parts to VOC, nothing else in the code was changed.

@GOATmessi8 (Member)

Is it GPU memory or host memory that is growing? Is there any other abnormal output?

@lidehuihxjz

> Is it GPU memory or host memory that is growing? Is there any other abnormal output?

It's host memory that is growing; no other abnormal messages are reported.
With a single GPU and batch size 16, memory grows by about 1 GB every ten-odd minutes.

@GOATmessi8 (Member)

> Is it GPU memory or host memory that is growing? Is there any other abnormal output?
>
> It's host memory that is growing; no other abnormal messages are reported. With a single GPU and batch size 16, memory grows by about 1 GB every ten-odd minutes.

Then set the workers to 0 first and see what happens. If it still leaks, the problem is probably somewhere in the dataloader.

@JinYAnGHe

> We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume.
>
> I also hit this problem when training on the VOC dataset: memory usage keeps growing, and the process exits once memory is exhausted. Is there a memory leak?

Yes, I am running into the same problem.

@lidehuihxjz

> Is it GPU memory or host memory that is growing? Is there any other abnormal output?
>
> It's host memory that is growing; no other abnormal messages are reported. With a single GPU and batch size 16, memory grows by about 1 GB every ten-odd minutes.
>
> Then set the workers to 0 first and see what happens. If it still leaks, the problem is probably somewhere in the dataloader.

With num_workers set to 0, memory still grows, just more slowly.
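For reference, one generic way to track this (not specific to YOLOX) is to log the process's resident memory every few hundred iterations, for example with psutil:

```python
# Generic diagnostic sketch: periodically log the resident set size (RSS) of
# the training process so the growth rate can be measured per iteration.
import os
import psutil

_proc = psutil.Process(os.getpid())


def log_rss(step, every=100):
    """Print the process RSS in MiB every `every` steps."""
    if step % every == 0:
        rss_mib = _proc.memory_info().rss / (1024 ** 2)
        print(f"step {step}: rss = {rss_mib:.1f} MiB")


# Hypothetical usage inside the training loop:
# for step, batch in enumerate(train_loader):
#     ...  # forward/backward/optimizer step
#     log_rss(step)
```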

@FateScript (Member)

This issue is solved in #216

@ladyxuxu

ladyxuxu commented Jul 6, 2022

The same error still occurs; version: YOLOX 0.3.0.
