Training stopped after 31 epochs without printing any error msg #91
Comments
We suspect something is wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now, you can check your NCCL configuration and try to resume the training.
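A minimal sketch of those two suggestions, assuming the standard YOLOX training CLI; the experiment file, batch size, and checkpoint path below are hypothetical placeholders, while `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables for surfacing communicator problems:

```python
# Sketch only: enable NCCL debug logging, then relaunch training from the
# latest checkpoint. Paths and batch size are placeholders, not the
# reporter's actual configuration.
import os
import subprocess

os.environ["NCCL_DEBUG"] = "INFO"        # print NCCL setup/transport info
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # verbose output for all NCCL subsystems

subprocess.run(
    [
        "python", "tools/train.py",
        "-f", "exps/example/yolox_voc/yolox_voc_s.py",      # hypothetical exp file
        "-b", "16",                                          # hypothetical batch size
        "--resume",
        "-c", "YOLOX_outputs/yolox_voc_s/latest_ckpt.pth",   # hypothetical checkpoint
    ],
    env=os.environ,
)
```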
I also ran into this problem when training on the VOC dataset. I observed that memory usage keeps growing, and the process exits once memory is exhausted. Is there a memory leak?
@lidehuihxjz You could try reducing num_workers.
Hi, I reduced num_workers from 4 to 2, and memory usage still keeps growing.
Is it GPU memory or system memory that grows? Were there any other abnormal messages?
It is system memory that grows; no other abnormal messages were reported.
Then try setting workers to 0 first and see what happens. If there is still a leak, the problem may be in the dataloader.
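For reference, a minimal sketch of that suggestion using the standard PyTorch `DataLoader` API; `dataset` and the batch size are placeholders for whatever the actual training setup uses:

```python
# Sketch only: rebuild the DataLoader with num_workers=0 so all loading
# happens in the main process. If memory still grows, the leak is likely
# in the dataset/loading code itself rather than in worker-process handling.
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,          # existing dataset object (placeholder name)
    batch_size=16,    # placeholder batch size
    shuffle=True,
    num_workers=0,    # no worker processes: isolates multiprocessing issues
    pin_memory=True,
)
```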
Yes, I ran into the same problem.
With num_workers set to 0, memory still grows, just more slowly.
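One way to confirm and quantify such a leak (not from the original thread) is to log the training process's resident memory once per epoch; `psutil.Process().memory_info().rss` is a real API, and the `log_rss` helper below is a hypothetical name:

```python
# Sketch only: record the process's resident set size (RSS) per epoch.
# Steady growth across epochs, even with num_workers=0, points at objects
# being retained in the training loop (e.g. loss tensors appended to a
# Python list without calling .item()).
import os
import psutil

proc = psutil.Process(os.getpid())

def log_rss(epoch: int) -> None:
    rss_mb = proc.memory_info().rss / 1024 ** 2
    print(f"epoch {epoch}: RSS = {rss_mb:.1f} MB")

# call log_rss(epoch) at the end of each training epoch
```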
This issue is solved in #216
The same error still occurs. Version: YOLOX 0.3.0
Ubuntu 18.04
CUDA 10.1
PyTorch 1.7.1