💡 Your Question

Hello, thanks for this useful repository!

I've been trying to train YOLO-NAS on the COCO dataset, but the training process stops after only a few epochs, with a Killed message appearing in the standard output. According to the log file, GPU memory usage fluctuates significantly. Could a memory leak be causing this issue?
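A quick way to check whether allocated GPU memory really keeps growing across epochs (a minimal sketch assuming a plain PyTorch setup; the helper name is illustrative, not part of the actual training code) would be something like:

```python
import torch

def log_gpu_memory(tag: str) -> None:
    """Print current and peak GPU memory allocated by PyTorch tensors (illustrative helper)."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    allocated = torch.cuda.memory_allocated() / 1e9
    peak = torch.cuda.max_memory_allocated() / 1e9
    print(f"[{tag}] allocated: {allocated:.2f} GB, peak: {peak:.2f} GB")

# Called once per epoch: a value that climbs steadily epoch after epoch points
# to tensors being kept alive (a leak); a value that merely oscillates is
# normal fluctuation. torch.cuda.reset_peak_memory_stats() resets the peak
# counter between epochs.
```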
For reference, I've been using the following command to train the model:
It looks like you are using quite an old torch build with an old CUDA version. This may not be the root cause of the problem, but I suggest trying a recent torch version anyway. We were able to train YOLO-NAS on different hardware and operating systems and did not notice such issues.
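As a quick sanity check (a minimal sketch; super_gradients.__version__ is assumed to be available, as it is in recent releases), you can print the versions the training process actually runs with:

```python
import torch
import super_gradients

# Versions the training process actually runs with; an old torch build
# compiled against an old CUDA toolkit is worth ruling out first.
print("super-gradients:", super_gradients.__version__)
print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU available:", torch.cuda.is_available())
```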
Try looking at the dmesg output - you may find additional details about why the process was killed.
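For example (a minimal sketch; reading the kernel log may require sudo on some systems), something like this filters dmesg for OOM-killer entries:

```python
import subprocess

# When the kernel kills a process for exhausting host RAM, dmesg usually
# contains a line like "Out of memory: Killed process <pid> (python)".
output = subprocess.run(
    ["dmesg", "-T"],  # -T prints human-readable timestamps
    capture_output=True, text=True, check=False,
).stdout

for line in output.splitlines():
    if "out of memory" in line.lower() or "killed process" in line.lower():
        print(line)
```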
Suggestions:

- Install the latest SG (super-gradients)
- Install the latest torch
- Try running with num_workers=0 (see the sketch after this list). If training then completes, it could indicate a RAM issue - how much memory do you have installed?
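A rough sketch of the last suggestion (the import path, the registered dataloader name, the dataset_params/dataloader_params keys, and the use of psutil are all assumptions - adjust them to the dataloaders available in your super-gradients version):

```python
import psutil
from super_gradients.training import dataloaders

# How much RAM is installed / currently free? A bare "Killed" in stdout is
# usually the Linux OOM killer reacting to host RAM pressure, not GPU memory.
mem = psutil.virtual_memory()
print(f"Total RAM: {mem.total / 1e9:.1f} GB, available: {mem.available / 1e9:.1f} GB")

# Re-create the training dataloader with num_workers=0 so no worker processes
# hold extra copies of data in memory. If training then runs to completion,
# dataloader workers were the likely source of the RAM pressure.
train_loader = dataloaders.get(
    "coco2017_train_yolo_nas",                      # assumed registered name
    dataset_params={"data_dir": "/path/to/coco"},   # placeholder path
    dataloader_params={"num_workers": 0, "batch_size": 8},
)
```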