Describe the bug
When training a larger model that consumes more memory, I noticed that training would stop after a roughly constant number of epochs. Upon further investigation, I found that during training/validation, CPU memory (RAM) usage keeps rising and never gets released, which results in an OOM after a number of epochs. This happens with the Lhotse dataloader: while I was able to train a small fast-conformer CTC model for hundreds of epochs, the same version of NeMo can only train an XL fast-conformer CTC model for ~20 epochs (~110 epochs if I use 1/4 of the workers for the dataloader). So somehow the dataloader is not releasing the memory for the data it has loaded.
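To make the growth visible per epoch, here is a minimal monitoring sketch, assuming psutil is installed; it is not part of my training code, just how one could log host RSS (including any live dataloader workers) from a PyTorch Lightning callback:

```python
import psutil
from pytorch_lightning import Callback


class HostMemoryMonitor(Callback):
    """Log host RSS of the main process plus any live child
    processes (e.g. dataloader workers) at the end of each epoch."""

    def _log_rss(self, stage: str) -> None:
        proc = psutil.Process()
        rss = proc.memory_info().rss + sum(
            c.memory_info().rss for c in proc.children(recursive=True)
        )
        print(f"[{stage}] host RSS: {rss / 1024**3:.2f} GiB")

    def on_train_epoch_end(self, trainer, pl_module) -> None:
        self._log_rss("train")

    def on_validation_epoch_end(self, trainer, pl_module) -> None:
        self._log_rss("val")
```

With a leak like the one described, the logged RSS climbs monotonically across epochs instead of plateauing after the first one.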
Steps/Code to reproduce bug

The issue shows up with the XL fast-conformer CTC model and the medium fast-conformer hybrid CTC/RNNT model, using the standard configs for these models in terms of hyperparameters. A dataloader-only repro sketch follows.
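For a repro that isolates the dataloader from the model, something along these lines should show the RSS growth on its own; the manifest path, durations, and worker count below are placeholders, and this approximates (rather than exactly reproduces) what the NeMo Lhotse dataloader does internally:

```python
import psutil
from torch.utils.data import DataLoader
from lhotse import CutSet, Fbank
from lhotse.dataset import (
    DynamicBucketingSampler,
    K2SpeechRecognitionDataset,
    OnTheFlyFeatures,
)

cuts = CutSet.from_file("cuts.jsonl.gz")  # placeholder cuts manifest
sampler = DynamicBucketingSampler(cuts, max_duration=300.0, shuffle=True, num_buckets=10)
dataset = K2SpeechRecognitionDataset(input_strategy=OnTheFlyFeatures(Fbank()))
loader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=8)

proc = psutil.Process()
for epoch in range(30):
    sampler.set_epoch(epoch)  # reshuffle each epoch, as during training
    for _ in loader:
        pass  # just consume batches; no model involved
    rss = proc.memory_info().rss + sum(
        c.memory_info().rss for c in proc.children(recursive=True)
    )
    print(f"epoch {epoch}: total host RSS {rss / 1024**3:.2f} GiB")
```

If the RSS printed here keeps climbing epoch over epoch, the leak is in the data loading path rather than the model.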
Expected behavior
Training should continue until the specified stopping point (e.g. max_epochs), rather than being killed by a CPU OOM.
Environment overview
Environment details
Additional context
GPU: A100 40G.
@nithinraok has suggested trying limit_validation_batches, using shorter-duration audio, and using fully_randomized, which I haven't fully tested yet. I'll report back when these are tested, but the issue persists so far.

Edit: tried these suggestions, still OOM.
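For reference, here is roughly where the suggested knobs would live, written as an OmegaConf sketch. `limit_val_batches` is the standard PyTorch Lightning trainer option behind the limit_validation_batches suggestion; the `train_ds` keys are my assumptions about the NeMo Lhotse dataloader config names and should be checked against the config for your NeMo version. I have left out fully_randomized because I am not sure of its exact key:

```python
from omegaconf import OmegaConf

overrides = OmegaConf.create(
    {
        # Cap how many validation batches run per epoch.
        "trainer": {"limit_val_batches": 8},
        "model": {
            "train_ds": {
                # Filter out long cuts ("smaller duration audios"); assumed key name.
                "max_duration": 20.0,
                # Fewer workers slowed the leak for me (~20 -> ~110 epochs).
                "num_workers": 2,
            }
        },
    }
)
print(OmegaConf.to_yaml(overrides))
```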