High Memory Usage in DataLoader Workers Leading to Out-of-Memory (OOM) #196
Comments
I'm running into the same problem!
Thank you for your response and for checking the logs. I've been investigating further, and I was able to mitigate the issue somewhat by reducing the number of workers in the DataLoader, although the …
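For reference, a minimal sketch of that mitigation, assuming a standard PyTorch `DataLoader`; the dataset class and its parameters are placeholders, not the actual training code:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyLazyDataset(Dataset):
    """Stand-in for the custom lazy-loading dataset (hypothetical)."""
    def __len__(self):
        return 1000
    def __getitem__(self, idx):
        # Load/construct the sample lazily here instead of caching it on the worker.
        return torch.randn(3, 224, 224), idx % 10

loader = DataLoader(
    ToyLazyDataset(),
    batch_size=8,
    num_workers=2,             # reduced worker count: each worker holds its own copy of the dataset
    persistent_workers=False,  # let workers be torn down between epochs, releasing their memory
    pin_memory=True,
)
```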
I checked our wandb logs (randomly selecting runs longer than 300 minutes), and there is indeed a sign of leakage (mainly due to `decord`).
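For anyone hitting the `decord` side of this, a hedged sketch of a loading pattern that keeps each worker's footprint bounded, assuming decord's `VideoReader`/`get_batch` API; the video path and frame count are illustrative only:

```python
# Keeping VideoReader objects (and their internal frame caches) alive across
# samples is one common way memory accumulates. Opening the reader inside the
# loading function and dropping it immediately keeps each call self-contained.
import gc
import numpy as np
from decord import VideoReader, cpu

def load_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    vr = VideoReader(video_path, ctx=cpu(0))      # open per call, not per worker lifetime
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = vr.get_batch(idx).asnumpy()          # copy frames out of decord's buffers
    del vr                                        # release the reader explicitly
    gc.collect()
    return frames
```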
I don't have time to provide an MR to fix this, but I found the issue on this line: https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/train/train.py#L566. Don't deepcopy the tokeniser; instead, add the token on the fly or pass in the appropriate tokeniser object that you want to modify. Good luck!
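A rough sketch of what that suggestion could look like, assuming a Hugging Face tokenizer; the model name and `<image>` token here are illustrative, and the actual code at that line may differ:

```python
from transformers import AutoTokenizer

# Register the special token once, up front, instead of deep-copying the
# tokenizer and re-adding the token for every preprocessing call.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")  # model name is illustrative
num_added = tokenizer.add_tokens(["<image>"], special_tokens=True)
print(f"added {num_added} token(s); <image> id = {tokenizer.convert_tokens_to_ids('<image>')}")

def preprocess(text: str):
    # Reuse the already-modified tokenizer; no per-call copy, so nothing
    # accumulates in the DataLoader workers.
    return tokenizer(text, return_tensors="pt")
```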
Are you sure? It seems that the deepcopy is fine there. The copied tokenizer should be freed once the code runs out of the scope of the function.
It was definitely a source of memory leakage for me. Give it a try; sadly, I'm still unable to prepare the MR :(
We added `del tokenizer` and it solved the issue. I hardly know why, but it works. Ahhh, I guess I have to say "amazing".
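For anyone trying this, a sketch of where such a `del` could go, assuming the preprocessing path deep-copies the tokenizer as discussed above; the function and argument names are made up for illustration:

```python
import copy
import gc

def preprocess_with_copy(sources, tokenizer):
    local_tok = copy.deepcopy(tokenizer)               # per-call copy, as in the original code path
    local_tok.add_tokens(["<image>"], special_tokens=True)
    encoded = [local_tok(s, return_tensors="pt") for s in sources]
    del local_tok                                      # the reported fix: drop the copy explicitly
    gc.collect()                                       # encourage immediate reclamation in the worker
    return encoded
```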
It seems like `del tokenizer` doesn't work for me. The memory usage still keeps climbing......
I'm experiencing high memory usage in the DataLoader workers when using a custom dataset class for lazy loading large datasets. This leads to Out-of-Memory (OOM) errors during training. I've observed that the MaxRSS (maximum resident set size) steadily increases during training, indicating potential memory leaks or improper memory management in the DataLoader or dataset preprocessing.
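A standard-library snippet like the following can be used to log MaxRSS per step and confirm the growth (the step cadence and label are arbitrary):

```python
import resource
import sys

def log_max_rss(step: int) -> None:
    """Print the process's maximum resident set size so far, in GiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KiB on Linux, bytes on macOS
    divisor = 1024 ** 2 if sys.platform != "darwin" else 1024 ** 3
    print(f"[step {step}] MaxRSS: {rss / divisor:.2f} GiB")

# Example: call every N steps inside the training loop.
log_max_rss(step=0)
```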
Error Message Example:
RuntimeError: DataLoader worker (pid XXXX) is killed by signal: Killed
Setup: Distributed training with 3 nodes, 4 GPUs per node
Memory: 512 GB RAM
Training Configuration
Here are the relevant training configurations used: