Memory (CPU and GPU) leaks during the 1st epoch #1510
Comments
By leak you mean tensors build up during epoch 1, but after that the memory stays constant? I.e., there is no more "leak" for epochs >= 2?
Yes, the memory stays constant after the 1st epoch ends (although the number of tensors begins increasing again).
The whole output of a training step is stored. I think there is a mistake in ...
Yes, I agree that with ...
Oh, no, sorry, just checked: it will leak even if we perform ...
Can you submit a PR? I thought we took care of all the metrics.
We take care of it in process_output(), but then in optimizer_closure() we return the original output_dict again.
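For illustration, a minimal, self-contained sketch of the pattern being described (not the actual pytorch-lightning source; the names optimizer_closure and outputs simply mirror the discussion above):

```python
import torch
from torch import nn

# Sketch of the pattern discussed above (not the actual Lightning code):
# the raw training-step output dict is returned from the closure and appended
# to an epoch-level list, so every tensor it references stays alive all epoch.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
outputs = []  # lives for the whole epoch

for batch_idx in range(1000):
    x, y = torch.randn(4, 10), torch.randn(4, 1)

    def optimizer_closure():
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        # Returned as-is: 'loss' and 'log' still reference live (GPU) tensors.
        return {"loss": loss, "log": {"train_loss": loss.detach()}}

    batch_output = optimizer_closure()
    optimizer.step()
    optimizer.zero_grad()
    outputs.append(batch_output)  # memory grows linearly within the epoch
```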
What about fp32 mode? There is no leak on the GPU in that case. What could be the reason?
@AratorField, do you mean this? Here is the list that stores all train-step outputs during the epoch.
But we detach everything. How could it leak?
Yes, but the tensors are still on the GPU after detach. So, with long epochs or large training-step outputs, the GPU memory will blow up after some time.
We could create something like _recursive_item(), or remove the keys loss, log, and progress_bar from batch_output before appending it to outputs.
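A minimal sketch of what such a _recursive_item() helper could look like (hypothetical, as proposed in the comment above; it is not an existing Lightning function):

```python
import torch

def _recursive_item(value):
    # Hypothetical helper (proposed above, not part of the Lightning API):
    # convert tensors in a nested output structure into plain Python values
    # so that no GPU memory is kept alive by the epoch-level outputs list.
    if torch.is_tensor(value):
        return value.item() if value.numel() == 1 else value.detach().cpu()
    if isinstance(value, dict):
        return {k: _recursive_item(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(_recursive_item(v) for v in value)
    return value
```

Applying it as outputs.append(_recursive_item(batch_output)) would store only plain numbers and CPU copies, at the cost of the extra synchronization that .item() implies (the tradeoff mentioned below).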
Is it in general good practice to store values during the epoch? The size of such a bookkeeping list is unbounded in the general case: one could have an almost infinite epoch and sooner or later be faced with OOM (GPU or CPU, it does not matter).
The thing is that .item() slows things down. The tradeoff is that we plug the memory leak but slow things down.
There is no reason to store loss, log, and progress_bar for the whole epoch.
Maybe it's possible to introduce a flag that controls whether we store tensors in this list during an epoch or not.
I even have no ...
Change it to ...
Thank you, I'll patch it locally for now.
I am using pytorch-lightning 1.5.10. Is there a patch for 1.5.10?
🐛 Bug
Hello.
This memory leak occurs during the first epoch. If an epoch takes a long time (mine took more than 10 days), an OOM error will eventually occur. Interestingly, in precision=16 mode it leaks on both the GPU and the CPU; if AMP optimization is switched off (precision=32), the leak occurs only on the CPU.
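For context, the two modes correspond roughly to Trainer configurations like these (0.7.x-era arguments; a sketch, not the exact script from the reproduction gist):

```python
from pytorch_lightning import Trainer

# fp16/AMP run: reported to leak on both the GPU and the CPU.
trainer_amp = Trainer(gpus=1, precision=16, amp_level="O2")

# Plain fp32 run (default precision=32): reported to leak only on the CPU.
trainer_fp32 = Trainer(gpus=1)
```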
I also checked the number of tensors tracked by the garbage collector. It increases linearly during the first epoch, and then (when the 2nd epoch starts) it drops back to the initial value and begins increasing again.
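One common way to count the tensors tracked by the garbage collector is a helper along these lines (a typical pattern, not necessarily the exact snippet used for the plots below):

```python
import gc
import torch

def count_tracked_tensors():
    # Count tensor objects currently tracked by Python's garbage collector.
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
        except Exception:
            # Some objects raise on inspection; skip them.
            pass
    return count
```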
Let me provide the plots:
Experiment 1 (amp_level='O2', precision=16), plots: the number of tensors tracked by the garbage collector; GPU memory usage (the 2nd GPU in my case), tracked by pytorch-lightning; CPU memory usage by the process, in bytes.
Experiment 2 (amp_level=None, precision=None), plots: the same three metrics as in Experiment 1.
As you can see, both cases have a CPU leak, and the amp case also has a GPU leak.
Also, it's clear that this leaky behavior stops when the 2nd epoch starts.
On these plots, the 2nd epoch starts at the 2nd "saw tooth" of the number-of-tensors plot.
Also, there is another observation: the number of tracked tensors grows by 1001 per training step. And this is my forward pass method:
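Roughly, it is shaped like the following sketch (illustrative names and sizes; the exact script is in the gist linked under Code sample), returning one loss tensor plus 1000 tensors under the log key:

```python
import torch
from torch import nn
import pytorch_lightning as pl

class LeakyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        # 1 tensor for 'loss' + 1000 tensors under 'log' = 1001 tensors per step,
        # matching the growth rate seen in the num-of-tensors plot.
        log = {f"dummy_metric_{i}": loss.detach() for i in range(1000)}
        return {"loss": loss, "log": log}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```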
Here I return exactly 1001 tensors: one for loss and 1000 for log. In my real experiments I had only 3 tensors; it took ~2-3 days to get OOM, but in the current example (see To Reproduce) it will crash much faster.
To Reproduce
Steps to reproduce the behavior:
Run the code sample below (this script has no arguments, so change the needed values manually in the script).
Code sample
https://gist.github.com/alexeykarnachev/47de06b93a717ab0664eded42ed2826a
Expected behavior
The number of tensors and the GPU and CPU memory usage do not increase during training.
Environment
PyTorch version: 1.4.0
OS: Ubuntu 16.04.6 LTS
Python version: 3.7
Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] pytorch-lightning==0.7.3
[pip] torch==1.4.0
[pip] torchvision==0.5.0
Additional context
Sorry for the messy flow of information, but I don't know how to structure it more clearly.