Memory leak when passing images of different dimensions with MXNET_CUDNN_AUTOTUNE_DEFAULT #12662
Comments
@mxnet-label-bot [Bug, Memory]
I think we discovered this before. Somehow I can't find the ticket, but I did some memory-consumption graphs on the Jetson board which indicated a leak in CUDA / cuDNN. We notified NVIDIA about this before. Maybe @DickJC123 has more info.
Is the memory released across executions of the script?
@larroy, when searching for the best algorithm the memory consumption spikes, and once it is found it goes down, but not to the state it was in before the search started.
I have an explanation but I'll have to think about the best fix. The problem starts with the fact that cudnnFind() does its own workspace allocations and doesn't use MXNet's memory allocator. MXNet anticipates this by setting up a 'headroom' via MXNET_GPU_MEM_POOL_RESERVE (a percentage of total memory). I was able to run your script with repeated allocations on a 16GB GPU by setting MXNET_GPU_MEM_POOL_RESERVE=35. On a 12GB GPU, the corresponding value would be 47!! That's clearly excessive, so we might have to resort to calling the 'Ex' flavor of cudnnFind, which allows for pre-screening of algos that have a workspace greater than the threshold set by the convolution instance 'workspace' param.
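For reference, a minimal sketch of raising that headroom via the environment variable; the 35% figure is the one quoted above for a 16GB GPU, and the variable must be set before MXNet allocates its GPU memory pool.

```python
# Sketch: reserve extra GPU-memory headroom so that cudnnFind()'s own workspace
# allocations fit outside MXNet's memory pool. 35 (%) is the value reported
# above for a 16 GB GPU; adjust for your card.
import os
os.environ["MXNET_GPU_MEM_POOL_RESERVE"] = "35"

import mxnet as mx  # import after setting the variable so the pool picks it up
```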
I had been working up a PR with a number of improvements to the way MXNet uses cudnnFind. In the PR's new serialized approach to running cudnnFind, a fix to your problem became possible. Please see #12804.
Can this be closed, given the previous fix by @DickJC123?
Yes, I consider the issue fixed. I tried to reproduce it with
How did you measure the memory? Do you think it makes sense to add a regression test?
As part of #12804, I took the minimal example code provided here and turned it into the initially failing unittest test_gluon_gpu.py:test_large_models. To make the test universal, I had its operation scale with the amount of available memory, as provided by a new Python API addition, gpu_memory_info(). After CI showed the test failing, I pushed the fix that corrected the issue. So in summary, we have a new function to measure memory and a new unittest for this issue.
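A minimal illustrative sketch of the new API (not the actual unittest, which lives in test_gluon_gpu.py): gpu_memory_info() returns the free and total device memory in bytes, which a test can use to scale its workload.

```python
# Sketch of the new memory-query API; the real regression test is
# test_gluon_gpu.py:test_large_models in the MXNet test suite.
import mxnet as mx

free_bytes, total_bytes = mx.context.gpu_memory_info(0)  # (free, total) in bytes
print("GPU 0: %.2f GiB free of %.2f GiB" % (free_bytes / 2**30, total_bytes / 2**30))
```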
Description
I have noticed that if I use MXNET_CUDNN_AUTOTUNE_DEFAULT=1 with large image dimensions (1900x1900), a lot of GPU memory is consumed after a forward pass and never released. Autotune then hits an out-of-memory exception if I pass a second image with different, but still large, dimensions (1800x1800).
The second image's dimensions are smaller, so my assumption was that since 1900x1900 could be processed, 1800x1800 should also be processed, as it requires less memory. That is not the case, however, because some of the GPU memory is not released after the first image is processed.
The main question for me is: why is GPU memory not released once the first image is processed? Something seems to be holding on to it; I suspect a memory leak or some sort of cache that is never freed.
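To make the scenario concrete, here is a minimal illustrative sketch (not the original reproducer): a single Gluon convolution layer, with autotuning enabled, fed two large images of different sizes.

```python
# Illustrative sketch of the reported scenario (not the original script):
# each new input shape triggers a fresh cuDNN autotune search.
import os
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "1"  # enable autotuning

import mxnet as mx
from mxnet.gluon import nn

ctx = mx.gpu(0)
net = nn.Conv2D(channels=64, kernel_size=3, padding=1)
net.initialize(ctx=ctx)

for side in (1900, 1800):  # the second, smaller image can still hit OOM during the search
    x = mx.nd.random.uniform(shape=(1, 3, side, side), ctx=ctx)
    net(x).wait_to_read()  # force the forward pass to complete
```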
Environment info (Required)
Package used (Python/R/Scala/Julia):
Python
Error Message:
Minimum reproducible example
Steps to reproduce
What have you tried to solve it?
Setting MXNET_CUDNN_AUTOTUNE_DEFAULT=0 seems to solve the problem. The inference time increases slightly, but memory appears to be properly reused.
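A minimal sketch of applying the workaround; the variable has to take effect before the first forward pass, e.g. set on the command line or before importing MXNet.

```python
# Workaround sketch: disable cuDNN autotuning (0 = off).
import os
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

import mxnet as mx  # import after setting the variable
```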