RMM Memory Leak after running for a while [QST] #75
Are you 100% sure that you don't have any other rogue allocations outside of RMM? Also, you say that you get a segmentation fault on the allocation; I would expect that if you were running out of memory, you'd get an OOM error code instead.
You can also enable logging when initializing RMM with this flag: https://github.com/rapidsai/rmm/blob/branch-0.7/include/rmm/rmm_api.h#L68. This will log every allocation/free, letting you plot your allocs/frees over time and see whether there are leaks in your calls.
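For what it's worth, turning that flag on looks roughly like this. A minimal sketch assuming the 0.7-era C API's `rmmOptions_t` struct and `rmmInitialize`; field, enum, and header names may differ in other versions:

```cpp
#include <cstdio>
#include <rmm/rmm_api.h>

int main() {
  rmmOptions_t options{};                     // zero-initialize all fields
  options.allocation_mode   = PoolAllocation; // pooled (cnmem) allocator
  options.initial_pool_size = 0;              // 0 = library default size
  options.enable_logging    = true;           // record every alloc/free
  if (rmmInitialize(&options) != RMM_SUCCESS) {
    std::fprintf(stderr, "rmmInitialize failed\n");
    return 1;
  }
  // ... run the workload, then inspect the log for unmatched allocations ...
  rmmFinalize();
  return 0;
}
```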
@lucafuji, any response to @jrhemstad's logging suggestion? I don't think there are sufficient details here for us to reproduce locally.
Yes, let me turn on logging and try.
Here is the log I got.
It seems that every time I run a query, memory usage keeps increasing; some memory that should be freed is apparently not being freed.
@harrism ^^
Hi @harrism, I think the problem is within PooledAllocation: when I initialize RMM without it (using the default allocation mode), no memory leak happens.
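If it helps others reproduce, the difference between the two runs presumably comes down to a single field at initialization. A sketch only, assuming the same 0.7-era options struct as above:

```cpp
#include <rmm/rmm_api.h>

// Sketch: the leak reproduces with PoolAllocation but not with plain
// CUDA allocation, so toggling this one field isolates the problem.
void initRmm(bool usePool) {
  rmmOptions_t options{};
  options.allocation_mode = usePool ? PoolAllocation         // cnmem pool
                                    : CudaDefaultAllocation; // raw cudaMalloc
  options.initial_pool_size = 0;  // only meaningful for PoolAllocation
  rmmInitialize(&options);
}
```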
I believe what is occurring is a limitation of cnmem, the allocator underlying RMM. If cnmem exceeds its initial pool size, it grows by individual calls to cudaMalloc that cannot be merged into a larger pool. These cudaMalloc calls are the size of the calls to cnmemMalloc, not larger. Therefore, if the application exceeds the initial pool size and then makes a lot of small allocations, you end up with a VERY fragmented upper memory that can't be defragmented. If the application exceeds the initial pool size and then makes a few LARGE allocations, the upper memory is not as fragmented, so it can be used effectively for suballocation. I think cuDF typically hits the latter case, but AresDB is hitting the former.

A workaround might be to initialize with a large fraction of the total GPU memory, e.g. 75% or 95%. (Just replace the 0 in the initial pool size; see the sketch below.)

Ultimately I would like to redesign the allocator to be smarter, with a parameter to control growth steps. Or it could fall back to cudaMalloc when the pool is exceeded, so that at least freeing the small allocations will make that memory usable again.
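A sketch of that workaround, sizing the initial pool from the free memory reported by `cudaMemGetInfo`; the options struct again assumes the 0.7-era API:

```cpp
#include <cuda_runtime.h>
#include <rmm/rmm_api.h>

// Start the pool at ~90% of currently free device memory, so cnmem
// rarely has to grow past the initial pool (sketch, not tested).
void initRmmBigPool() {
  size_t freeBytes = 0, totalBytes = 0;
  cudaMemGetInfo(&freeBytes, &totalBytes);

  rmmOptions_t options{};
  options.allocation_mode   = PoolAllocation;
  options.initial_pool_size = static_cast<size_t>(freeBytes * 0.9);
  rmmInitialize(&options);
}
```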
I'll keep this issue open as a placeholder for pool growth redesign for now. |
Here is a workaround that could be implemented, based on ideas in the PyTorch memory allocator. When the pool can't fill a request and cudaMalloc fails to extend the allocation, what about freeing all the cached allocations that aren't in use, and then calling cudaMalloc with the sum of all those freed allocations? This would be the equivalent of a slower coalescing of those smaller free blocks.

This will never be as good as @harrism's suggestion, but it could make that fix less frequently needed, at some extra cost.
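A rough sketch of that fallback; the cached-block bookkeeping here is hypothetical, not cnmem's actual internals:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for blocks the pool has cached (freed by the
// application but not yet returned to the driver).
struct CachedBlock { void* ptr; std::size_t size; };
static std::vector<CachedBlock> cachedBlocks;

void* allocWithCoalescingFallback(std::size_t size) {
  void* p = nullptr;
  if (cudaMalloc(&p, size) == cudaSuccess) return p;  // normal growth path

  // Growth failed: return every cached, unused block to the driver,
  // summing their sizes as we go ...
  std::size_t reclaimed = 0;
  for (const CachedBlock& b : cachedBlocks) {
    cudaFree(b.ptr);
    reclaimed += b.size;
  }
  cachedBlocks.clear();

  // ... then request one block covering the reclaimed space plus the
  // current allocation: in effect a slow, coarse coalescing of the small
  // free blocks. A real allocator would re-seed its pool from this block
  // and carve `size` bytes out of it.
  if (cudaMalloc(&p, reclaimed + size) != cudaSuccess) return nullptr;
  return p;
}
```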
All of the above require modifying or replacing cnmem, so no matter what, it's a big undertaking.
What is your question?
AresDB integrated RMM last week, and we have been running it in staging for a while.
We use pooled memory management and the default stream for memory allocation.
After about 30 minutes, it seems all the memory on one GPU card is exhausted, and a segmentation fault happens on the next memory allocation.
I don't think there are any memory leaks in our code, since it worked fine previously when we called cudaMalloc/cudaFree directly.
Here is the link to our code:
https://github.com/uber/aresdb/blob/master/memutils/memory/rmm_alloc.cu
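The pattern in that file is roughly the following. A hedged paraphrase, not the actual AresDB code; `RMM_ALLOC`/`RMM_FREE` are the 0.7-era convenience macros:

```cpp
#include <rmm/rmm.h>

// Sketch of the usage described above: pooled allocator already
// initialized, all allocations on the default stream (stream 0).
void* deviceAllocate(size_t bytes) {
  void* ptr = nullptr;
  RMM_ALLOC(&ptr, bytes, 0);  // 0 = default stream
  return ptr;
}

void deviceFree(void* ptr) {
  RMM_FREE(ptr, 0);
}
```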
Thank you so much!