Finer control over GPU memory (access to cu*Destroy from Python) #370

CarlosRDomin · 2022-08-08T07:19:44Z

Is your feature request related to a problem? Please describe.
We recently identified a (GPU) memory leak in a routine that creates a new cuda stream on a given context every time it is called. Even though the scope of the stream variable is constrained to this function, it doesn't seem to be garbage collected. Every N times the function is called, the total GPU memory allocated (measured through mem_get_info() or nvidia-smi) increases by a small amount (e.g. 2MB with N=16). After enough times, we completely starve our GPU's memory. See details on a minimal example to replicate in the Additional context section.

Describe the solution you'd like
I'd like the destructor of Stream objects to be exposed (e.g. through an explicit destroy method), so that we could force free that memory, or a mechanism to ensure streams out of scope are automatically freed.

Describe alternatives you've considered
So far, the only approach that seems to get the job done (although not ideal) is to detach the context and create a new one every once in a while. However, this has latency and performance drops associated with it, and involves having to reinitialize any other tasks that were running in that context.

Additional context
Minimal example to reproduce:

import gc
from pycuda import driver as cuda
cuda.init()
ctx = cuda.Device(0).make_context()

def create_streams(N=16):
    return [cuda.Stream() for _ in range(N)]

def print_memory():
    #ctx.push()
    free, total = cuda.mem_get_info()
    print(f"Used: {(total - free) / (1024**2)}MB")
    #ctx.pop()

print_memory()  # x MB
# Allocate some streams
streams = create_streams(N=10)
print_memory()  # x+2 MB
# Allocate some more
streams.extend(create_streams(N=20))
print_memory()  # x+4 MB
# Delete all streams
del streams
gc.collect()  # Even trying to force the garbage collector won't do anything
print_memory()  # Still x+4 MB instead of x MB :(

ctx.pop()

prints out:

Used: 5651.25MB
Used: 5653.25MB
Used: 5655.25MB
Used: 5655.25MB

The text was updated successfully, but these errors were encountered:

inducer · 2022-08-09T20:26:36Z

Thanks for this request! I would be happy to consider a PR that adds this. This should be simple to do. Within the stream object here, simply track whether the stream has already been destroyed, and make sure to raise an appropriate exception when handle is called:

pycuda/src/cpp/cuda.hpp

Lines 994 to 1041 in db77dc7

    
             class stream : public boost::noncopyable, public context_dependent 
        
             { 
        
               private: 
        
                 CUstream m_stream; 
        
               public: 
        
                 stream(unsigned int flags=0) 
        
                 { CUDAPP_CALL_GUARDED(cuStreamCreate, (&m_stream, flags)); } 
        
                 ~stream() 
        
                 { 
        
                   try 
        
                   { 
        
                     scoped_context_activation ca(get_context()); 
        
                     CUDAPP_CALL_GUARDED_CLEANUP(cuStreamDestroy, (m_stream)); 
        
                   } 
        
                   CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(stream); 
        
                 } 
        
                 void synchronize() 
        
                 { CUDAPP_CALL_GUARDED_THREADED(cuStreamSynchronize, (m_stream)); } 
        
                 CUstream handle() const 
        
                 { return m_stream; } 
        
                 intptr_t handle_int() const 
        
                 { return (intptr_t) m_stream; } 
        
           #if CUDAPP_CUDA_VERSION >= 3020 
        
                 void wait_for_event(const event &evt); 
        
           #endif 
        
                 bool is_done() const 
        
                 { 
        
                   CUDAPP_PRINT_CALL_TRACE("cuStreamQuery"); 
        
                   CUresult result = cuStreamQuery(m_stream); 
        
                   switch (result) 
        
                   { 
        
                     case CUDA_SUCCESS: 
        
                       return true; 
        
                     case CUDA_ERROR_NOT_READY: 
        
                       return false; 
        
                     default: 
        
                       CUDAPP_PRINT_ERROR_TRACE("cuStreamQuery", result); 
        
                       throw error("cuStreamQuery", result); 
        
                   } 
        
                 } 
        
             };

CarlosRDomin · 2022-08-15T09:49:48Z

After some more investigation, I was actually able to identify the cause of our memory leak! Perhaps this is already documented somewhere but we weren't explicitly pushing the current CUDA context at the time that Python's garbage collector would delete the stream, and therefore cuStreamDestroy was never getting called.

The pycuda.Stream's destructor was in fact getting correctly called after our function returned, but for some reason this exception was silently thrown (so we never found out):

pycuda/src/cpp/cuda.hpp

Lines 967 to 969 in db77dc7

    
           if (boost::this_thread::get_id() != m_context->thread_id()) 
        
             throw pycuda::cannot_activate_out_of_thread_context( 
        
                 "cannot activate out-of-thread context");

I assume there's a good reason to raise the exception instead of pushing the context from the current thread, so we implemented a method on our end to destroy each Stream we create by wrapping the del self.stream with a context push/pop. Now our GPU memory usage is as flat as a pancake :)

I'm happy to contribute to any docs if you believe this could be helpful to someone else, otherwise I'm happy to close this issue.

CarlosRDomin · 2022-08-15T12:33:14Z

For any reader that comes across this, here's a minimal example to reproduce our leak (and how to fix it):

import gc
from pycuda import driver as cuda
from threading import Thread

cuda.init()
ctx = cuda.Device(0).make_context()
ctx.pop()

def create_streams(N=16):
    ctx.push()
    streams = [cuda.Stream() for _ in range(N)]
    ctx.pop()
    return streams

def print_memory():
    ctx.push()
    free, total = cuda.mem_get_info()
    print(f"{(total - free) / (1024 ** 2)}MB")
    ctx.pop()

def main_loop(N=10, push_ctx=False):
    for i in range(N):
        print(f"-- {i=} --")
        # Create some CUDA streams
        streams = create_streams()
        # Delete them (wrap the deletion with a context push/pop if selected)
        if push_ctx:
            ctx.push()
        del streams
        if push_ctx:
            ctx.pop()
        # Print memory usage. If there's a leak, this should increase on every iter
        print_memory()
        print("-" * 10)

# 1) Create and delete streams several times in the main thread.
# This doesn't leak memory because the stream destructor will be called
# from the same thread that created the context, so pycuda handles it nicely
main_loop()

# 2) Repeat step 1 from a separate thread.
# This will leak memory because the stream destructor isn't able to call
# cuStreamDestroy (isn't able to activate the context)
t = Thread(target=main_loop)
t.start()
t.join()

# 3) Repeat step 2 but pushing the context before deleting the streams. NO LEAK
t = Thread(target=main_loop, kwargs={"push_ctx": True})
t.start()
t.join()

Bottom line is: if you use multi-threading, remember to push the context before deleting the stream(s)!

inducer · 2022-08-15T23:57:30Z

Thanks for following up! I took a look why there is no warning about the silenced exception.

Turns out it's getting caught here:

pycuda/src/cpp/cuda.hpp

Line 1010 in a25ed98

CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(stream);

and then simply silenced here:

pycuda/src/cpp/cuda.hpp

Lines 161 to 171 in a25ed98

    
           #define CUDAPP_CATCH_CLEANUP_ON_DEAD_CONTEXT(TYPE) \ 
        
             catch (pycuda::cannot_activate_out_of_thread_context) \ 
        
             { } \ 
        
             catch (pycuda::cannot_activate_dead_context) \ 
        
             { \ 
        
               /* PyErr_Warn( \ 
        
                   PyExc_UserWarning, #TYPE " in dead context was implicitly cleaned up");*/ \ 
        
             } 
        
             // In all likelihood, this TYPE's managing thread has exited, and 
        
             // therefore its context has already been deleted. No need to harp 
        
             // on the fact that we still thought there was cleanup to do.

That struck me as a bad idea, given that that warning likely could have saved you a few hours of grief.

https://gitlab.tiker.net/inducer/pycuda/-/merge_requests/80 adds a warning in that case.

CarlosRDomin added the enhancement label Aug 8, 2022

totaam mentioned this issue Aug 15, 2022

nvjpeg decoder Xpra-org/xpra#3504

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finer control over GPU memory (access to cu*Destroy from Python) #370

Finer control over GPU memory (access to cu*Destroy from Python) #370

CarlosRDomin commented Aug 8, 2022 •

edited

Loading

inducer commented Aug 9, 2022

CarlosRDomin commented Aug 15, 2022

CarlosRDomin commented Aug 15, 2022

inducer commented Aug 15, 2022

Finer control over GPU memory (access to cu*Destroy from Python) #370

Finer control over GPU memory (access to cu*Destroy from Python) #370

Comments

CarlosRDomin commented Aug 8, 2022 • edited Loading

inducer commented Aug 9, 2022

CarlosRDomin commented Aug 15, 2022

CarlosRDomin commented Aug 15, 2022

inducer commented Aug 15, 2022

CarlosRDomin commented Aug 8, 2022 •

edited

Loading