---
layout: page
title: Memory Debugging
nav_order: 10
parent: Developer Overview
---
The JVM uses a garbage collector that is generally triggered when the JVM heap runs low on free space. This means that if we rely on GC to free unused GPU memory or off-heap memory we are effectively going to leak that memory, because the JVM's GC will not be triggered when we run out of GPU memory or off-heap memory. Instead, we make everything that can hold a reference to memory not tracked directly by the JVM an `AutoCloseable`, and in many cases we extend this with reference counting too. All of this results in a lot of complexity and potential bugs. To deal with these bugs, the CUDF Java code tracks each of these objects, and if the garbage collector causes the memory to be freed instead of a proper close we output a warning in the logs.
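As a refresher on what correct usage looks like, here is a minimal sketch, assuming the `ai.rapids.cudf` Java API, that uses try-with-resources so every column is closed deterministically instead of waiting on GC.

```java
import ai.rapids.cudf.ColumnVector;

public class CloseExample {
  public static void main(String[] args) {
    // try-with-resources closes each AutoCloseable when the block exits,
    // so the device memory backing these columns is freed deterministically
    // rather than leaking until the GC happens to run.
    try (ColumnVector input = ColumnVector.fromInts(1, 2, 3, 4);
         ColumnVector doubled = input.add(input)) {
      System.out.println("rows: " + doubled.getRowCount());
    } // both columns are closed here, even if an exception was thrown
  }
}
```

When one of these objects is never closed and the GC ends up collecting it instead, the logged warning looks like the following.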
```
ERROR DeviceMemoryBuffer: A DEVICE BUFFER WAS LEAKED (ID: 5 30a000200)
ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 6 7fb62b07f570)
```
There are different messages for different resources, and they carry different amounts of information. Memory buffer messages generally include the type of buffer that leaked, an ID for the buffer, and lastly a hex representation of the buffer's address. These are here to help identify the same buffer in other contexts.
In the case of a `ColumnVector` or a `HostColumnVector` the messages are slightly different.

```
ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 15 7fb5f94d8fa0)
ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 19)
```
Here the ID can still be used for cross-referencing, in some cases, but the host column vector has no address. That is because it is made up of one or more `HostMemoryBuffer`s, and it is here to help debug why one of them leaked, as it holds references to them. A `ColumnVector` includes an address, which is the address of the C++ `cudf::column` that actually holds the memory at the C++ level. We generally don't reach into the memory held by a `cudf::column` directly, but this allows us to track it.
Sadly there are some resources that we currently don't track. The biggest one is `Scalar` values, followed closely by `Table`. A `Table` does not directly allocate any device memory, but it references one or more C++ class instances off heap.
Once a leak is detected we can turn on a special system property on the command line, `-Dai.rapids.refcount.debug=true`, to get a lot more information about the leak. We encourage you to also enable assertions with `-ea` in these cases. The reason this is off by default is that it keeps track of the stack trace each time a buffer or column has its reference count incremented or decremented; getting a stack trace is an expensive operation, and the traces can also use up a lot of the JVM heap. The output of this looks like:
```
ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 7 7f756c1337e0)
23/12/27 20:21:32 ERROR MemoryCleaner: Leaked vector (ID: 7): 2023-12-27 20:21:20.0002 GMT: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
ai.rapids.cudf.ColumnVector.incRefCountInternal(ColumnVector.java:298)
ai.rapids.cudf.ColumnVector.<init>(ColumnVector.java:120)
```
or
```
ERROR MemoryCleaner: double free ColumnVector{rows=6, type=INT32, nullCount=Optional[0], offHeap=(ID: 15 0)} (ID: 15): 2023-12-27 20:12:34.0607 GMT: INC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:90)
...
2023-12-27 20:12:34.0607 GMT: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.ColumnVector.close(ColumnVector.java:262)
...
2023-12-27 20:12:34.0607 GMT: DEC
java.lang.Thread.getStackTrace(Thread.java:1564)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:336)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:98)
ai.rapids.cudf.ColumnVector.close(ColumnVector.java:262)
...
```
Here an `INC` indicates that the reference count was incremented and a `DEC` indicates that it was decremented. With the stack traces we can walk through the code and try to line up which `DEC`s were supposed to go with which `INC`s and hopefully find out where something went wrong, like a double free or a leak.
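To make these traces concrete, here is a minimal, deliberately buggy sketch, assuming the `ai.rapids.cudf` Java API, of the two patterns the tool flags: closing more times than the reference count was incremented (a double free) and incrementing without a matching close (a leak).

```java
import ai.rapids.cudf.ColumnVector;

// Deliberately broken patterns, shown only to illustrate how INC and DEC
// traces pair up; do not copy these into real code.
public class RefCountBugs {
  static void doubleFree() {
    ColumnVector cv = ColumnVector.fromInts(1, 2, 3);
    cv.close();
    // BUG: the constructor was the only INC, so this second close() is an
    // unmatched DEC; with debugging enabled it is reported as the
    // "double free" message shown above.
    cv.close();
  }

  static void leak() {
    ColumnVector cv = ColumnVector.fromInts(4, 5, 6);
    // BUG: an extra INC with no matching close(). When the GC eventually
    // collects the object, the LEAKED message is logged along with this
    // INC's stack trace, which is what points us back at this line.
    cv.incRefCount();
    cv.close(); // only balances the INC from the constructor
  }
}
```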
The leak/double free detection and debugging is great, but it does not give us visibility into what is actually happening with memory allocation. It does not let us see leaks at the C++ level. These should be very rare, but we have found them in the past. It also does not let us see things like fragmentation when we are using a pooling allocator. RMM has logging that we can enable at startup to help us see at a very low level what is happening. `spark.rapids.memory.gpu.debug` can be set to either `STDERR` or `STDOUT` to see everything that is happening with the allocations.
```
1979932,14:21:19.998533,allocate,0x30a000000,256,0x2
1980043,14:21:32.661875,free,0x30a000000,256,0x2
```
The format of this is not really documented anywhere, but it uses the `logging_resource_adaptor` to log when an allocation succeeded or failed and when memory was freed. The current format appears to be:

```
Thread,Time,Action,Pointer,Size,Stream
```
* `Thread` is the thread ID (C++, not Java).
* `Time` is the time in a 24-hour format with microsecond granularity.
* `Action` is one of `allocate`, `allocate failure`, or `free`.
* `Pointer` is the address of the pointer, except for a failure where it is null.
* `Size` is the size of the allocation.
* `Stream` is the CUDA stream that the operation happened on.
Be aware that there are a lot of things to keep in mind here. Our spill framework will often wait for an allocation to fail before it starts to spill, so seeing an `allocate failure` does not mean that something terrible happened. `Size` is the size of the allocation at this point in the stack. Because RMM allows different layers of code between the call into RMM and the logging, the size might be different from what the user actually requested. RMM will often try to "align" the allocation, meaning round it up to a multiple of a set number of bytes. We try to get the logger as close to the actual allocator as possible, but just be careful.
Also know that the address here should correspond to the address in the leak debugging if and only if it was a `DeviceMemoryBuffer` that was allocated. In the case where it is a `cudf::column` the pointers will not line up.
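As an example of cross-referencing, here is a minimal sketch, assuming the `ai.rapids.cudf` Java API, that prints a buffer's address in the same hex form the RMM log and the leak messages use.

```java
import ai.rapids.cudf.DeviceMemoryBuffer;

public class AddressCrossRef {
  public static void main(String[] args) {
    try (DeviceMemoryBuffer buf = DeviceMemoryBuffer.allocate(256)) {
      // The RMM log should record this allocation as something like
      // "<thread>,<time>,allocate,0x<address>,256,<stream>", and a leak
      // warning for this buffer would show the same address in hex.
      System.out.printf("allocated %d bytes at 0x%x%n",
          buf.getLength(), buf.getAddress());
    }
  }
}
```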
Also be aware that it is possible for us to logically slice a buffer, so you might see one large allocation and then have buffers show up in Java that reference a subsection of that original allocation.
This log cannot answer a lot of questions directly, but we can write some simple scripts to parse it and answer questions like "What is the amount of GPU memory allocated at any point in time?" With that we can look at when an allocation failed and get an idea of how bad fragmentation is by comparing the size of the failed allocation with the amount of memory that should still be available on the GPU. We could go even further and track the address and size of each allocation to see what the maximum available contiguous buffer size is, but that would assume we are using a pooling allocator like `arena`, which cannot modify the virtual memory tables to help fix fragmentation. We can also look for memory leaks: any address where the allocation and free calls don't cancel each other out. You should filter out null pointers and zero-size allocations, as you don't need to free a nullptr if the allocation size was 0.
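As an illustration, here is a minimal sketch of such a script in Java (the class name and command-line handling are hypothetical); it assumes the comma-separated `Thread,Time,Action,Pointer,Size,Stream` format described above and only looks at the `allocate` and `free` actions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: walks an RMM log, tracks outstanding bytes over time,
// and reports addresses that were allocated but never freed.
public class RmmLogStats {
  public static void main(String[] args) throws IOException {
    Map<String, Long> live = new HashMap<>(); // address -> outstanding size
    long outstanding = 0;
    long peak = 0;
    for (String line : Files.readAllLines(Paths.get(args[0]))) {
      String[] f = line.split(",");
      if (f.length < 6) {
        continue; // skip blank or partial lines
      }
      String action = f[2];
      if (!action.equals("allocate") && !action.equals("free")) {
        continue; // skips the header line and "allocate failure" entries
      }
      String pointer = f[3];
      long size = Long.parseLong(f[4].trim());
      if (size == 0 || pointer.equals("0x0")) {
        continue; // nullptr / zero-size entries never need a matching free
      }
      if (action.equals("allocate")) {
        live.put(pointer, size);
        outstanding += size;
        peak = Math.max(peak, outstanding);
      } else {
        Long freed = live.remove(pointer);
        if (freed != null) {
          outstanding -= freed;
        }
      }
    }
    System.out.println("peak outstanding bytes: " + peak);
    System.out.println("possible leaks (never freed): " + live);
  }
}
```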
We are still missing some functionality that would be helpful when debugging these issues, in addition to tracking `Scalar` and `Table` values.

* We don't have any host memory logging like we do for RMM. This might change when/if we move to RMM for host memory allocation too, but for now if you need this you will probably have to write something yourself (a rough sketch of one approach follows this list).
* We don't have a good way to track spill or to log what is or is not spillable. The spill framework typically uses reference counting to know when it is the only one holding a reference to memory and makes the data spillable at that point. This means that if we don't get the reference counting right we end up never spilling.
* We also don't have a way to log exactly what was spilled and what was read back. We can probably guess that this is happening from other logs, but it would be really nice to have a way to actually capture it.
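If you do end up writing something yourself for host memory, here is a minimal sketch assuming the `ai.rapids.cudf` Java API. The `LoggedHostAlloc` wrapper is hypothetical and simply logs at allocation and close time, in a format loosely modeled on the RMM log.

```java
import ai.rapids.cudf.HostMemoryBuffer;

// Hypothetical stand-in for RMM-style logging of host allocations. Callers
// allocate through this wrapper instead of HostMemoryBuffer.allocate() so
// that every allocate/free pair shows up in the logs.
public class LoggedHostAlloc implements AutoCloseable {
  private final HostMemoryBuffer buffer;

  private LoggedHostAlloc(HostMemoryBuffer buffer) {
    this.buffer = buffer;
  }

  public static LoggedHostAlloc allocate(long bytes) {
    HostMemoryBuffer buf = HostMemoryBuffer.allocate(bytes);
    System.err.printf("%d,host-allocate,0x%x,%d%n",
        System.nanoTime(), buf.getAddress(), bytes);
    return new LoggedHostAlloc(buf);
  }

  public HostMemoryBuffer get() {
    return buffer;
  }

  @Override
  public void close() {
    System.err.printf("%d,host-free,0x%x,%d%n",
        System.nanoTime(), buffer.getAddress(), buffer.getLength());
    buffer.close();
  }
}
```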