feat(extensions/nanoarrow_device): Draft DeviceArray interface #205
Conversation
Codecov Report
@@ Coverage Diff @@
## main #205 +/- ##
==========================================
- Coverage 87.64% 87.17% -0.48%
==========================================
Files 63 66 +3
Lines 9789 10061 +272
==========================================
+ Hits 8580 8771 +191
- Misses 1209 1290 +81
... and 2 files with indirect coverage changes
Force-pushed from 3c95a4d to dbb4404.
static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
                                         uint8_t* ptr, int64_t old_size) {
  if (ptr != NULL) {
    cudaFree(ptr);
  }
}
Most GPU libraries / frameworks have their own memory pool / memory management implementations that are often asynchronous (and are ordered by CUDA streams), so they won't be able to benefit from this implementation. This is generally true for most operations: free, alloc, realloc, memset, memcpy, etc.
I'm not sure if we need an actual implementation to live within nanoarrow or if we can just define an interface for downstream libraries to implement.
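To make that concrete, here is a hedged sketch (not part of this PR) of how a downstream library with a stream-ordered pool might supply its own free callback matching the ArrowBufferAllocator signature above. The use of private_data to carry a stream and the cudaFreeAsync call are illustrative assumptions, not this PR's code.

#include <cuda_runtime_api.h>

#include "nanoarrow.h"  // assumed header providing struct ArrowBufferAllocator

// Illustrative only: a downstream pool could stash its stream in private_data
// and free asynchronously instead of calling the blocking cudaFree().
static void MyPoolAllocatorFree(struct ArrowBufferAllocator* allocator,
                                uint8_t* ptr, int64_t old_size) {
  (void)old_size;
  cudaStream_t stream = (cudaStream_t)allocator->private_data;
  if (ptr != NULL) {
    // Stream-ordered free (CUDA 11.2+): queued on the stream, ordered after
    // any work already enqueued there.
    cudaFreeAsync(ptr, stream);
  }
}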
An early version of the ArrowDeviceBufferXXX() functions had a sync_event* argument, which I removed before I saw cudaMemcpyAsync and friends in the documentation. I don't know if that's the perfect interface, but nanoarrow's generic "copy this array to the device" implementation would benefit a lot, since essentially all of those buffers can be copied in parallel.
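As a rough illustration of that idea (the helper below is hypothetical and not from this PR): every buffer of an array can be enqueued with cudaMemcpyAsync on a single stream, with an event recorded afterwards to serve as the sync_event.

#include <cuda_runtime_api.h>

// Hypothetical helper: kick off every buffer copy on one stream and record an
// event that can be exported as the device array's sync_event.
static cudaError_t CopyBuffersAsync(void* const* dst, const void* const* src,
                                    const size_t* sizes, int n_buffers,
                                    cudaStream_t stream, cudaEvent_t sync_event) {
  for (int i = 0; i < n_buffers; i++) {
    cudaError_t result =
        cudaMemcpyAsync(dst[i], src[i], sizes[i], cudaMemcpyHostToDevice, stream);
    if (result != cudaSuccess) return result;
  }
  // The event completes only after all of the copies above have finished.
  return cudaEventRecord(sync_event, stream);
}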
// Pointer vs. not pointer...is there memory ownership to consider here?
cudaEvent_t* cuda_event = (cudaEvent_t*)sync_event;
cudaError_t result = cudaEventSynchronize(*cuda_event);
In most situations you'd want to use cudaStreamWaitEvent as opposed to this API, as it's much more efficient and doesn't unnecessarily block the CPU until things are done.
I assume that is what a library doing GPU-to-GPU calculations would do...here I think it does need to be the blocking version (this is the method that is called before an arbitrary ArrowDeviceArray or a slice of it is copied back to the CPU).
To call cudaStreamWaitEvent you'd need to know what stream to wait on.
Yes, this function is intended to do the bad thing and block until it's safe to do CPU things. If there's a way to avoid the sync before copying back to the CPU it could be removed.
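For reference, a minimal sketch of the two synchronization options discussed in this thread (the stream and event parameters are assumptions, not taken from the PR):

#include <cuda_runtime_api.h>

// GPU-to-GPU consumer: enqueue the wait on the consumer's own stream; the CPU
// is never blocked.
static cudaError_t WaitOnDevice(cudaStream_t consumer_stream, cudaEvent_t sync_event) {
  return cudaStreamWaitEvent(consumer_stream, sync_event, 0);
}

// Copy back to the CPU: host code needs the producer's work to be done before
// it reads the data, so it blocks on the event.
static cudaError_t WaitOnHost(cudaEvent_t sync_event) {
  return cudaEventSynchronize(sync_event);
}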
// specific language governing permissions and limitations
// under the License.

#include <cuda_runtime_api.h>
We'd likely be better off using the CUDA driver API here instead of the runtime API, as there are much stronger forward compatibility guarantees as well as easier deployment (someone can have the driver installed without the CUDA runtime, but not the reverse).
That does seem like a better fit (although it may require implementing some reference counting of our own).
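A hedged sketch of what a driver-API allocation path could look like; the primary-context retain/release pairing is one way to get the reference counting mentioned above, and none of this code is from the PR.

#include <cuda.h>

// Illustrative driver-API allocation: retain the device's primary context
// (reference counted by the driver), make it current, and allocate.
// Error handling and context bookkeeping are simplified.
static CUresult DriverAlloc(int device_id, size_t size, CUdeviceptr* out,
                            CUdevice* out_device) {
  CUcontext ctx;
  CUresult err = cuInit(0);
  if (err != CUDA_SUCCESS) return err;
  err = cuDeviceGet(out_device, device_id);
  if (err != CUDA_SUCCESS) return err;
  err = cuDevicePrimaryCtxRetain(&ctx, *out_device);
  if (err != CUDA_SUCCESS) return err;
  err = cuCtxSetCurrent(ctx);
  if (err != CUDA_SUCCESS) return err;
  return cuMemAlloc(out, size);
}

// The matching free drops the primary-context reference taken above.
static void DriverFree(CUdevice device, CUdeviceptr ptr) {
  cuMemFree(ptr);
  cuDevicePrimaryCtxRelease(device);
}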
// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
// no good way to incorporate that just yet
For what it's worth: this will likely be a blocker for most libraries / frameworks to be able to utilize things.
I don't think there's a technical limitation to making ArrowDeviceArrayViewCopy() return after having kicked off all the buffer copies and populated the ArrowDeviceArray's sync_event...this is mostly a personal limitation (steep learning curve for me).
} else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
           device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
           device_src->device_id == device_dst->device_id) {
  // Move
  return 0;
Should we handle the situations where the src is ARROW_DEVICE_CUDA_HOST and the dst is ARROW_DEVICE_CUDA, and vice versa?
I did see cuMemHostGetDevicePointer(), so I assume this is possible. I think it would require that ArrowDeviceArrayViewCopy() has a device-specific implementation (probably for the best anyway).
cuMemHostGetDevicePointer() gets a pointer to that pinned host memory that can be used from device code, but it doesn't actually copy any memory to device memory. As far as I know it can be used anywhere that device memory can be used, but it obviously has different performance characteristics, which would likely be very unexpected.
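For illustration, a small sketch using the runtime API (since that's what the file currently uses): mapping pinned host memory only exposes the same allocation to the GPU rather than copying it.

#include <cuda_runtime_api.h>

// Allocate pinned, mapped host memory and get its device-visible alias.
// Nothing is copied: kernels reading device_ptr reach across the bus into
// the same host allocation, with correspondingly different performance.
static cudaError_t AllocPinnedMapped(size_t size, void** host_ptr, void** device_ptr) {
  cudaError_t result = cudaHostAlloc(host_ptr, size, cudaHostAllocMapped);
  if (result != cudaSuccess) return result;
  return cudaHostGetDevicePointer(device_ptr, *host_ptr, 0);
}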
First pass
/// Creates a new buffer whose data member can be accessed by the GPU by
/// moving an existing buffer. If NANOARROW_OK is returned, src will have
/// been released or moved by the implementation and dst must be released by
/// the caller.
/// Implementations must check device_src and device_dst and return ENOTSUP if
/// not prepared to handle this operation.
ArrowErrorCode (*buffer_move)(struct ArrowDevice* device_src, struct ArrowBuffer* src,
                              struct ArrowDevice* device_dst, struct ArrowBuffer* dst);
Should we specify semantics for the src and dst devices, such as requiring the src to be CPU memory?
The idea was that an implementation might be able to handle a few directions, although it does result in verbose method implementations. For the CUDA case the generality is somewhat useful...it can also theoretically move a buffer from CUDA_HOST to CUDA and it might be difficult to construct a method signature that captures that. That generality might also not be useful 🤷
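To illustrate the check-and-ENOTSUP contract with the simplest possible direction, here is a hedged sketch of a CPU-only buffer_move. The ArrowBufferMove() call and the header name are assumptions about nanoarrow's CPU API rather than code from this PR.

#include <errno.h>

#include "nanoarrow_device.h"  // assumed header for the types in this PR

// Sketch of a buffer_move implementation that only handles CPU -> CPU and
// defers every other direction per the ENOTSUP contract above.
static ArrowErrorCode CpuBufferMove(struct ArrowDevice* device_src,
                                    struct ArrowBuffer* src,
                                    struct ArrowDevice* device_dst,
                                    struct ArrowBuffer* dst) {
  if (device_src->device_type != ARROW_DEVICE_CPU ||
      device_dst->device_type != ARROW_DEVICE_CPU) {
    return ENOTSUP;  // not prepared to handle this direction
  }
  // True zero-copy move on the CPU: dst takes ownership of src's allocation.
  ArrowBufferMove(src, dst);
  return NANOARROW_OK;
}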
/// \brief Copy a section of memory into a preallocated buffer
///
/// As opposed to the other buffer operations, this is designed to support
/// copying very small slices of memory.
/// Implementations must check device_src and device_dst and return ENOTSUP if
/// not prepared to handle this operation.
ArrowErrorCode (*buffer_copy)(struct ArrowDevice* device_src,
                              struct ArrowDeviceBufferView src,
                              struct ArrowDevice* device_dst,
                              struct ArrowDeviceBufferView dst);
Same question: should we put specifics as to the semantics of the device types? Should implementations have to check the device type every time for both source and destination, or have to implement multiple devices?
static ArrowErrorCode ArrowDeviceBufferGetInt32(struct ArrowDevice* device,
                                                struct ArrowBufferView buffer_view,
                                                int64_t i, int32_t* out) {
Should there be an equivalent that doesn't do the copy?
I'm not sure I understand where that would be used? It's definitely suboptimal to issue copies in this way (but your suggestion of skipping validation and Keith's suggestion of leveraging async memcpy may be a workaround).
It would probably be useful for device code to be able to access the value of a specific index of the array without having to perform the copy. It also allows CPU code to find the address of the specific index (pointer into non-cpu memory) that can then be used for whatever is necessary on the device side without needing to copy the value.
Definitely don't want to issue copies one by one like this even asynchronously, that would be really really bad performance wise and put significant pressure on the system via the GPU driver.
I would +1 @zeroshade's suggestion of skipping validation and generally anything that needs to introspect the data.
Ok, I updated the function name to ArrowDeviceArrayViewResolveBufferSizes() since that's what it's actually doing, and added ArrowDeviceArrayViewSetArrayMinimal() that just sets array_view->buffers[i].size_bytes to -1 if it would require a copy to calculate.

For the case of "just get me the pointer value", I don't think there needs to be a function (array_view->buffers[i].data.as_int32 + some_logical_offset would do it).

For the case where we copy back to the CPU, I don't see a way around copying the last int32/int64 from the offsets buffer (or else there is no way to know how many bytes of the next buffer to copy). We can possibly mitigate the impact of that by asynchronously kicking off all the tiny copies at once?
Yea, issuing two single element copies for getting the starting and ending offset into the CPU is necessary, but once we have those offsets, we should just do pointer arithmetic to get a pointer and a size to feed into a copy call.
Ideally you want to issue the two individual element copies asynchronously (potentially on different streams so they can be overlapped even though they're tiny), synchronize the stream(s) since you need those values to use in host code, and then issue the actual data copy.
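A hedged sketch of that sequence for a string/binary column (the function, its parameters, and the single-stream handling are illustrative, not from this PR):

#include <cuda_runtime_api.h>

// Fetch the two offsets bounding elements [i, i + length) of a string column,
// then copy exactly that slice of the data buffer back to the host.
// host_out is assumed to be large enough for the slice.
static cudaError_t CopyStringSlice(const int32_t* device_offsets,
                                   const char* device_data, int64_t i,
                                   int64_t length, char* host_out,
                                   cudaStream_t stream) {
  int32_t bounds[2];
  cudaError_t result;

  // Two tiny single-element copies: the start offset and the end offset.
  result = cudaMemcpyAsync(&bounds[0], device_offsets + i, sizeof(int32_t),
                           cudaMemcpyDeviceToHost, stream);
  if (result != cudaSuccess) return result;
  result = cudaMemcpyAsync(&bounds[1], device_offsets + i + length, sizeof(int32_t),
                           cudaMemcpyDeviceToHost, stream);
  if (result != cudaSuccess) return result;

  // Host code needs the offset values, so synchronize before using them.
  result = cudaStreamSynchronize(stream);
  if (result != cudaSuccess) return result;

  // Pointer arithmetic gives the slice; issue the actual data copy.
  return cudaMemcpy(host_out, device_data + bounds[0],
                    (size_t)(bounds[1] - bounds[0]), cudaMemcpyDeviceToHost);
}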
if(NANOARROW_DEVICE_WITH_CUDA)
  find_package(CUDAToolkit REQUIRED)
  set(NANOARROW_DEVICE_SOURCES_CUDA src/nanoarrow/nanoarrow_device_cuda.c)
  set(NANOARROW_DEVICE_LIBS_CUDA CUDA::cudart)
If we're going to continue using the runtime, we may want to use the static library instead: CUDA::cudart_static. Would still recommend using the driver library though.
Driver library is a definite yes (just haven't gotten there yet).
  return NANOARROW_OK;
}

static ArrowErrorCode ArrowDeviceBufferGetInt64(struct ArrowDevice* device,
+1 to the int32 function here
I think this is handled by array_view->buffer_views[i].data.as_int32 + some_index (which would get you the pointer to an element of a buffer).
// Wait on device_array to synchronize with the CPU
NANOARROW_RETURN_NOT_OK(device->synchronize_event(ArrowDeviceCpu(), device,
                                                  device_array->sync_event, error));
Why do we need to synchronize on the event here? This function ultimately is just responsible for setting the pointers in the array view from the passed-in array, correct? If so, synchronizing guarantees that the data underneath the pointers is synchronized, but doesn't impact the pointers themselves at all.
I think I correctly separated out the case where it's needed (it is needed before copying to the CPU, correct? Or is that synchronization handled by cudaMemcpy()?).
All variants of cudaMemcpy are stream ordered, so you can safely use them without synchronizing the stream, but the destination data is stream ordered as well, so if you're going to operate on it from a different stream or from host code then you need to synchronize the stream in some way.
It seems like we do need to do the bad thing here and wait for a sync before calling cudaMemcpy() for the GPU -> CPU direction (although hopefully this is now isolated such that it won't get accidentally called by somebody who does not need it).
We should sync after calling the cudaMemcpy() as opposed to before it. Otherwise, you could in theory get into a situation where that device-to-host copy is asynchronous (if you have pinned host memory, for example) and accessing it from the CPU without synchronization is a race condition.
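A minimal sketch of that ordering (the stream parameter is an assumption): the synchronization comes after the device-to-host copy, so the host never reads a destination the copy might still be writing.

#include <cuda_runtime_api.h>

// Device -> host copy followed by the synchronization. With pinned host
// memory the copy can complete asynchronously, so syncing before the memcpy
// (instead of after) would leave a race on host_dst.
static cudaError_t CopyToHostAndSync(void* host_dst, const void* device_src,
                                     size_t size, cudaStream_t stream) {
  cudaError_t result =
      cudaMemcpyAsync(host_dst, device_src, size, cudaMemcpyDeviceToHost, stream);
  if (result != cudaSuccess) return result;
  // Only after this returns is it safe for CPU code to read host_dst.
  return cudaStreamSynchronize(stream);
}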
NANOARROW_RETURN_NOT_OK(ArrowDeviceBufferInit(device_src, buffer_view_src, device_dst,
                                              ArrowArrayBuffer(dst, i)));
If this buffer initialization is asynchronous then we need to set a synchronization event somewhere I think?
Ok! There are definitely some holes in this implementation (notably around properly synchronizing memory copies). I'd propose that this PR get merged (and clearly marked as under development/experimental in the README) with the related changes grouped into follow-ups.

I think it's still an open question whether or not this particular extension will be used/will be useful...if there is no interest in using it before the next release it can always be excluded from the source release (like the Python bindings currently are) or moved back to a PR state.
Still in very much draft form; however, it does implement arbitrary ArrowArray copy to/from ARROW_DEVICE_METAL, ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST, and ARROW_DEVICE_CPU.

The nanoarrow_device extension as drafted here serves a similar purpose to nanoarrow: a means by which to create and consume the C ABI with the intention of shipping those structures to other libraries to do transformations, and potentially retrieving them again after the computation is complete. Perhaps another way to put it is that nanoarrow is designed to help at the edges: it can create and consume. Similarly, the nanoarrow_device extension is designed to help at the edges: it can copy/move arrays to and from CPU-land.
With this PR, you can currently do something like:
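The example that originally followed this sentence isn't reproduced here; as a loose sketch of the intended flow, every call below should be read as a guess at this draft's API rather than its final form (ArrowDeviceCuda(), ArrowDeviceArrayViewSetArray(), and the ArrowDeviceArrayViewCopy() argument order are all assumptions):

#include "nanoarrow_device.h"  // assumed header name for this extension

// Hypothetical flow: describe a CPU-resident array, then copy its buffers to
// a CUDA device. Signatures are illustrative, not the PR's actual API.
static ArrowErrorCode CopyArrayToCuda(struct ArrowArray* cpu_array,
                                      struct ArrowSchema* schema,
                                      struct ArrowDeviceArray* out,
                                      struct ArrowError* error) {
  struct ArrowDevice* gpu = ArrowDeviceCuda(0);  // hypothetical accessor
  struct ArrowDeviceArrayView view;

  // Hypothetical setup step: point the view at the CPU array + schema.
  NANOARROW_RETURN_NOT_OK(
      ArrowDeviceArrayViewSetArray(&view, cpu_array, schema, error));

  // Mentioned in this PR: copy every buffer to the device. The resulting
  // ArrowDeviceArray's sync_event marks when the copies are complete.
  NANOARROW_RETURN_NOT_OK(ArrowDeviceArrayViewCopy(&view, gpu, out));
  return NANOARROW_OK;
}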
In concrete terms, that means we need to know enough about a device to (1) copy and/or move an arbitrary ArrowArray/ArrowSchema pair to a device from the CPU and (2) copy/move an arbitrary ArrowDeviceArray/ArrowSchema pair back to the CPU. The three types of copying I support (and maybe there could be fewer/need to be more) are:

- ArrowDeviceBufferInit(): Make a non-owning buffer into an owning buffer on a device. The entry point if you want to take a slice of an ArrowArrayView and ship it to a device.
- ArrowDeviceBufferMove(): Move an existing (owning) buffer to a device. For devices like the CPU, this is a true zero-copy move; for shared memory this can also sometimes be zero copy (e.g., Apple Metal -> CPU) but might also involve a copy.
- ArrowDeviceBufferCopy(): Copy a section of a buffer into a preallocated section of another buffer. I'm envisioning this to be necessary when copying a String, Binary, or List...we need the first and last values of the offsets buffer in order to know what portion of the data buffer to copy. It seems unnecessary to copy 4 bytes of a buffer into an owning variant covered by the first bullet but 🤷.

This PR currently provides support for the CPU device, Apple Metal, CUDA, and CUDA_HOST (i.e., CPU memory that has been registered with CUDA, which CUDA copies under the hood).