feat(extensions/nanoarrow_device): Draft DeviceArray interface #205

paleolimbot · 2023-05-25T17:39:16Z

After:

Still in very much draft form; however, it does implement arbitrary ArrowArray copy to/from ARROW_DEVICE_METAL, ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST, and ARROW_DEVICE_CPU.

The nanoarrow_device extension as drafted here serves a similar purpose to nanoarrow: a means by which to create and consume the C ABI with the intention of shipping those structures to other libraries to do transformations, and potentially retrieving them again after the computation is complete. Perhaps another way to put it is that nanoarrow is designed to help at the edges: it can create and consume. Similarly, the nanoarrow_device extension is designed to help at the edges: it can copy/move arrays to and from CPU-land.

With this PR, you can currently do something like:

struct ArrowDevice* gpu = ArrowDeviceMetalDefaultDevice();
// Alternatively, ArrowDeviceCuda(ARROW_DEVICE_CUDA, 0)
// or  ArrowDeviceCuda(ARROW_DEVICE_CUDA_HOST, 0)
struct ArrowDevice* cpu = ArrowDeviceCpu();
struct ArrowArray array;
struct ArrowDeviceArray device_array;
struct ArrowDeviceArrayView device_array_view;

// Build a CPU array
ASSERT_EQ(ArrowArrayInitFromType(&array, NANOARROW_TYPE_STRING), NANOARROW_OK);
ASSERT_EQ(ArrowArrayStartAppending(&array), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendString(&array, ArrowCharView("abc")), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendString(&array, ArrowCharView("defg")), NANOARROW_OK);
ASSERT_EQ(ArrowArrayAppendNull(&array, 1), NANOARROW_OK);
ASSERT_EQ(ArrowArrayFinishBuildingDefault(&array, nullptr), NANOARROW_OK);

// Convert to a DeviceArray, still on the CPU
ArrowDeviceArrayInit(&device_array, cpu);
ArrowArrayMove(&array, &device_array.array);

// Parse contents into a view that can be copied to another device
ArrowDeviceArrayViewInit(&device_array_view);
ArrowArrayViewInitFromType(&device_array_view.array_view, string_type);
ASSERT_EQ(ArrowDeviceArrayViewSetArray(&device_array_view, &device_array, nullptr),
          NANOARROW_OK);

// Try to zero-copy move to another device or copy if that is not possible. Zero-copy move
// is implemented for ARROW_DEVICE_METAL and ARROW_DEVICE_CUDA_HOST for the
// gpu -> cpu case.
struct ArrowDeviceArray device_array2;
device_array2.array.release = nullptr;
ASSERT_EQ(
    ArrowDeviceArrayTryMove(&device_array, &device_array_view, gpu, &device_array2),
    NANOARROW_OK);

In concrete terms, that means we to know enough about a device to (1) copy and/or move an arbitrary ArrowArray/ArrowSchema pair to a device from the CPU and (2) copy/move an arbitrary ArrowDeviceArray/ArrowSchema pair back to the CPU. The three types of copying I support (and maybe there could be fewer/need to be more) are:

ArrowDeviceBufferInit(): Make a non-owning buffer into an owning buffer on a device. The entry point if you want to take a slice of an ArrowArrayView and ship it to a device.
ArrowDeviceBufferMove(): Move an existing (owning) buffer to a device. For devices like the CPU, this is a true zero-copy move; for shared memory this can also sometimes be zero copy (e.g., Apple Metal -> CPU) but might also involve a copy.
ArrowDeviceBufferCopy(): Copy a section of a buffer into a preallocated section of another buffer. I'm envisioning this to be necessary when copying a String, Binary, List...we need the first and last values of the offsets buffer in order to know what portion of the data buffer to copy. It seems unnecessary to copy 4 bytes of a buffer into an owning variant covered by the first bullet but 🤷 .

This PR currently provides support for the CPU device, Apple Metal, CUDA, and CUDA_HOST (i.e., CPU memory that has been registered with CUDA which CUDA copies under the hood).

codecov-commenter · 2023-05-26T17:22:39Z

Codecov Report

Merging #205 (3b5ee6c) into main (e811cfe) will decrease coverage by 0.48%.
The diff coverage is 71.26%.

@@            Coverage Diff             @@
##             main     #205      +/-   ##
==========================================
- Coverage   87.64%   87.17%   -0.48%     
==========================================
  Files          63       66       +3     
  Lines        9789    10061     +272     
==========================================
+ Hits         8580     8771     +191     
- Misses       1209     1290      +81

Impacted Files	Coverage Δ
src/nanoarrow/array.c	`91.42% <0.00%> (-0.60%)`	⬇️
src/nanoarrow/array_inline.h	`89.72% <ø> (ø)`
src/nanoarrow/nanoarrow.h	`100.00% <ø> (ø)`
.../nanoarrow_device/src/nanoarrow/nanoarrow_device.c	`66.35% <66.35%> (ø)`
.../nanoarrow_device/src/nanoarrow/nanoarrow_device.h	`100.00% <100.00%> (ø)`
...anoarrow_device/src/nanoarrow/nanoarrow_device.hpp	`100.00% <100.00%> (ø)`

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

extensions/nanoarrow_device/CMakeLists.txt

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h

kkraus14 · 2023-06-09T21:10:18Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

+static void ArrowDeviceCudaAllocatorFree(struct ArrowBufferAllocator* allocator,
+                                         uint8_t* ptr, int64_t old_size) {
+  if (ptr != NULL) {
+    cudaFree(ptr);
+  }
+}


Most GPU libraries / frameworks have their own memory pool / memory management implementations that are often asynchronous (and are ordered by CUDA streams) where they won't be able to benefit from this implementation. This is generally true for most operations: free, alloc, realloc, memset, memcpy, etc.

I'm not sure if we need an actual implementation to live within nanoarrow or if we can just define an interface for downstream libraries to implement.

An early version of ArrowDeviceBufferXXX() functions had a sync_event* argument, which I removed before I saw cudaMemcpyAsync and friends in the documentation. I don't know if that's the perfect interface, but the part of nanoarrow's generic "copy this array to the device" implementation would benefit a lot since essentially all of those buffers can be copied in parallel.

kkraus14 · 2023-06-09T21:11:47Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

+
+  // Pointer vs. not pointer...is there memory ownership to consider here?
+  cudaEvent_t* cuda_event = (cudaEvent_t*)sync_event;
+  cudaError_t result = cudaEventSynchronize(*cuda_event);


In most situations you'd want to use cudaStreamWaitEvent as opposed to this API as its much more efficient and doesn't unnecessarily block the CPU until things are done.

I assume that is what a library doing gpu--gpu calculations would do...here I think it does need to be the blocking version (this is the method that is called before an arbitrary ArrowDeviceArray or a slice of it is copied back to the CPU).

To call cudaStreamWaitEvent you'd need to know what stream to wait on.

Yes, this function is intended to do the bad thing and block until it's safe to do CPU things. If there's a way to avoid the sync before copying back to the CPU it could be removed.

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

kkraus14 · 2023-06-09T21:14:45Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

+// specific language governing permissions and limitations
+// under the License.
+
+#include <cuda_runtime_api.h>


We'd likely be better off using the CUDA driver API here instead of the runtime API as there's much stronger forward compatibility guarantees as well as easier deployment (someone can have the driver installed but not the CUDA runtime, but not the reverse).

That does seem like a better fit (although may require implementing some reference counting of our own).

kkraus14 · 2023-06-09T21:15:45Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

+// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
+// no good way to incorporate that just yet


For what it's worth: this will likely be a blocker for most libraries / frameworks to be able to utilize things.

I don't think there's a technical limitation for making ArrowDeviceArrayViewCopy() return after having kicked off all the buffer copies and populating the ArrowDeviceArray's sync_event...this is mostly a personal limitation (steep learning curve for me).

kkraus14 · 2023-06-09T21:17:06Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

+  } else if (device_src->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_dst->device_type == ARROW_DEVICE_CUDA_HOST &&
+             device_src->device_id == device_dst->device_id) {
+    // Move
+    return 0;


Should we handle the situations where the src is ARROW_DEVICE_CUDA_HOST and dst is ARROW_DEVICE_CUDA and vice versa?

I did see cuMemHostGetDevicePointer(), so I assume this is possible. I think it would require that ArrowDeviceArrayViewCopy() has a device-specific implementation (probably for the best anyway).

cuMemHostGetDevicePointer() gets a pointer to that pinned host memory that can be used from device code but doesn't actually copy any memory to device memory. As far as I know it can be used anywhere that device memory can be used, but obviously has different performance characteristics where that would likely be very unexpected.

zeroshade

First pass

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h

zeroshade · 2023-06-13T18:32:52Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h

+  /// Creates a new buffer whose data member can be accessed by the GPU by
+  /// moving an existing buffer. If NANOARROW_OK is returned, src will have
+  /// been released or moved by the implementation and dst must be released by
+  /// the caller.
+  /// Implementations must check device_src and device_dst and return ENOTSUP if
+  /// not prepared to handle this operation.
+  ArrowErrorCode (*buffer_move)(struct ArrowDevice* device_src, struct ArrowBuffer* src,
+                                struct ArrowDevice* device_dst, struct ArrowBuffer* dst);


Should we specify semantics about the src and dst devices such as the src being CPU memory? etc.

The idea was that an implementation might be able to handle a few directions, although it does result in verbose method implementations. For the CUDA case the generality is somewhat useful...it can also theoretically move a buffer from CUDA_HOST to CUDA and it might be difficult to construct a method signature that captures that. That generality might also not be useful 🤷

zeroshade · 2023-06-13T18:34:28Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h

+  /// \brief Copy a section of memory into a preallocated buffer
+  ///
+  /// As opposed to the other buffer operations, this is designed to support
+  /// copying very small slices of memory.
+  /// Implementations must check device_src and device_dst and return ENOTSUP if
+  /// not prepared to handle this operation.
+  ArrowErrorCode (*buffer_copy)(struct ArrowDevice* device_src,
+                                struct ArrowDeviceBufferView src,
+                                struct ArrowDevice* device_dst,
+                                struct ArrowDeviceBufferView dst);


same question, should we put specifics as to the semantics of the device types? Should implementations have to check the device type every time for both source and destination or have to implement multiple devices?

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.h

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_metal.cc

zeroshade · 2023-06-13T18:59:35Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

+static ArrowErrorCode ArrowDeviceBufferGetInt32(struct ArrowDevice* device,
+                                                struct ArrowBufferView buffer_view,
+                                                int64_t i, int32_t* out) {


Should there be an equivalent that doesn't do the copy?

I'm not sure I understand where that would be used? It's definitely suboptimal to issue copies in this way (but your suggestion of skipping validation and Keith's suggestion of leveraging async memcpy may be a workaround).

It would probably be useful for device code to be able to access the value of a specific index of the array without having to perform the copy. It also allows CPU code to find the address of the specific index (pointer into non-cpu memory) that can then be used for whatever is necessary on the device side without needing to copy the value.

Definitely don't want to issue copies one by one like this even asynchronously, that would be really really bad performance wise and put significant pressure on the system via the GPU driver.

I would +1 @zeroshade's suggestion of skipping validation and generally anything that needs to introspect the data.

Ok, I updated the function name to ArrowDeviceArrayViewResolveBufferSizes() since that's what it's actually doing and added ArrowDeviceArrayViewSetArrayMinimal() that just sets array_view->buffers[i].size_bytes to -1 if it would require a copy to calculate.

For the case of "just get me the pointer value", I don't think there needs to a be a function (array_view->buffers[i].data.as_int32 + some_logical_offset would do it).

For the case where we copy back to the CPU, I don't see a way around copying the last int32/int64 from the offsets buffer (or else there is no way to know how many bytes of the next buffer to copy). We can possibly mitigate the impact of that by asynchronously kicking off all the tiny copies at once?

Yea, issuing two single element copies for getting the starting and ending offset into the CPU is necessary, but once we have those offsets, we should just do pointer arithmetic to get a pointer and a size to feed into a copy call.

Ideally you want to issue the two individual element copies asynchronously (potentially on different streams so they can be overlapped even though they're tiny), synchronize the stream(s) since you need those values to use in host code, and then issue the actual data copy.

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

kkraus14 · 2023-06-20T16:33:04Z

extensions/nanoarrow_device/CMakeLists.txt

+  if (NANOARROW_DEVICE_WITH_CUDA)
+    find_package(CUDAToolkit REQUIRED)
+    set(NANOARROW_DEVICE_SOURCES_CUDA src/nanoarrow/nanoarrow_device_cuda.c)
+    set(NANOARROW_DEVICE_LIBS_CUDA CUDA::cudart)


If we're going to continue using the runtime, may want to use the static library instead: CUDA::cudart_static.

Would still recommend using the driver library though.

Driver library is a definite yes (just haven't gotten there yet).

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

kkraus14 · 2023-06-20T17:10:22Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

+static ArrowErrorCode ArrowDeviceBufferGetInt32(struct ArrowDevice* device,
+                                                struct ArrowBufferView buffer_view,
+                                                int64_t i, int32_t* out) {


Definitely don't want to issue copies one by one like this even asynchronously, that would be really really bad performance wise and put significant pressure on the system via the GPU driver.

I would +1 @zeroshade's suggestion of skipping validation and generally anything that needs to introspect the data.

kkraus14 · 2023-06-20T17:10:36Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

+  return NANOARROW_OK;
+}
+
+static ArrowErrorCode ArrowDeviceBufferGetInt64(struct ArrowDevice* device,


+1 to the int32 function here

I think this is handled by array_view->buffer_views[i].data.as_int32 + some_index (which would get you the pointer to an element of a buffer).

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

kkraus14 · 2023-06-20T17:15:20Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

+  // Wait on device_array to synchronize with the CPU
+  NANOARROW_RETURN_NOT_OK(device->synchronize_event(ArrowDeviceCpu(), device,
+                                                    device_array->sync_event, error));


Why do we need to synchronize on the event here? This function ultimately is just responsible for setting the points in the array view from the passed in array, correct?

If so, synchronize guarantees that the data underneath the pointer is synchronized, but doesn't impact the pointers themselves at all.

I think I correctly separated the case where it's needed (it is needed before copy to CPU, correct? Or is that synchronizatio handled by cudaMemcpy()?).

all variants of cudaMemcpy are stream ordered so you can safely use them without synchronizing the stream, but the destination data is stream ordered as well, so if you're going to operate on it from a different stream or from host code then you need to synchronize the stream in some kind of way

It seems like we do need to do the bad thing here and wait for a sync before calling cudaMemcpy() here for the GPU -> CPU direction (although hopefully this is now isolated such that it won't get accidentally called by somebody who does not need this).

We should sync after calling the cudaMemcpy() as opposed to before it. Otherwise, you could in theory get into a situation where that device to host copy is asynchronous (if you have pinned host memory for example) and accessing it from the CPU without synchronization is a race condition.

kkraus14 · 2023-06-20T17:22:37Z

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

+    NANOARROW_RETURN_NOT_OK(ArrowDeviceBufferInit(device_src, buffer_view_src, device_dst,
+                                                  ArrowArrayBuffer(dst, i)));


If this buffer initialization is asynchronous then we need to set a synchronization event somewhere I think?

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device.c

extensions/nanoarrow_device/src/nanoarrow/nanoarrow_device_cuda.c

paleolimbot · 2023-06-22T17:46:51Z

Ok! There are definitely some holes in this implementation (notably around properly synchronizing memory copies). I'd propose that this PR get merged (and clearly marked as under development/experimental in the README) with some related changes grouped into some follow-ups:

Switch to using the driver library: Device extension should use the CUDA driver library (not the CUDA runtime library) #246
Properly order copy operations when copying to/from device: Device extension should implement async buffer copying for CUDA #245
Expand test coverage to more than just string arrays: Expand test coverage of device extension #247

I think it's still an open question as to whether or not this particular extension will be used/will be useful...if there is no interest in using it before the next release it can always be excluded from the source release (like the Python bindings currently are) or moved back to a PR state.

paleolimbot marked this pull request as draft May 25, 2023 17:39

paleolimbot force-pushed the device-array-ext branch 2 times, most recently from 3c95a4d to dbb4404 Compare June 9, 2023 17:01

paleolimbot marked this pull request as ready for review June 9, 2023 17:32

paleolimbot requested review from zeroshade and lidavidm June 9, 2023 17:32

kkraus14 reviewed Jun 9, 2023

View reviewed changes

zeroshade reviewed Jun 13, 2023

View reviewed changes

paleolimbot and others added 21 commits June 20, 2023 13:11

sketch device extension

8dcd72c

more boilerplate

76f4e4e

add draft device header

2967b0e

sketch the ArrowDevice

1c25735

cpu methods

76fc7ca

always set error on return

30090ac

stream wrapper implementation

c4b6176

symbol namespace

a335660

add hpp and scaffold test

bd839cc

test stream wrapper

41b1a02

some ci, docs for device extension

89467c7

fix for doc changes

13ffc7f

more CI

465535b

maybe fix bundling

826a9fc

maybe release callback on the device

69d101a

maybe namespaces

1a42008

basic dummy non-cpu tests

ccadd50

hpp helper for now-releaseable device

01fa32b

sketch metal impl sources

c085e94

add metal helper

6975479

CMake shenanegins for metal to work

9e668de

paleolimbot added 4 commits June 20, 2023 13:31

simplify synchronize_event

c33e064

fix one more sync reference

8119476

assume pointer arithmetic and remove ArrowDeviceBuffer

2cacb3b

add minimal version of setarray

8d6666d

kkraus14 reviewed Jun 20, 2023

View reviewed changes

paleolimbot and others added 18 commits June 20, 2023 16:53

start on cuda-specific device with event memory

fee2a83

passing tests

c89566a

use cudaMemcpyHostToHost

a78c3a4

use the bufferdeallocator pattern

16d51fd

theoretically handle more than one device

e50344a

be even lazier about resolving buffer sizes

b8e714f

typo

973ac50

always make the array move/copy distinction explicit

99a15f9

start on new array move

56658bb

implement the cuda mover

9952fa9

maybe use new copier for CUDA

5a6f388

fix metal build

0f97256

remove copy_required and old array mover

5810495

maybe start on event handling with metal

f41e8fd

theoretically create the event for the proper device

c1f98b4

use static lib for now

b2f3b9b

more cudaMemcpy

8ad1925

add README

e6ca953

This was referenced Jun 22, 2023

Device extension should implement async buffer copying for CUDA #245

Closed

Device extension should use the CUDA driver library (not the CUDA runtime library) #246

Closed

paleolimbot added 2 commits June 22, 2023 14:36

add a comment about sync order

9d78b5d

RAT for readme

3b5ee6c

paleolimbot merged commit 086793a into apache:main Jun 22, 2023

paleolimbot deleted the device-array-ext branch September 19, 2023 20:31

		// TODO: All these buffer copiers would benefit from cudaMemcpyAsync but there is
		// no good way to incorporate that just yet

		NANOARROW_RETURN_NOT_OK(ArrowDeviceBufferInit(device_src, buffer_view_src, device_dst,
		ArrowArrayBuffer(dst, i)));

feat(extensions/nanoarrow_device): Draft DeviceArray interface #205

feat(extensions/nanoarrow_device): Draft DeviceArray interface #205

Conversation

paleolimbot commented May 25, 2023 • edited Loading

codecov-commenter commented May 26, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zeroshade left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

paleolimbot commented Jun 22, 2023 • edited Loading

paleolimbot commented May 25, 2023 •

edited

Loading

codecov-commenter commented May 26, 2023 •

edited

Loading

paleolimbot commented Jun 22, 2023 •

edited

Loading