feat(python): Add array creation/building from buffers #378
Conversation
This first set of changes looks good!
python/src/nanoarrow/_lib.pyx (outdated diff)
```diff
 cdef Py_ssize_t _element_size_bits
 cdef Py_ssize_t _shape
 cdef Py_ssize_t _strides
 cdef char _format[128]

-def __cinit__(self, object base, uintptr_t addr,
+def __cinit__(self, object base, uintptr_t addr, int64_t size_bytes,
               ArrowBufferType buffer_type, ArrowType buffer_data_type,
```
Not related to this PR, just noticing while reviewing: should the buffer type be part of the buffer? That doesn't necessarily make sense for a buffer itself, only in context of a buffer as part of an array. But if you create a standalone buffer, then it doesn't make sense?
It also seems we currently don't use this attribute, except for the public `type` property.
(I know the same is true for the data type, but at least that one you need to support exporting the buffer to an array / the buffer protocol)
Good call! It's used in the arrayview repr (so I moved it to the ArrayView where it's more relevant).
python/README.md (outdated diff)
If you need to debug or edit any .c files used to build the native extension, you can generate the `compile_commands.json` usable by most IDEs/clang tools using [Bear](https://github.com/rizsotto/Bear).

```shell
bear -- python setup.py build_ext --inplace && mv compile_commands.json build
```
BTW, it's a bit cryptic to me what this does / how it can help, but I am also not familiar with the tool or with compile_commands.
I'll leave it out for now since it's probably just me who has to remember that command. If build/compile_commands.json is present, most editors "just work" with the .c/.cc/.h files without additional configuration.
Codecov Report

```
@@            Coverage Diff             @@
##             main     #378      +/-   ##
==========================================
+ Coverage   87.89%   88.23%   +0.34%
==========================================
  Files          76       76
  Lines       13371    13840     +469
==========================================
+ Hits        11752    12212     +460
- Misses       1619     1628       +9
==========================================
```
python/src/nanoarrow/_lib.pyx (outdated diff)
```diff
 @property
-def element_size_bits(self):
-    return self._element_size_bits
+def item_size(self):
+    self._refresh_view_if_needed()
+    return self._view.item_size
```
The difference between both is bits vs bytes? (if so, we might want to use a more consistent naming scheme? item vs element)
I updated the `CBufferView`, the `CBuffer`, and the `CBufferBuilder` to all use `item`(s) for the definition of item that is used by the buffer protocol/memoryview, and `element` to support the definition we use, which is basically the same except it supports bitmaps. I also added a way to extract bitmaps since we'll need them shortly.
I see. Can you add some docstrings to clarify that difference between "item_.." and "element_.."?
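The item/element distinction the docstrings should capture can be illustrated with a pure-Python sketch (this is not the nanoarrow API, just an illustration): for a bitmap buffer, the buffer-protocol "item" is a whole byte, while the Arrow "element" is a single bit.

```python
# Illustrative only: pack booleans Arrow-style (LSB-first) to show why
# item_size (bytes, buffer protocol) differs from element_size_bits (bits).

def pack_bitmap(bools):
    """Pack an iterable of booleans into Arrow-style LSB-first bytes."""
    data = bytearray((len(bools) + 7) // 8)
    for i, value in enumerate(bools):
        if value:
            data[i // 8] |= 1 << (i % 8)
    return bytes(data)

bitmap = pack_bitmap([True, False, True, True])
item_size_bytes = 1    # buffer protocol view: one uint8 item per byte
element_size_bits = 1  # Arrow view: one boolean element per bit
print(len(bitmap))     # one byte holds all four one-bit elements
```

Four elements, one item: any naming scheme needs to make that asymmetry obvious.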
python/src/nanoarrow/c_lib.py (outdated diff)
```python
# Try import of bare capsule
if _obj_is_capsule(obj, "arrow_array"):
    if requested_schema is None:
        requested_schema_capsule = CSchema.allocate()._capsule
```
How does the import work if you don't have an actual schema? (I suppose the above allocates an "empty" ArrowSchema?)
Yes, it will leave you with an invalid schema. Nothing preventing the print method from working, though! (Or from using `._addr` to safely pass it somewhere else.)

```python
import nanoarrow as na
import pyarrow as pa

na.c_array(pa.array([1, 2, 3]).__arrow_c_array__()[1])
# <nanoarrow.c_lib.CArray [invalid: schema is released]>
# - length: 3
# - offset: 0
# - null_count: 0
# - buffers: (0, 6021410128064)
# - dictionary: NULL
# - children[0]:
```
```cython
c_buffer[0].data = <uint8_t*>buffer.buf
c_buffer[0].size_bytes = <int64_t>buffer.len
c_buffer[0].capacity_bytes = 0
c_buffer[0].allocator = c_pybuffer_deallocator(&buffer)
```
Ah, I see, you need to call `PyBuffer_Release`, and I assume Python will ensure it increases the ref on the object / keeps it alive until that release is called (so essentially the same as with the C Data Interface release callback).
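That pinning behaviour is observable from pure Python, which may help make the `PyBuffer_Release` contract concrete (this is stdlib behaviour, not nanoarrow code): an exporter cannot be resized while a buffer view on it is alive, and `memoryview.release()` plays the role of `PyBuffer_Release`.

```python
# The CPython buffer protocol pins the exporter while a view is live.
ba = bytearray(b"abc")
view = memoryview(ba)

try:
    ba.append(0)  # resizing while exported is forbidden
except BufferError:
    print("buffer is pinned while exported")

view.release()    # the Python-level equivalent of PyBuffer_Release
ba.append(0)      # resizing works again once the borrow ends
print(len(ba))    # 4
```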
Another batch of comments ;)
Most are minor details, but you will notice that the bulk of the comments has to do with CBuffer vs CBufferBuilder, and CBuffer being a "mutable" class with those set_.. methods. My initial reaction was that it might be cleaner to separate those two more clearly (eg CBuffer doesn't need those set_ methods, and the builder can maybe not subclass CBuffer / CBufferBuilder.finish can return a CBuffer instead of itself).
(Not having those set_.. methods on CBuffer might also avoid the need for the `_assert_buffer_count_zero` / `_refresh_view_if_needed` checks?)

Although while writing this now, I also realize that in Arrow C++ we have the concept of a Buffer vs a Mutable/ResizableBuffer as well.
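The separation being suggested can be sketched in a few lines (class and method names here are illustrative, not the actual nanoarrow classes): the builder is the only mutable object, and `finish()` hands back an immutable buffer with no `set_` methods.

```python
# Hypothetical sketch of separating an immutable buffer from its builder.

class ImmutableBuffer:
    def __init__(self, data: bytes):
        self._data = bytes(data)  # defensive copy; no set_ methods exist

    def __len__(self):
        return len(self._data)

    def to_bytes(self):
        return self._data

class BufferBuilder:
    def __init__(self):
        self._scratch = bytearray()

    def write(self, data: bytes):
        self._scratch += data
        return self

    def finish(self) -> ImmutableBuffer:
        # finish() returns a standalone buffer instead of the builder itself
        out = ImmutableBuffer(self._scratch)
        self._scratch = bytearray()
        return out

buf = BufferBuilder().write(b"12").write(b"34").finish()
```

With this split, nothing like `_refresh_view_if_needed` is needed on the buffer side, because a finished buffer can never change underneath a view.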
python/src/nanoarrow/_lib.pyx (outdated diff)
```cython
    self._reset_view()
    return self

def set_pybuffer(self, obj):
```
Instead of having this "mutable" interface, essentially this function just creates a CBuffer from a Python buffer, right? So it could also be a class method, and instead of using `CBuffer().set_pybuffer(obj)` one would do `CBuffer.from_pybuffer(obj)`?
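The classmethod-constructor style being suggested looks roughly like this (a minimal stand-in, not the real nanoarrow `CBuffer`): construction and population happen in one step, so no half-initialized mutable object is ever visible.

```python
# Minimal sketch of a classmethod constructor replacing a set_ method.

class CBuffer:
    def __init__(self):
        self._data = b""

    @classmethod
    def from_pybuffer(cls, obj) -> "CBuffer":
        self = cls()
        with memoryview(obj) as view:    # borrow via the buffer protocol
            self._data = view.tobytes()  # the borrow ends on exit
        return self

buf = CBuffer.from_pybuffer(b"abc")
```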
What I think I find confusing about those `set_` methods is that they make CBuffer look like a builder class, while there is a separate BufferBuilder class?
I removed them! The `CBuffer` should now be immutable (except in a few limited internal cases).
python/src/nanoarrow/c_lib.py (outdated diff)
```python
# Set buffers. This moves ownership of the buffers as well (i.e., the objects
# in the input buffers are replaced with an empty ArrowBuffer)
```
This was a bit confusing (I ran into this while trying out the method):
```python
In [74]: buf = na.c_buffer(b"0123")

In [75]: na.c_array_from_buffers(na.int32(), 1, [None, buf])
Out[75]:
<nanoarrow.c_lib.CArray int32>
- length: 1
- offset: 0
- null_count: 0
- buffers: (0, 139688822562528)
- dictionary: NULL
- children[0]:

In [76]: na.c_array_from_buffers(na.int32(), 1, [None, buf])
---------------------------------------------------------------------------
NanoarrowException                        Traceback (most recent call last)
Cell In [76], line 1
----> 1 na.c_array_from_buffers(na.int32(), 1, [None, buf])

File ~/scipy/repos/arrow-nanoarrow/python/src/nanoarrow/c_lib.py:287, in c_array_from_buffers(schema, length, buffers, null_count, offset, children, validation_level)
    284 builder.resolve_null_count()
    286 # Validate + finish
--> 287 return builder.finish(validation_level=validation_level)

File src/nanoarrow/_lib.pyx:1818, in nanoarrow._lib.CArrayBuilder.finish()

File src/nanoarrow/_lib.pyx:422, in nanoarrow._lib.Error.raise_message_not_ok()

File src/nanoarrow/_lib.pyx:417, in nanoarrow._lib.Error.raise_message()

NanoarrowException: ArrowArrayFinishBuildingDefault() failed (22): Expected int32 array buffer 1 to have size >= 4 bytes but found buffer with 0 bytes

In [77]: buf
Out[77]: CBuffer(uint8[0 b] )
```
When passing a Python object through the buffer protocol, the original object keeps owning the buffer content. Can't we do something similar for CBuffer? (And is that because the CArray doesn't actually keep track of CBuffer objects, but only of the C ArrowArray that has a pointer to the buffer?)
And where does this actually happen? I don't see anything in `CArrayBuilder.set_buffer` that would invalidate the buffer?
> And where does this actually happen? I don't see anything in `CArrayBuilder.set_buffer` that would invalidate the buffer?

Ah, I suppose it is the `ArrowBufferMove` in the C code that does this?
I made the `ArrowBufferMove()` explicit (and made the "move + invalidate previous" behaviour opt-in, since it's definitely confusing if you did not expect it to happen).
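The move semantics being discussed can be modeled in a few lines of Python (purely illustrative; the real move happens in C via `ArrowBufferMove`, and these class/method names are hypothetical): moving transfers the bytes and leaves the source buffer empty, which is exactly why reusing a moved-from buffer fails in the transcript above.

```python
# Toy model of opt-in move semantics for a buffer-owning object.

class OwnedBuffer:
    def __init__(self, data: bytes = b""):
        self.data = bytearray(data)

    def set_from(self, other: "OwnedBuffer", move: bool = False):
        if move:
            self.data = other.data
            other.data = bytearray()           # source is invalidated
        else:
            self.data = bytearray(other.data)  # copy; source untouched

src = OwnedBuffer(b"0123")
dst = OwnedBuffer()
dst.set_from(src, move=True)
print(len(src.data))  # 0: src no longer owns the bytes
```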
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
python/src/nanoarrow/_lib.pyx (outdated diff)
```python
if requested_schema is not None:
    raise NotImplementedError("requested_schema")
```
Is there a reason this was removed? (because it's still not supported I think, the keyword is just ignored otherwise)
I'll add it back...I had thought requesting a schema was best-effort and not cast-or-error, but I just re-read the final language.

If you're a consumer passing a `requested_schema` through the protocol, it would be a very bad idea not to check that the schema you received was the one you requested (because if the producer got it wrong, you will crash). For producers like this one that don't support casting, it forces that check to happen twice (if you want to support the case where `schema` is already the correct type). If you can do casting but only support some of the conversions (e.g., you can return stringviews but not listviews), the current language of the spec forces you to error.

(I know, a discussion for another place.)
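For producers that cannot cast, the behaviour being restored is simply to reject any `requested_schema` rather than silently ignore it. A minimal sketch (class name and constructor are hypothetical stand-ins, not nanoarrow's real code; only the `__arrow_c_array__` signature comes from the Arrow PyCapsule protocol):

```python
# Producer-side sketch: no casting support, so any requested_schema errors.

class ArrayExporter:
    def __init__(self, schema_capsule, array_capsule):
        self._schema = schema_capsule
        self._array = array_capsule

    def __arrow_c_array__(self, requested_schema=None):
        if requested_schema is not None:
            # Error rather than hand back a schema the consumer didn't ask for
            raise NotImplementedError("requested_schema")
        return self._schema, self._array
```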
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
```cython
    self._ptr.null_count = self._ptr.length - count
    return self

def set_buffer(self, int64_t i, CBuffer buffer, move=False):
```
Can you add a bit of a comment/docstring explaining what move=False vs True exactly does?
I see below that the docstring of the public function `c_array_from_buffers` has an explanation: "Use `True` to move ownership of any input buffers or children to the output array." I would repeat that here for when looking at the code. And it's also not super clear to me what the consequence of doing that is, exactly. Or, why would you want to set it to True?
I added a docstring and also set it to `True` in the appropriate places in `c_lib`. It's to avoid the situation where you have an `ArrowBuffer` wrapping a `Py_buffer` wrapping an `ArrowBuffer` wrapping a `Py_buffer` (which is what would have happened before the last commit if `move=False` and `buffers=(None, <some numpy array>)`).
Thanks, that expanded docstring explanation helps a lot!
python/src/nanoarrow/_lib.pyx (outdated diff)
```cython
# Flush the buffer address from the buffer into the ArrowArray struct
self._ptr.buffers[i] = ArrowArrayBuffer(self._ptr, i).data
```
What does this exactly do?
If you have `ArrowArray array`, `array->buffers` is a simple array of `const void*` and doesn't know how to find the `ArrowBuffer`'s `data` member. `ArrowArraySetBuffer()` and `ArrowArrayFinishBuilding()` both take care of this bit of synchronization before the array is ready for outside consumption.
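The "flush" can be modeled with a toy Python sketch (illustrative only; the real synchronization copies each `ArrowBuffer`'s `data` pointer into `array->buffers[i]` in C, and these class names are hypothetical): the array keeps a plain list of raw addresses that must be re-synced from the rich buffer objects before export.

```python
# Toy model: raw pointer slots re-synced from owned buffer objects.

class ToyBuffer:
    def __init__(self):
        self.data = bytearray()

    def address(self):
        return id(self.data)  # stand-in for the C data pointer

class ToyArray:
    def __init__(self, n_buffers):
        self.buffers = [None] * n_buffers           # like array->buffers
        self._owned = [ToyBuffer() for _ in range(n_buffers)]

    def flush_buffer(self, i):
        # Mirror the owned buffer's current address into the raw slot
        self.buffers[i] = self._owned[i].address()

array = ToyArray(2)
array._owned[1].data += b"abcd"  # buffer (re)allocated by the builder
array.flush_buffer(1)            # raw pointer must be refreshed afterward
```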
The gist of this PR is that I'd like the ability to create arrays for testing without pyarrow so that nanoarrow's tests can run in more places. Other than building/running in odd corner-case environments, nanoarrow in R has been great at prototyping and/or creating test data (e.g., an array with a non-zero offset, an array with a rarely-used type). This is useful both for nanoarrow to test itself and perhaps for others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure out what the end point was; however, I'm happy to split it into smaller self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that supports the buffer protocol. Importantly, this includes numpy arrays, so that you can do things like generate arrays with `n` random numbers.

```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'uint8'
- length: 5
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <uint8[5 b] 49 50 51 52 53>
- dictionary: NULL
- children[0]:
```

```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```

While not built in to the main `c_array()` constructor, we can also now assemble an array from buffers. This has been very useful in R and ensures that we can construct just about any array if we need to.

```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())],
        )
    ],
)

na.c_array_view(array)
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'struct'
- length: 3
- offset: 0
- null_count: 0
- buffers[1]:
  - validity <bool[0 b] >
- dictionary: NULL
- children[1]:
  - <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:
```

I also added the ability to construct a buffer from an iterable and wired that into the `c_array()` constructor, although this is probably not all that fast. It does, however, make it much easier to write tests (because many of them currently start with `na.c_array(pa.array([1, 2, 3]))`).

```python
na.c_array_view([1, 2, 3], na.int32())
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```

This allows creating a buffer from anything supported by the `struct` module, which means we can create some of the less frequently used types.

```python
na.c_buffer([1, 2, 3], na.float16())
```

```
CBuffer(half_float[6 b] 1.0 2.0 3.0)
```

```python
na.c_buffer([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```

```
CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))
```

Because it's mentally exhausting to bitpack buffers in my head and because Arrow uses them all the time, I also think it's mission-critical to be able to create bitmaps:

```python
na.c_buffer([True, False, True, True], na.bool())
```

```
CBuffer(bool[1 b] 10110000)
```

This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device. This is a bit of a problem because even though the CPU device is static and lives forever, CUDA "device" objects will probably keep a CUDA context alive. Thus, we need a strong reference to the `CDevice` Python object (which ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input, where technically all it needs is a pointer and a length. This opens it up to representing other types of buffers than just something from nanoarrow (e.g., imported from dlpack or the buffer protocol).

Implementing the buffer protocol as a consumer was done by wrapping the `ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and ensures it is released. I still need to do some testing to ensure that it's actually released and that we're not leaking memory. This is how I do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer` is helpful because the C-level array builder uses them to manage the memory and ensures they're all released when the array is released.

Implementing the build-from-iterable involved a few more things...notably, completing the "python struct format string" <-> "arrow data type" conversion. This allows the use of `struct.pack()`, which takes care of things like half-float conversion and tuples of day, month, nano conversion.

I'm aware this could use a bit better documentation of the added classes/methods...I am assuming these will be internal for the time being, but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
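The description above leans on `struct.pack()` for the type conversions; this stdlib behaviour can be verified standalone (the format strings below are plain Python `struct` codes, shown as an illustration rather than nanoarrow's exact format-string mapping):

```python
import struct

# IEEE 754 half precision: struct format "e" is 2 bytes per value,
# matching the 6-byte CBuffer(half_float[6 b] ...) shown above.
half = struct.pack("<3e", 1.0, 2.0, 3.0)
print(len(half))  # 6

# A (days, milliseconds) interval pair packs as two int32 values ("ii"),
# i.e., 8 bytes per element (3 elements -> the 24-byte buffer above).
days_ms = struct.pack("<2i", 1, 2)
print(len(days_ms))  # 8
```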