-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(python): Document, prefix, and add reprs for C-wrapping classes #340
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #340 +/- ##
==========================================
- Coverage 88.10% 86.79% -1.32%
==========================================
Files 74 6 -68
Lines 11937 212 -11725
==========================================
- Hits 10517 184 -10333
+ Misses 1420 28 -1392 ☔ View full report in Codecov by Sentry. |
fecda6d
to
c9d1bd7
Compare
python/src/nanoarrow/__init__.py
Outdated
from nanoarrow._lib import cversion # noqa: F401 | ||
from nanoarrow.clib import ( # noqa: F401 | ||
cschema, | ||
carray, | ||
carray_stream, | ||
cschema_view, | ||
carray_view, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wonder if it is considered more readable to use the naming convention c_<object>
instead of c<object>
since we are using underscores instead of camel case? This seems to be the pattern in the PyCapsule example[1] (see c_api_import
variable name) and in the Arrow PyCapsule docs[2] (e.g. __arrow_c_schema__
).
I'm also fine with it as-is if this is already a known pattern or is just the preferred option.
[1]https://docs.python.org/3/extending/extending.html#using-capsules
[2]https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#export-protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it! Done!
9ac995b
to
bc72bc4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! However, I'm not a cython expert so will let other comment there.
python/src/nanoarrow/device.py
Outdated
|
||
|
||
def device_array(obj): | ||
if isinstance(obj, DeviceArray): | ||
def cdevice_array(obj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also change to c_device_array
here!
python/src/nanoarrow/__init__.py
Outdated
@@ -15,5 +15,11 @@ | |||
# specific language governing permissions and limitations | |||
# under the License. | |||
|
|||
from ._lib import Array, ArrayStream, ArrayView, Schema, c_version # noqa: F401 | |||
from .lib import array, array_stream, schema, array_view # noqa: F401 | |||
from nanoarrow._lib import cversion # noqa: F401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it worth changing this back to c_version
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot's of changes ;) I added a bunch of comments, but looking good!
# specific language governing permissions and limitations | ||
# under the License. | ||
|
||
[settings] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't put this in the pyproject.toml because that's not top-level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(btw, for another PR, but I would also switch to use ruff for linting, that also includes the functionality of isort)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes on both counts! I switched geoarrow-pyarrow to ruff and it was great.
python/README.md
Outdated
schema = na.Schema.allocate() | ||
schema = na.c_schema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I find the previous way more explicit ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But seeing the version below for Array, I admit that there it is a little inconvenient you need pass an allocated schema to the Array allocation (although this could also be done for the user automatically?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great point...na.c_schema()
can in theory be used to sanitize input, and allocating a blank one has a totally different use case.
I updated this to nanoarrow.allocate_c_XXX()
for now...I'm not sure the CSchema
family of classes should be in the root namespace and defining the function in Python gives better documentation when typing in an IDE 🤷
cdef void pycapsule_schema_deleter(object schema_capsule) noexcept: | ||
cdef ArrowSchema* schema = <ArrowSchema*>PyCapsule_GetPointer( | ||
schema_capsule, 'arrow_schema' | ||
) | ||
if schema.release != NULL: | ||
ArrowSchemaRelease(schema) | ||
|
||
free(schema) | ||
ArrowFree(schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education: is there a benefit in using the nanoarrow version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did it for consistency with what I do in R, and because in theory we could someday add some debug check or bookkeeping to debug ArrowMalloc/ArrowFree (whereas we can't for malloc/free other than valgrind). I'm not sure I will ever get to that, though 🙂
python/src/nanoarrow/_lib.pyx
Outdated
|
||
def _addr(self): | ||
return <uintptr_t>&self.c_array_stream | ||
# To more safely implement export of an ArrowArray whose address may be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW you can also add this as a normal docstring to the function
that return Python objects and handles the C Data interface lifecycle (i.e., initialized | ||
ArrowSchema structures are always released). | ||
|
||
See `nanoarrow.c_schema()` for construction and usage examples. | ||
""" | ||
cdef object _base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This _base
is now always a capsule? (if so, maybe add a comment saying that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, although I'm not sure that it always will be (I added a comment).
@@ -804,64 +761,6 @@ cdef class SchemaMetadata: | |||
yield key_obj, value_obj | |||
|
|||
|
|||
cdef class ArrayChildren: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see those Children classes removed! ;)
def __getitem__(self, int64_t i): | ||
if i < 0 or i >= self._shape: | ||
raise IndexError(f"Index {i} out of range") | ||
cdef int64_t offset = self._strides * i | ||
value = unpack_from(self.format, buffer=self, offset=offset) | ||
if len(value) == 1: | ||
return value[0] | ||
else: | ||
return value | ||
|
||
def __iter__(self): | ||
for value in iter_unpack(self.format, self): | ||
if len(value) == 1: | ||
yield value[0] | ||
else: | ||
yield value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Python memoryview object supports this kind of indexing, and a conversion to a python list as well (https://docs.python.org/3/library/stdtypes.html#memoryview.tolist). So a potential alternative is to reuse that (memoryview(self).tolist())
might work out of the box)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it seems that this doesn't work with the endianness "=" you added below to the format type of the buffer protocol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the memoryview thing would be way better...I'm sure it can be workshopped to work most of the time. I used =
because "standard width" sounded appealing but I'm sure we can do some runtime check once to maximize the built-in functionlity of the memoryview. (I'll defer improvements to the buffer view for a future PR).
else: | ||
return "B" | ||
snprintf(self._format, sizeof(self._format), "%ds", self._element_size_bits // 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed (compared to just returning the string as was done before)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For fixed-size binary/decimal the string has to be dynamically generated (e.g., 10s
), and it's slightly easier to just always point at self._format
after doing this step.
python/src/nanoarrow/_lib.pyx
Outdated
@@ -947,88 +926,28 @@ cdef class BufferView: | |||
def __releasebuffer__(self, Py_buffer *buffer): | |||
pass | |||
|
|||
def __repr__(self): | |||
return _lib_utils.buffer_view_repr(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to include a name here as well for the standalone repr (the util function only gives you the content, which is useful for including it into another repr).
Something like
return _lib_utils.buffer_view_repr(self) | |
return f"nanoarrow.c_lib.BufferView {_lib_utils.buffer_view_repr(self)[1:]}" |
(the slicing is because it already starts with a <
(that could also be changed in the util function)
|
||
lines.append(f"- {attr_name}: {repr(attr_value)}") | ||
|
||
return "\n".join(lines) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to show something about the children here?
Because right now for example for a list type, the schema view repr is less informative than the main schema repr:
In [68]: schema
Out[68]:
a: int64
b: list<item: double>
child 0, item: double
In [69]: na.c_schema(schema).child(1)
Out[69]:
<nanoarrow.c_lib.CSchema list>
- format: '+l'
- name: 'b'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[1]:
'item': <nanoarrow.c_lib.CSchema double>
- format: 'g'
- name: 'item'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
In [70]: na.c_schema_view(na.c_schema(schema).child(1))
Out[70]:
<nanoarrow.c_lib.CSchemaView>
- type: 'list'
- storage_type: 'list'
So the schema view repr doesn't say what type of list it is (just "list")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see you got there, but the SchemaView doesn't know about children (although there should definitely be a class that does have all that information, it's just not implemented yet).
A better comparison is probably numpy, because the pyarrow's
But still not too bad ;) And the memoryview option I mentioned:
Now, this conversion to a list is mostly for the repr, where it is truncated to a few elements anyway, and so performance isn't really important anyway, I think. |
This PR adds a weekly/workflow dispatch job for building and testing Python wheels. This required a few housekeeping items: - Versioning the python package. I used the approach from ADBC, which is a modified 'miniver'. Basically, just set the version as a string using a regex replace when needed. - The bootstrap.py logic was updated to use a proper temporary directory - Tests were updated to skip instead of fail when pyarrow/numpy are not available (because I can never remember which platforms they will or won't install on and the default cibuildwheel grid is large). - I hadn't tested install from sdist, so a few files were missing from the manifest. At least one test doesn't pass on 32-bit Windows (already fixed in #340). For now I just enabled the version tests to make sure everything built/linked properly. --------- Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
6f0bdd7
to
ccc2488
Compare
This PR was inspired #319 but only addresses the first half (prefixes C-wrapping classes so that the name
nanoarrow.array()
can be used for a future class/constructor that more closely resembles apyarrow.Array
ornumpy.Array
.This PR does a few things:
C
(e.g.,Schema
toCSchema
). I made them slightly more literal as well. Basically, these classes are about accessing the fields of the structure without segfaulting. In a potential future world where we don't use Cython, this is something like what we'd get with auto-generated wrapper classes or thin C++ wrappers with generated binding code.Array
,Schema
, and anArrayStream
. The scope and design of those requires more iteration than this PR allows and would benefit from some other infrastructure to be in place first (e.g., convert to/from Python)To make it a little more clear what the existing structures actually are and what they can do, I added
repr()
s for them and updated the README. Briefly:This involved fixing the existing
BufferView
since to print their contents in a repr-friendly way the elements had to be accessed. I think theBufferView
will see some changes but it does seem relatively performant: