refactor(python): Document, prefix, and add reprs for C-wrapping classes #340

paleolimbot · 2023-12-15T20:59:21Z

This PR was inspired by #319 but only addresses the first half: it prefixes C-wrapping classes so that the name nanoarrow.array() can be used for a future class/constructor that more closely resembles a pyarrow.Array or numpy.ndarray.

This PR does a few things:

- Uses capsules to manage allocation/cleanup of C resources instead of "holder" objects. This eliminated some code and in theory makes it possible to move some pieces out of Cython into C.
- Renames any "nanoarrow C library binding" classes to start with C (e.g., Schema to CSchema). I made them slightly more literal as well. Basically, these classes are about accessing the fields of the structure without segfaulting. In a potential future world where we don't use Cython, this is something like what we'd get with auto-generated wrapper classes or thin C++ wrappers with generated binding code.
- Opens the door for the user-facing versions of these: Array, Schema, and an ArrayStream. The scope and design of those require more iteration than this PR allows and would benefit from some other infrastructure being in place first (e.g., conversion to/from Python).

To make it a little more clear what the existing structures actually are and what they can do, I added repr()s for them and updated the README. Briefly:

import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>

This involved fixing the existing BufferView, since printing its contents in a repr-friendly way requires accessing the elements. I think the BufferView will see some changes, but it does seem relatively performant:

import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

codecov-commenter · 2023-12-16T01:49:37Z

Codecov Report

Attention: 23 lines in your changes are missing coverage. Please review.

Comparison is base (b636a8f) 88.10% compared to head (8570a08) 86.79%.

Files	Patch %	Lines
python/src/nanoarrow/_lib_utils.py	85.36%	12 Missing ⚠️
python/src/nanoarrow/c_lib.py	79.62%	11 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
- Coverage   88.10%   86.79%   -1.32%     
==========================================
  Files          74        6      -68     
  Lines       11937      212   -11725     
==========================================
- Hits        10517      184   -10333     
+ Misses       1420       28    -1392

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

danepitkin · 2024-01-08T22:12:52Z

python/src/nanoarrow/__init__.py

+from nanoarrow._lib import cversion # noqa: F401
+from nanoarrow.clib import ( # noqa: F401
+ cschema,
+ carray,
+ carray_stream,
+ cschema_view,
+ carray_view,


nit: I wonder if it is considered more readable to use the naming convention c_<object> instead of c<object> since we are using underscores instead of camel case? This seems to be the pattern in the PyCapsule example[1] (see c_api_import variable name) and in the Arrow PyCapsule docs[2] (e.g. __arrow_c_schema__).

I'm also fine with it as-is if this is already a known pattern or is just the preferred option.

[1]https://docs.python.org/3/extending/extending.html#using-capsules
[2]https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#export-protocol

I like it! Done!

danepitkin

Overall LGTM! However, I'm not a Cython expert so I will let others comment there.

danepitkin · 2024-01-09T15:36:28Z

python/src/nanoarrow/device.py



-def device_array(obj):
- if isinstance(obj, DeviceArray):
+def cdevice_array(obj):


can also change to c_device_array here!

danepitkin · 2024-01-09T15:52:08Z

python/src/nanoarrow/__init__.py

@@ -15,5 +15,11 @@
 # specific language governing permissions and limitations
 # under the License.

-from ._lib import Array, ArrayStream, ArrayView, Schema, c_version # noqa: F401
-from .lib import array, array_stream, schema, array_view # noqa: F401
+from nanoarrow._lib import cversion # noqa: F401


is it worth changing this back to c_version?

python/src/nanoarrow/_lib.pyx

jorisvandenbossche

Lots of changes ;) I added a bunch of comments, but looking good!

jorisvandenbossche · 2024-01-10T13:39:14Z

.isort.cfg

+# specific language governing permissions and limitations
+# under the License.
+
+[settings]


We can't put this in the pyproject.toml because that's not top-level?

(btw, for another PR, but I would also switch to use ruff for linting, that also includes the functionality of isort)

Yes on both counts! I switched geoarrow-pyarrow to ruff and it was great.

jorisvandenbossche · 2024-01-10T13:41:12Z

python/README.md

-schema = na.Schema.allocate()
+schema = na.c_schema()


Personally I find the previous way more explicit ..

But seeing the version below for Array, I admit that there it is a little inconvenient that you need to pass an allocated schema to the Array allocation (although this could also be done for the user automatically?)

That's a great point...na.c_schema() can in theory be used to sanitize input, and allocating a blank one has a totally different use case.

I updated this to nanoarrow.allocate_c_XXX() for now...I'm not sure the CSchema family of classes should be in the root namespace and defining the function in Python gives better documentation when typing in an IDE 🤷

jorisvandenbossche · 2024-01-10T13:45:20Z

python/src/nanoarrow/_lib.pyx

 cdef void pycapsule_schema_deleter(object schema_capsule) noexcept:
 cdef ArrowSchema* schema = <ArrowSchema*>PyCapsule_GetPointer(
 schema_capsule, 'arrow_schema'
 )
 if schema.release != NULL:
 ArrowSchemaRelease(schema)

- free(schema)
+ ArrowFree(schema)


For my education: is there a benefit in using the nanoarrow version?

I just did it for consistency with what I do in R, and because in theory we could someday add some debug check or bookkeeping to debug ArrowMalloc/ArrowFree (whereas we can't for malloc/free other than valgrind). I'm not sure I will ever get to that, though 🙂

jorisvandenbossche · 2024-01-10T13:51:51Z

python/src/nanoarrow/_lib.pyx


- def _addr(self):
- return <uintptr_t>&self.c_array_stream
+# To more safely implement export of an ArrowArray whose address may be


FWIW you can also add this as a normal docstring to the function

jorisvandenbossche · 2024-01-10T13:55:30Z

python/src/nanoarrow/_lib.pyx

+ that return Python objects and handles the C Data interface lifecycle (i.e., initialized
+ ArrowSchema structures are always released).
+
+ See `nanoarrow.c_schema()` for construction and usage examples.
 """
 cdef object _base


This _base is now always a capsule? (if so, maybe add a comment saying that)

It is, although I'm not sure that it always will be (I added a comment).

jorisvandenbossche · 2024-01-10T14:13:15Z

python/src/nanoarrow/_lib.pyx

@@ -804,64 +761,6 @@ cdef class SchemaMetadata:
 yield key_obj, value_obj


-cdef class ArrayChildren:


Nice to see those Children classes removed! ;)

jorisvandenbossche · 2024-01-10T14:22:22Z

python/src/nanoarrow/_lib.pyx

+ def __getitem__(self, int64_t i):
+ if i < 0 or i >= self._shape:
+ raise IndexError(f"Index {i} out of range")
+ cdef int64_t offset = self._strides * i
+ value = unpack_from(self.format, buffer=self, offset=offset)
+ if len(value) == 1:
+ return value[0]
+ else:
+ return value
+
+ def __iter__(self):
+ for value in iter_unpack(self.format, self):
+ if len(value) == 1:
+ yield value[0]
+ else:
+ yield value


A Python memoryview object supports this kind of indexing, and a conversion to a Python list as well (https://docs.python.org/3/library/stdtypes.html#memoryview.tolist). So a potential alternative is to reuse that (memoryview(self).tolist() might work out of the box).

Hmm, it seems that this doesn't work with the endianness "=" you added below to the format type of the buffer protocol

I think the memoryview thing would be way better...I'm sure it can be workshopped to work most of the time. I used = because "standard width" sounded appealing but I'm sure we can do some runtime check once to maximize the built-in functionality of the memoryview. (I'll defer improvements to the buffer view for a future PR.)
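
To make the trade-off above concrete, here is a minimal standard-library sketch (not nanoarrow code). It assumes a buffer holding three 64-bit floats; `raw` and `unpack_all` are hypothetical names used only for illustration. `struct` accepts the standard-size `=` prefix, whereas `memoryview.cast()` only accepts native single-character formats, which is one place where the `=` prefix gets in the way.

```python
import struct

# Three doubles packed with the standard-size "=" prefix (hypothetical data).
raw = struct.pack("=3d", 1.0, 2.0, 3.0)


def unpack_all(fmt, buf):
    """Mirror the __iter__ in the diff: unwrap single-element tuples."""
    out = []
    for value in struct.iter_unpack(fmt, buf):
        out.append(value[0] if len(value) == 1 else value)
    return out


# struct handles the "=" prefix directly.
via_struct = unpack_all("=d", raw)

# memoryview can do the same conversion, but only for native formats.
via_memoryview = memoryview(raw).cast("d").tolist()

# Casting to a format with an explicit "=" prefix is rejected.
try:
    memoryview(raw).cast("=d")
    cast_rejected = False
except (TypeError, ValueError):
    cast_rejected = True
```

A one-time runtime check, as suggested in the reply, could plausibly choose the native format whenever it is byte-for-byte equivalent to the standard one, but that design is left open here.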

jorisvandenbossche · 2024-01-10T14:29:01Z

python/src/nanoarrow/_lib.pyx

 else:
- return "B"
+ snprintf(self._format, sizeof(self._format), "%ds", self._element_size_bits // 8)


Why is this needed (compared to just returning the string as was done before)?

For fixed-size binary/decimal the string has to be dynamically generated (e.g., 10s), and it's slightly easier to just always point at self._format after doing this step.
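
The same idea can be sketched in pure Python, assuming the element size is known in bits. `fixed_size_format` is a hypothetical helper that only mirrors the snprintf call in the diff; it is not part of nanoarrow.

```python
import struct


def fixed_size_format(element_size_bits):
    """Build a struct format such as '10s' for a fixed-width element.

    Hypothetical helper mirroring snprintf("%ds", bits // 8) from the diff.
    """
    return f"{element_size_bits // 8}s"


# An 80-bit (10-byte) element yields the "10s" example from the reply.
fmt_10 = fixed_size_format(80)

# A buffer of three 3-byte fixed-size binary values (made-up data).
data = b"abcdefghi"
fmt_3 = fixed_size_format(24)
values = [item[0] for item in struct.iter_unpack(fmt_3, data)]
```

Because the width is part of the format string, it cannot be a compile-time constant the way "B" was, which is why the string is generated at runtime.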

jorisvandenbossche · 2024-01-10T14:33:44Z

python/src/nanoarrow/_lib.pyx

@@ -947,88 +926,28 @@ cdef class BufferView:
 def __releasebuffer__(self, Py_buffer *buffer):
 pass

+ def __repr__(self):
+ return _lib_utils.buffer_view_repr(self)


It might be nice to include a name here as well for the standalone repr (the util function only gives you the content, which is useful for including it into another repr).
Something like

Suggested change

- return _lib_utils.buffer_view_repr(self)
+ return f"nanoarrow.c_lib.BufferView {_lib_utils.buffer_view_repr(self)[1:]}"

(the slicing is because it already starts with a <; that could also be changed in the util function)
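
A minimal sketch of the slicing idea, using a plain string in place of the real util output. `standalone_repr` is a hypothetical helper, and the content string is copied from the array view repr in the PR description. Note that this variant re-adds a leading `<` in front of the class name so the brackets stay balanced; that is an assumption, not what the suggestion literally produces.

```python
def standalone_repr(class_name, content_repr):
    """Prefix a content-only repr with a class name.

    content_repr is assumed to look like "<int64 data[24 b] 1 2 3>".
    """
    # Drop the leading "<" of the content and put one before the class name.
    return f"<{class_name} {content_repr[1:]}"


content = "<int64 data[24 b] 1 2 3>"
full = standalone_repr("nanoarrow.c_lib.BufferView", content)
```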

jorisvandenbossche · 2024-01-10T14:37:01Z

python/src/nanoarrow/_lib_utils.py

+
+ lines.append(f"- {attr_name}: {repr(attr_value)}")
+
+ return "\n".join(lines)


Do we want to show something about the children here?

Because right now for example for a list type, the schema view repr is less informative than the main schema repr:

In [68]: schema
Out[68]:
a: int64
b: list<item: double>
  child 0, item: double

In [69]: na.c_schema(schema).child(1)
Out[69]:
<nanoarrow.c_lib.CSchema list>
- format: '+l'
- name: 'b'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[1]:
  'item': <nanoarrow.c_lib.CSchema double>
    - format: 'g'
    - name: 'item'
    - flags: 2
    - metadata: NULL
    - dictionary: NULL
    - children[0]:

In [70]: na.c_schema_view(na.c_schema(schema).child(1))
Out[70]:
<nanoarrow.c_lib.CSchemaView>
- type: 'list'
- storage_type: 'list'

So the schema view repr doesn't say what type of list it is (just "list")

I see you got there, but the SchemaView doesn't know about children (although there should definitely be a class that does have all that information, it's just not implemented yet).
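
For illustration only, here is how a class that does know about its children could render them recursively. `FakeSchema` and `schema_repr` are hypothetical stand-ins, not nanoarrow classes or functions, and the layout only loosely imitates the CSchema repr shown above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSchema:
    """Hypothetical stand-in for a schema wrapper; not a nanoarrow class."""

    type_name: str
    format: str
    name: str = ""
    children: list = field(default_factory=list)


def schema_repr(schema, indent=0):
    """Render a schema and, recursively, all of its children."""
    pad = " " * indent
    lines = [
        f"{pad}<FakeSchema {schema.type_name}>",
        f"{pad}- format: {schema.format!r}",
        f"{pad}- name: {schema.name!r}",
        f"{pad}- children[{len(schema.children)}]:",
    ]
    for child in schema.children:
        # Render the child four spaces deeper, then strip the padding of its
        # first line so it can sit right after the child's name.
        child_text = schema_repr(child, indent + 4).lstrip()
        lines.append(f"{pad}  {child.name!r}: {child_text}")
    return "\n".join(lines)


item = FakeSchema("double", "g", name="item")
list_schema = FakeSchema("list", "+l", name="b", children=[item])
text = schema_repr(list_schema)
```

With this approach the repr of a list type names its item type, which is the information missing from the schema view repr in the example above.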

jorisvandenbossche · 2024-01-10T14:50:29Z

This involved fixing the existing BufferView, since printing its contents in a repr-friendly way requires accessing the elements. I think the BufferView will see some changes, but it does seem relatively performant:

A better comparison is probably numpy, because pyarrow's to_pylist is notoriously slow (apache/arrow#28694):

In [76]:  %timeit list(na_array_view.buffer(1))
93.6 ms ± 8.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [78]: %timeit np_array.tolist()
22.5 ms ± 622 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But still not too bad ;)

And the memoryview option I mentioned:

In [81]: %timeit memoryview(np_array).tolist()
31.3 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now, this conversion to a list is mostly for the repr, where it is truncated to a few elements anyway, so I don't think performance is really important here.
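
A truncated preview of this kind only needs to touch the first few elements, which is why full-list conversion speed matters little for the repr. The sketch below assumes a hypothetical `preview` helper and a made-up "..." marker; it is not the actual nanoarrow repr code.

```python
from itertools import islice

# Unique marker used to detect an exhausted iterator.
_SENTINEL = object()


def preview(values, max_items=3):
    """Format at most max_items elements, adding '...' if more remain."""
    it = iter(values)
    head = [str(v) for v in islice(it, max_items)]
    # Pull one more element only to learn whether anything was left out.
    if next(it, _SENTINEL) is not _SENTINEL:
        head.append("...")
    return " ".join(head)


short = preview([1, 2, 3])
long = preview(range(1_000_000))
```

Because `islice` stops after `max_items` elements, the cost is independent of the buffer length.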

This PR adds a weekly/workflow dispatch job for building and testing Python wheels. This required a few housekeeping items:

- Versioning the python package. I used the approach from ADBC, which is a modified 'miniver'. Basically, just set the version as a string using a regex replace when needed.
- The bootstrap.py logic was updated to use a proper temporary directory
- Tests were updated to skip instead of fail when pyarrow/numpy are not available (because I can never remember which platforms they will or won't install on and the default cibuildwheel grid is large).
- I hadn't tested install from sdist, so a few files were missing from the manifest.

At least one test doesn't pass on 32-bit Windows (already fixed in #340). For now I just enabled the version tests to make sure everything built/linked properly.

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>

paleolimbot marked this pull request as ready for review December 16, 2023 02:49

paleolimbot requested a review from jorisvandenbossche December 16, 2023 02:50

paleolimbot force-pushed the python-refactor branch from fecda6d to c9d1bd7 Compare January 2, 2024 17:30

paleolimbot changed the title ~~refactor(python): Prefix C-wrapping classes and move them out of the base nanoarrow import~~ refactor(python): Prefix C-wrapping classes and better document them Jan 4, 2024

paleolimbot changed the title ~~refactor(python): Prefix C-wrapping classes and better document them~~ refactor(python): Document, prefix, and add reprs for C-wrapping classes Jan 5, 2024

paleolimbot mentioned this pull request Jan 6, 2024

ci(python): Add cibuildwheel setup for Python wheels #353

Merged

danepitkin reviewed Jan 8, 2024

paleolimbot force-pushed the python-refactor branch from 9ac995b to bc72bc4 Compare January 9, 2024 02:45

danepitkin reviewed Jan 9, 2024

jorisvandenbossche reviewed Jan 10, 2024

paleolimbot added 17 commits January 10, 2024 16:20

Schema -> CSchema

e3e30cd

Array->CArray

c38abf7

DeviceArray -> CDeviceArray

460515f

use capsules instead of holders

9b0699b

move more lifecycle stuff to the same place

8485c03

remove the last Holder class

b63bbc8

remove SchemaChildren

6b76399

get rid of arraychildren

5707575

nix arrayviewchildren

7d2acb8

remove arrayviewbuffers

7433c3b

more consistent array

b336b90

run formatters

65809b8

a few more c prefixes

4aa77d7

don't import CArray/CSchema into base namespace

0364600

fix a few cython things

4f404e3

simplify a few base relationships

7a30dc0

remove unused method

be4dc23

paleolimbot and others added 21 commits January 10, 2024 16:24

move carray and friends back to root

05acb49

shuffle

f210bd8

undo rename

7d9e434

revamp readme

41d1aac

repr for schema view

7d5f0f1

add some accessors for the buffer view

21bef52

start on buffer view repr

c64d167

array view repr

49d649c

fix preconfig

a237294

fix doctests

d8a7ea1

test requested schem

92d79fa

add iterator to tests

b474415

more buffer tests

6ac57cd

test buffers

991cef5

maybe one less dot lookup

29e47d0

slightly better buffers and reprs

18ddf87

snake case c prefix

fb12b0a

clib -> c_lib

6b80909

Update python/src/nanoarrow/_lib.pyx

ced1fe4

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>

more renames

fbce805

fix init

ccc2488

paleolimbot force-pushed the python-refactor branch from 6f0bdd7 to ccc2488 Compare January 10, 2024 20:26

paleolimbot added 5 commits January 10, 2024 16:39

formatting

ed32320

lib changes

6617580

docstring updates

71bac4e

explicit allocation syntax

4bf41b7

format

8570a08

paleolimbot merged commit 72a2e67 into apache:main Jan 11, 2024
2 checks passed

paleolimbot deleted the python-refactor branch January 11, 2024 19:17

paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
