Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(python): Document, prefix, and add reprs for C-wrapping classes #340

Merged
merged 50 commits into from
Jan 11, 2024

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Dec 15, 2023

This PR was inspired #319 but only addresses the first half (prefixes C-wrapping classes so that the name nanoarrow.array() can be used for a future class/constructor that more closely resembles a pyarrow.Array or numpy.Array.

This PR does a few things:

  • Uses capsules to manage allocate/cleanup of C resources instead of "holder" objects. This eliminated some code and in theory makes it possible to move some pieces out of Cython into C.
  • Renames any "nanoarrow C library binding" classes to start with C (e.g., Schema to CSchema). I made them slightly more literal as well. Basically, these classes are about accessing the fields of the structure without segfaulting. In a potential future world where we don't use Cython, this is something like what we'd get with auto-generated wrapper classes or thin C++ wrappers with generated binding code.
  • Opens the door for the user-facing versions of these: Array, Schema, and an ArrayStream. The scope and design of those requires more iteration than this PR allows and would benefit from some other infrastructure to be in place first (e.g., convert to/from Python)

To make it a little more clear what the existing structures actually are and what they can do, I added repr()s for them and updated the README. Briefly:

import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>

This involved fixing the existing BufferView since to print their contents in a repr-friendly way the elements had to be accessed. I think the BufferView will see some changes but it does seem relatively performant:

import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit %timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@codecov-commenter
Copy link

codecov-commenter commented Dec 16, 2023

Codecov Report

Attention: 23 lines in your changes are missing coverage. Please review.

Comparison is base (b636a8f) 88.10% compared to head (8570a08) 86.79%.

Files Patch % Lines
python/src/nanoarrow/_lib_utils.py 85.36% 12 Missing ⚠️
python/src/nanoarrow/c_lib.py 79.62% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
- Coverage   88.10%   86.79%   -1.32%     
==========================================
  Files          74        6      -68     
  Lines       11937      212   -11725     
==========================================
- Hits        10517      184   -10333     
+ Misses       1420       28    -1392     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@paleolimbot paleolimbot marked this pull request as ready for review December 16, 2023 02:49
@paleolimbot paleolimbot changed the title refactor(python): Prefix C-wrapping classes and move them out of the base nanoarrow import refactor(python): Prefix C-wrapping classes and better document them Jan 4, 2024
@paleolimbot paleolimbot changed the title refactor(python): Prefix C-wrapping classes and better document them refactor(python): Document, prefix, and add reprs for C-wrapping classes Jan 5, 2024
Comment on lines 18 to 24
from nanoarrow._lib import cversion # noqa: F401
from nanoarrow.clib import ( # noqa: F401
cschema,
carray,
carray_stream,
cschema_view,
carray_view,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if it is considered more readable to use the naming convention c_<object> instead of c<object> since we are using underscores instead of camel case? This seems to be the pattern in the PyCapsule example[1] (see c_api_import variable name) and in the Arrow PyCapsule docs[2] (e.g. __arrow_c_schema__).

I'm also fine with it as-is if this is already a known pattern or is just the preferred option.

[1]https://docs.python.org/3/extending/extending.html#using-capsules
[2]https://arrow.apache.org/docs/dev/format/CDataInterface/PyCapsuleInterface.html#export-protocol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it! Done!

Copy link
Member

@danepitkin danepitkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! However, I'm not a cython expert so will let other comment there.



def device_array(obj):
if isinstance(obj, DeviceArray):
def cdevice_array(obj):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can also change to c_device_array here!

@@ -15,5 +15,11 @@
# specific language governing permissions and limitations
# under the License.

from ._lib import Array, ArrayStream, ArrayView, Schema, c_version # noqa: F401
from .lib import array, array_stream, schema, array_view # noqa: F401
from nanoarrow._lib import cversion # noqa: F401
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth changing this back to c_version?

python/src/nanoarrow/_lib.pyx Outdated Show resolved Hide resolved
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lot's of changes ;) I added a bunch of comments, but looking good!

# specific language governing permissions and limitations
# under the License.

[settings]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't put this in the pyproject.toml because that's not top-level?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(btw, for another PR, but I would also switch to use ruff for linting, that also includes the functionality of isort)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes on both counts! I switched geoarrow-pyarrow to ruff and it was great.

python/README.md Outdated
Comment on lines 78 to 97
schema = na.Schema.allocate()
schema = na.c_schema()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I find the previous way more explicit ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But seeing the version below for Array, I admit that there it is a little inconvenient you need pass an allocated schema to the Array allocation (although this could also be done for the user automatically?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great point...na.c_schema() can in theory be used to sanitize input, and allocating a blank one has a totally different use case.

I updated this to nanoarrow.allocate_c_XXX() for now...I'm not sure the CSchema family of classes should be in the root namespace and defining the function in Python gives better documentation when typing in an IDE 🤷

cdef void pycapsule_schema_deleter(object schema_capsule) noexcept:
cdef ArrowSchema* schema = <ArrowSchema*>PyCapsule_GetPointer(
schema_capsule, 'arrow_schema'
)
if schema.release != NULL:
ArrowSchemaRelease(schema)

free(schema)
ArrowFree(schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my education: is there a benefit in using the nanoarrow version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did it for consistency with what I do in R, and because in theory we could someday add some debug check or bookkeeping to debug ArrowMalloc/ArrowFree (whereas we can't for malloc/free other than valgrind). I'm not sure I will ever get to that, though 🙂


def _addr(self):
return <uintptr_t>&self.c_array_stream
# To more safely implement export of an ArrowArray whose address may be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW you can also add this as a normal docstring to the function

that return Python objects and handles the C Data interface lifecycle (i.e., initialized
ArrowSchema structures are always released).

See `nanoarrow.c_schema()` for construction and usage examples.
"""
cdef object _base
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This _base is now always a capsule? (if so, maybe add a comment saying that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, although I'm not sure that it always will be (I added a comment).

@@ -804,64 +761,6 @@ cdef class SchemaMetadata:
yield key_obj, value_obj


cdef class ArrayChildren:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see those Children classes removed! ;)

Comment on lines +843 to +858
def __getitem__(self, int64_t i):
if i < 0 or i >= self._shape:
raise IndexError(f"Index {i} out of range")
cdef int64_t offset = self._strides * i
value = unpack_from(self.format, buffer=self, offset=offset)
if len(value) == 1:
return value[0]
else:
return value

def __iter__(self):
for value in iter_unpack(self.format, self):
if len(value) == 1:
yield value[0]
else:
yield value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Python memoryview object supports this kind of indexing, and a conversion to a python list as well (https://docs.python.org/3/library/stdtypes.html#memoryview.tolist). So a potential alternative is to reuse that (memoryview(self).tolist()) might work out of the box)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems that this doesn't work with the endianness "=" you added below to the format type of the buffer protocol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the memoryview thing would be way better...I'm sure it can be workshopped to work most of the time. I used = because "standard width" sounded appealing but I'm sure we can do some runtime check once to maximize the built-in functionlity of the memoryview. (I'll defer improvements to the buffer view for a future PR).

else:
return "B"
snprintf(self._format, sizeof(self._format), "%ds", self._element_size_bits // 8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed (compared to just returning the string as was done before)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fixed-size binary/decimal the string has to be dynamically generated (e.g., 10s), and it's slightly easier to just always point at self._format after doing this step.

@@ -947,88 +926,28 @@ cdef class BufferView:
def __releasebuffer__(self, Py_buffer *buffer):
pass

def __repr__(self):
return _lib_utils.buffer_view_repr(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to include a name here as well for the standalone repr (the util function only gives you the content, which is useful for including it into another repr).
Something like

Suggested change
return _lib_utils.buffer_view_repr(self)
return f"nanoarrow.c_lib.BufferView {_lib_utils.buffer_view_repr(self)[1:]}"

(the slicing is because it already starts with a < (that could also be changed in the util function)


lines.append(f"- {attr_name}: {repr(attr_value)}")

return "\n".join(lines)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to show something about the children here?

Because right now for example for a list type, the schema view repr is less informative than the main schema repr:

In [68]: schema
Out[68]: 
a: int64
b: list<item: double>
  child 0, item: double

In [69]: na.c_schema(schema).child(1)
Out[69]: 
<nanoarrow.c_lib.CSchema list>
- format: '+l'
- name: 'b'
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[1]:
  'item': <nanoarrow.c_lib.CSchema double>
    - format: 'g'
    - name: 'item'
    - flags: 2
    - metadata: NULL
    - dictionary: NULL
    - children[0]:

In [70]: na.c_schema_view(na.c_schema(schema).child(1))
Out[70]: 
<nanoarrow.c_lib.CSchemaView>
- type: 'list'
- storage_type: 'list'

So the schema view repr doesn't say what type of list it is (just "list")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you got there, but the SchemaView doesn't know about children (although there should definitely be a class that does have all that information, it's just not implemented yet).

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jan 10, 2024

This involved fixing the existing BufferView since to print their contents in a repr-friendly way the elements had to be accessed. I think the BufferView will see some changes but it does seem relatively performant:

A better comparison is probably numpy, because the pyarrow's to_pylist is notoriously slow (apache/arrow#28694):

In [76]:  %timeit list(na_array_view.buffer(1))
93.6 ms ± 8.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [78]: %timeit np_array.tolist()
22.5 ms ± 622 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

But still not too bad ;)

And the memoryview option I mentioned:

In [81]: %timeit memoryview(np_array).tolist()
31.3 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now, this conversion to a list is mostly for the repr, where it is truncated to a few elements anyway, and so performance isn't really important anyway, I think.

paleolimbot added a commit that referenced this pull request Jan 10, 2024
This PR adds a weekly/workflow dispatch job for building and testing
Python wheels. This required a few housekeeping items:

- Versioning the python package. I used the approach from ADBC, which is
a modified 'miniver'. Basically, just set the version as a string using
a regex replace when needed.
- The bootstrap.py logic was updated to use a proper temporary directory
- Tests were updated to skip instead of fail when pyarrow/numpy are not
available (because I can never remember which platforms they will or
won't install on and the default cibuildwheel grid is large).
- I hadn't tested install from sdist, so a few files were missing from
the manifest.

At least one test doesn't pass on 32-bit Windows (already fixed in
#340). For now I just enabled the version tests to make sure everything
built/linked properly.

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@paleolimbot paleolimbot merged commit 72a2e67 into apache:main Jan 11, 2024
2 checks passed
@paleolimbot paleolimbot deleted the python-refactor branch January 11, 2024 19:17
@paleolimbot paleolimbot added this to the nanoarrow 0.4.0 milestone Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants