-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Arrow PyCapsule Interface #5070
Conversation
copy
method to FFI_ArrowArray
arrow-data/src/ffi.rs
Outdated
/// # Safety | ||
/// | ||
/// Must be passed a valid [FFI_ArrowArray]. | ||
pub unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel fairly strongly we should not expose this method publicly, it is very hard to use correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this doesn't seem useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation for exporting this was to enable creating new PyCapsules from rust with a destructor that releases the array, but it looks like it should be possible to manually call drop
without exporting the underlying function:
fn pycapsule_array_destructor(ffi_array: FFI_ArrowArray, _capsule_context: *mut c_void) {
drop(ffi_array)
}
let ffi_array = FFI_ArrowArray::new(...);
let array_capsule_name = CString::new("arrow_array").unwrap();
Python::with_gil(|py| {
let capsule = PyCapsule::new_with_destructor(
py,
ffi_array,
Some(array_capsule_name),
pycapsule_array_destructor,
)
.unwrap();
});
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW drop is actually a no-op. If you look at the definition it is just an empty function body that consumes the argument.
You should be able to just use PyCapsule::new
and it will work correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the pycapsule internals, but one thing I wonder is whether it runs any destructor if you use PyCapsule::new
instead of PyCapsule::new_with_destructor
. I.e. do you need even an empty
fn pycapsule_array_destructor(ffi_array: FFI_ArrowArray, _capsule_context: *mut c_void) {}
to get the callback to take ownership again of the ffi_array
to get rust to drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with the pycapsule internals, but one thing I wonder is whether it runs any destructor if you use
PyCapsule::new
instead ofPyCapsule::new_with_destructor
.
Yes, it seems it does.
https://github.com/PyO3/pyo3/blob/c8fdb806300ff82747f58bcb9ca08abd3da480b4/src/types/capsule.rs#L273-L282
and
https://github.com/PyO3/pyo3/blob/c8fdb806300ff82747f58bcb9ca08abd3da480b4/src/types/capsule.rs#L78-L84
arrow-schema/src/ffi.rs
Outdated
unsafe impl Send for FFI_ArrowSchema {} | ||
unsafe impl Sync for FFI_ArrowSchema {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we confident these are actually safe to be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I can't say I understand Send/Sync well enough to be 100% confident this is safe.
The motivation is that right now it's impossible to create a PyCapsule
instance from an FFI_ArrowSchema
because it's not Send
. (See PyCapsule::new
).
let ffi_schema = FFI_ArrowSchema::try_from(&*field).unwrap();
let schema_capsule_name = CString::new("arrow_schema").unwrap();
Python::with_gil(|py| {
let capsule = PyCapsule::new(
py,
ffi_schema,
Some(schema_capsule_name),
)
.unwrap();
});
gives
error[E0277]: `*const i8` cannot be sent between threads safely
--> src/array/point.rs:35:17
|
33 | let capsule = PyCapsule::new(
| -------------- required by a bound introduced by this call
34 | py,
35 | ffi_schema,
| ^^^^^^^^^^ `*const i8` cannot be sent between threads safely
|
= help: within `arrow::ffi::FFI_ArrowSchema`, the trait `std::marker::Send` is not implemented for `*const i8`
note: required because it appears within the type `FFI_ArrowSchema`
--> /Users/kyle/.cargo/git/checkouts/arrow-rs-53e12341924f0a1c/b1c43a4/arrow-schema/src/ffi.rs:71:12
|
71 | pub struct FFI_ArrowSchema {
| ^^^^^^^^^^^^^^^
note: required by a bound in `pyo3::types::PyCapsule::new`
--> /Users/kyle/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.20.0/src/types/capsule.rs:78:29
|
78 | pub fn new<T: 'static + Send + AssertNotZeroSized>(
| ^^^^ required by this bound in `PyCapsule::new`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That code is incorrect, IIUC. ffi_schema
is a stack-allocated Rust FFI_ArrowSchema
. You're trying to store its pointer into a PyCapsule
with Python-driven lifetime. It's a good thing that Rust is preventing it :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to just add Send
. I'm more confident that is true than Sync
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, scratch that. PyO3's new_with_destructor
actual wraps all its arguments using Box::new
, so lifetime should be ok...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whilst it is never explicitly stated in the C Data interface, I think it is a fairly safe assumption that data can be sent between threads - i.e. it isn't relying on thread-local state or non-atomic reference counts, etc... Happy to be corrected on this front though.
I'm more confident that is true than Sync.
Sync is definitely a stronger assumption that we don't appear to need to make here, although I also struggle to devise a sane example where it wouldn't hold.
Perhaps we could get some of the people who designed PyCapsule to weigh in here, from my reading of it this appears to primarily be an API for python libraries to communicate with each other? I also am very confused by what the ownership semantics of this API are, does the PyCapsule remain the owner? This feels like it can't traverse an FFI interface safely, so I am very confused as to how this is supposed to work... Perhaps @jorisvandenbossche or @pitrou ?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a Rust programmer, but LGTM at a high level.
arrow-data/src/ffi.rs
Outdated
/// # Safety | ||
/// | ||
/// Must be passed a valid [FFI_ArrowArray]. | ||
pub unsafe extern "C" fn release_array(array: *mut FFI_ArrowArray) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this doesn't seem useful.
@WJones @milesgranger You may want to look at this. |
It is designed as a Python-specific protocol (in the Python sense: it uses well-known special methods) around the C Data Interface, and automates conversion between producers and consumers of Arrow data. There is no requirement that the producer or consumer is implemented in Python. The two main points of this protocol are:
In practice:
The PyCapsule just contains a raw pointer. The semantics of the pointer embedded in a PyCapsule are usage-dependent. In this case, the ownership semantics are those of the C Data Interface itself: the contents can be moved by setting the |
Ok, so there is an implicit contract that a PyCapsule is never used multiple times? |
No, the explicit contract is that a C Data Interface struct can only be used once :-) The PyCapsule is just a Python-safe way of transporting that struct. It does not change anything to the core semantics. |
Ok this is what I needed to know, PyCapsule has the same release behaviour as the C Data interface, the copy and custom release behaviour led me to believe it was something reusable. |
Yes, "copy" is a misnomer, that's why I suggested "move" instead as a method name. |
Should there be some unit tests for this? You can trivially create objects exposing this protocol using PyArrow as a backend, for example: |
The arrow-pyarrow-integration-testing contains some testing infrastructure for this sort of test |
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
arrow-schema/src/ffi.rs
Outdated
unsafe impl Send for FFI_ArrowSchema {} | ||
unsafe impl Sync for FFI_ArrowSchema {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we able to just add Send
. I'm more confident that is true than Sync
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential refactor for later, but it would be nice to move the Capsule imports to their own functions that aren't named as if to by PyArrow-specific. Right now this module is named for PyArrow, but the import logic will apply to other Arrow implementations as well.
Co-authored-by: Will Jones <willjones127@gmail.com>
…into kyle/arrow-ffi-copy
Unfortunately, looking at the PyO3 PyCapsule-wrapping APIs, this probably can't work, because PyO3 creates its own boxed structure for the capsule pointer: Non-Rust implementations will expect the capsule pointer to point to a This seems to make PyO3's PyCapsule wrapper generally impractical for communicating with non-Rust runtimes. Edit: scratch that, |
arrow/src/pyarrow.rs
Outdated
@@ -262,9 +263,10 @@ impl FromPyArrow for ArrayData { | |||
validate_pycapsule(array_capsule, "arrow_array")?; | |||
|
|||
let schema_ptr = unsafe { schema_capsule.reference::<FFI_ArrowSchema>() }; | |||
let array_ptr = unsafe { array_capsule.reference::<FFI_ArrowArray>() }; | |||
let array_ptr = array_capsule.pointer() as *mut FFI_ArrowArray; | |||
let array_mut = unsafe { array_ptr.as_mut() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW you could use https://doc.rust-lang.org/std/ptr/fn.replace.html as we're already playing with pointers here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in e7ed58d; is that what you meant?
Thanks! I'll try to get to integration tests tonight |
I added some tests; do those look ok? Do you want to test against old versions of pyarrow as well? We're effectively no longer testing I had to bump the python version in CI to test with pyarrow 14 at all because pyarrow 14 requires python >=3.8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests LGTM.
@kylebarron Do you plan to expose the reverse operation, for example a ToPyCapsule
trait that would allow exporting Rust data to Python without having PyArrow installed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
I think it would be good to add some test coverage for the pre arrow-14 logic, if that isn't too much effort, but happy for this to be done as a follow up PR, just let me know your preference
// Newer versions of PyArrow as well as other libraries with Arrow data implement this | ||
// method, so prefer it over _export_to_c. | ||
// See https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html | ||
if value.hasattr("__arrow_c_array__")? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will allow passing a StructArray
to a method accepting PyArrowType<RecordBatch>
provided the StructArray
isn't nullable. I don't think this is a problem, RecordBatch is a bit of an oddity, but just an observation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I'm not sure if there's a standard way in Arrow to discern between a StructArray intended to be an array, vs a StructArray intended to represent a RecordBatch
Unless there are some changes from the second part of apache/arrow#38010, to support this arrow-rs would need to add new Python classes to represent these objects and include the dunder methods, right? Is there appetite to do this, or do people feel that adding new classes would expand the scope of arrow-rs too much? For my own work in geoarrow-rs I'll be creating my own python classes from rust to represent different types of geometry arrays (as we're discussing in geoarrow/geoarrow-python#38). For the time being I need my own version of the pycapsule interface to ensure I'm not dropping the extension type (ref #4472), but contributing code upstream here ensures it's valid 😅 (and originally I thought I needed upstream changes). |
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
I added pyarrow 13 and 14 to the test matrix. So in pyarrow 13, we should see capsule-based tests be skipped but other tests pass. You might have a better suggestion for structuring that matrix, as it won't upgrade to pyarrow 15 when that's released? |
From my understanding these classes would be little more than basic shims, in which case I don't see an issue, but I could be missing something |
At first glance, I thought it would open a can of worms, but maybe not... Would we only have three shims? One per pycapsule type? So exporting a data type/field/schema would all be exported as an One thing I wonder about: if someone sees an |
Three capsules makes sense to me to match tbe corresponding method names. SchemaCapsule, ArrayCapsule and StreamCapsule perhaps. I think it might even be possible to keep them crate private 🤔
Is this form of runtime reflection common, do we need to support it, or is the correct usage implicit from context? |
Do you need the shims in arrow-rs? It might be enough to expose a function that creates and returns the capsules, and then another library that depends on arrow-rs (like geoarrow-rs) could use that to get the capsules which they can expose on their objects. |
Co-authored-by: Will Jones <willjones127@gmail.com>
Let's continue the discussion of exporting to PyCapsules here: #5079 |
PyTryInto::try_into(value.getattr("__arrow_c_schema__")?.call0()?)?; | ||
validate_pycapsule(capsule, "arrow_schema")?; | ||
|
||
let schema_ptr = unsafe { capsule.reference::<FFI_ArrowSchema>() }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, I'm curious... does it actually drop the FFI_ArrowSchema so as to call it's release pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the pycapsule not handle this, as we don't move the data out of it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, yes, that's true.
You don't need new Python classes, you would just return Python objects for the given capsules. |
Which issue does this PR close?
Closes #5067.
Rationale for this change
The PyCapsule interface provides a much safer interface for exposing memory to/from Python, and additionally will work with any arrow library in Python instead of only pyarrow.
What changes are included in this PR?
Implement
Send
onFFI_ArrowSchema
.Send
is required to be able to create a new PyCapsule with anFFI_ArrowSchema
.Conversion from
DataType
,Field
,Schema
,Array
,RecordBatch
,RecordBatchReader
, andTable
through the PyCapsule interface.It's not possible (I don't think) to implement exporting to the PyCapsule interface because no Python classes are exposed.
Are there any user-facing changes?
None