
[Python] Add Python protocol for the Arrow C (Data/Stream) Interface #35531

Closed
jorisvandenbossche opened this issue May 10, 2023 · 30 comments · Fixed by #37797

@jorisvandenbossche (Member)

Context: we want Arrow to be usable as the format for sharing data between (Python) libraries/applications, ideally in a generic way that doesn't require hardcoding for specific libraries.
We already have __arrow_array__ for objects that know how to convert themselves to a pyarrow.Array or ChunkedArray. But this protocol returns actual pyarrow objects (so a better name might have been __pyarrow_array__ ..), and is thus tied to the pyarrow library (and also only covers arrays, not tables/batches). For projects that have an (optional) dependency on pyarrow, that is fine, but we want to avoid making it a requirement (e.g. for nanoarrow). However, we also have the Arrow C Data Interface as a more generic way to share Arrow data in-memory, focusing on the actual Arrow spec without relying on a specific library implementation.
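As a minimal sketch of that existing hook (the class and data here are illustrative, but the __arrow_array__(self, type=None) signature is the documented pyarrow protocol):

import pyarrow as pa

class MySeries:
    def __init__(self, values):
        self._values = values

    def __arrow_array__(self, type=None):
        # pa.array(MySeries([1, 2, 3])) calls this hook and uses its result,
        # so the return value is by construction a pyarrow object.
        return pa.array(self._values, type=type)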

Right now, the way to use the C Interface is through the _export_to_c and _import_from_c methods.
But those methods are 1) private, advanced APIs (although we could of course decide to make them "official", since many projects are already using them, and document them that way), and 2) again specific to pyarrow (I don't think other projects have adopted the same names).
So other projects (polars, datafusion, duckdb, etc.) use those to convert from pyarrow to their own representation. But those projects don't have a similar API to use the C Data Interface to share their data with another library (e.g. with pyarrow, or polars to duckdb, ...).
If we had a standard Python protocol (dunder) method for this, libraries could implement support for consuming (and producing) objects that expose their data through the Arrow C Interface without having to hardcode for specific implementations (as those libraries currently do for pyarrow).

The most generic protocol would be one supporting the Stream interface, and that could look something like this:

class MyArrowCompatibleObject:

    def __arrow_c_stream__(self) -> PyCapsule:
        """
        Return a PyCapsule wrapping an ArrowArrayStream struct.
        """
        ...

And in addition we could have variants that do the same for the other structs, such as __arrow_c_data__ or __arrow_c_array__, __arrow_c_schema__, ..
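As a rough sketch of those variants (names and return shapes are open questions at this point; returning a (schema, array) pair of capsules for the array variant is one possibility, and the shape eventually adopted in #37797):

class MyArrowCompatibleObject:

    def __arrow_c_schema__(self) -> PyCapsule:
        """Return a PyCapsule wrapping an ArrowSchema struct."""
        ...

    def __arrow_c_array__(self) -> tuple[PyCapsule, PyCapsule]:
        """Return a pair of PyCapsules wrapping an ArrowSchema and an
        ArrowArray struct (a consumer needs both to interpret the data)."""
        ...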

Some design questions:

  • For the mechanics of the method, I would propose to use PyCapsules instead of raw pointers, as described here: [Python] Use PyCapsule for communicating C Data Interface pointers at the Python level #34031
  • Which set of protocol methods do we need? Is only a stream version sufficient (since a single array can always be put in a stream of one array)? Or would it be useful (and simpler for some applications) to also have an Array version?
    • But what would an array version return exactly? (since it needs to return both the ArrowArray and the ArrowSchema)
  • With the ongoing discussion about generalizing the C Interface to other devices (GH-34971: [Format] Add non-CPU version of C Data Interface #34972), should we focus here on the current interfaces, or should we directly use the Device versions?
  • Do we want to distinguish between an array and a tabular version? From the C Interface point of view, that's all the same: it's just an ArrowArray. But, for example, we currently define _export_to_c on RecordBatch and RecordBatchReader, where you know this will always return a StructArray representation of one batch, vs the same method on Array, where it can return an array of any type. It could be nice to distinguish those use cases for consumers.
@pitrou (Member) commented May 10, 2023

  • But what would an array version return exactly? (since it needs to return both the ArrowArray and the ArrowSchema)

It doesn't need to. You can have DataType.__arrow_c_schema__ and Array.__arrow_c_array__ (also for Schema and RecordBatch, respectively).

It could be nice to distinguish those use cases for consumers.

I'm not sure that's useful. @lidavidm Thoughts?

@pitrou (Member) commented May 10, 2023

Also, this proposal doesn't dwell on the consumer side. Would there be higher-level APIs to construct Array and RecordBatch from those capsules?

@jorisvandenbossche (Member, Author)

Also, this proposal doesn't dwell on the consumer side. Would there be higher-level APIs to construct Array and RecordBatch from those capsules?

Yes, indeed, I didn't touch on that aspect yet. I think it could certainly be useful, but I thought to start with the producer side of things. And some consumers might already have an entry point that could be reused for this (for example, duckdb already implicitly reads from any object that is a pandas DataFrame, pyarrow Table, RecordBatch, Dataset/Scanner, RecordBatchReader, polars DataFrame, ...., and they could just extend this to any object implementing this protocol).
Making the parallel with DLPack again: they recommend that libraries implement a from_dlpack function as the consumer interface. So we could also have such a recommendation here (for example from_arrow, although that might need to differentiate between stream/array/schema), but that's maybe less essential initially? (that's more about user-facing API)
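A minimal sketch of such a consumer entry point (from_arrow and _import_stream are hypothetical names, not part of this proposal):

def from_arrow(obj):
    # Accept anything exposing the stream protocol, regardless of
    # which library it comes from.
    if hasattr(obj, "__arrow_c_stream__"):
        capsule = obj.__arrow_c_stream__()
        return _import_stream(capsule)  # library-specific import step
    raise TypeError(
        f"{type(obj).__name__} does not expose the Arrow C stream interface"
    )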

@pitrou (Member) commented May 10, 2023

Yes, I don't think we have to recommend a consumer API. But we'll have to choose one for ourselves ;-)

@lidavidm (Member)

It could be nice to distinguish those use cases for consumers.

I'm not sure that's useful. @lidavidm Thoughts?

I'm also not sure it's useful, but it seems we could define __arrow_c_array__ after the fact if we find a use case.

@paleolimbot (Member)

Just a note that I think __arrow_c_array__ and __arrow_c_schema__ are rather essential (I'd build nanoarrow's Python support on top of them). I think it's fairly uncontroversial that their behaviour should align with __arrow_c_array_stream__. A concrete example of something that might implement __arrow_c_schema__ is a GeoArrow type representation... currently those are stored as something more like an integer type ID because it's faster. Substrait types could also implement it, or maybe pandas dtypes. It would be rather useful if numpy/pandas.Series implemented __arrow_c_array__, no?

I don't know if it was mentioned in the discussion, but I think it's fairly important that the PyCapsule have a finalizer that calls the release() callback (if non-null), and to have that documented. I assume that's the point of using the PyCapsule, but I haven't discussed that with anybody except maybe in passing with Joris.

Do we want to distinguish between an array and a tabular version?

Most ways that I know about to create an ArrowArray (pa.array(), pa.record_batch(), arrow::as_arrow_array(), etc.) also accept a type or schema. Above the level of "array or table", there are certainly objects whose "one true Arrow type" is ambiguous. You could do __arrow_c_array__(self, schema=None) and __arrow_c_array_stream__(self, schema=None). That gets a little hard because then either the producer or the consumer has to do some sort of equality check or validation.

Did you envision that __arrow_c_stream__() could return things that are not tables? They certainly can and do outside pyarrow (I believe the Rust implementation supports it... nanoarrow in R does too). It's a fairly useful representation of a ChunkedArray, since there's no other official ABI-stable way to do that.

@jorisvandenbossche (Member, Author) commented May 24, 2023

Yes, I don't think we have to recommend a consumer API. But we'll have to choose one for ourselves ;-)

Indeed. And for pyarrow, it could also be something like RecordBatchReader.from_arrow_stream (or from_arrow_c_stream, or other name), and similarly for other objects, to keep it consistent with existing from_ methods.

[Do we want to distinguish between an array and a tabular version? ...] It could be nice to distinguish those use cases for consumers.

I'm not sure that's useful. @lidavidm Thoughts?

I'm also not sure it's useful, but it seems we could define __arrow_c_array__ after the fact if we find a use case.

To clarify this part a bit, let's assume we are talking about the ArrowArray version to keep it simple (not the stream). Currently, a pyarrow.Array can be exported to an ArrowArray, and a pyarrow.RecordBatch as well (but in the second case, you know you always have a struct type).
The C Interface itself doesn't distinguish between the two (and that's fine), but in practice the interface is used for both "types" of data (array vs tabular). And for a consumer, I can imagine it would be useful to distinguish them. For example, assume that pandas has a function to construct a pandas.DataFrame from any object that supports this protocol. In that case, pandas might only be interested in data that logically represents tabular data, and not an array (because then you don't have column names, might have top-level nulls, etc.). If there is only a single __arrow_c_array__, pandas could of course check whether the data it received matches the requirements for tabular data (i.e. is a struct array and has no validity bitmap). But if there were two protocol methods (e.g. __arrow_c_array__ and __arrow_c_batch__), it could check only for objects that define the second method (and thereby declare themselves as tabular data).
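A sketch of the two consumer strategies being compared (__arrow_c_batch__ and is_tabular_struct_array are hypothetical here):

def accepts_as_dataframe(obj):
    # With a dedicated dunder, producers declare themselves tabular.
    if hasattr(obj, "__arrow_c_batch__"):
        return True
    # With only __arrow_c_array__, the consumer has to inspect the export
    # (struct type, no top-level validity bitmap).
    if hasattr(obj, "__arrow_c_array__"):
        return is_tabular_struct_array(obj.__arrow_c_array__())
    return False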

Did you envision that __arrow_c_stream__() could return things that are not tables? They certainly can and do outside pyarrow (I believe the Rust implementation supports it... nanoarrow in R does too). It's a fairly useful representation of a ChunkedArray, since there's no other official ABI-stable way to do that.

Yes, it currently essentially returns an array, not a table. We just mostly use it for tables in practice.

As a concrete example: in the arrow-rs implementation, the RecordBatch conversion to/from pyarrow actually iterates over each field, converting field by field via the C interface on each array, instead of making a single C interface call with a struct array for the full RecordBatch (https://github.com/apache/arrow-rs/blob/3adca539ad9e1b27892a5ef38ac2780aff4c0bff/arrow/src/pyarrow.rs#L167-L204)
(EDIT: this example is for the array interface, not the stream interface, though. It might be true that in practice the stream interface is only being used for tabular data)

@jorisvandenbossche (Member, Author)

I don't know if it was mentioned in the discussion, but I think it's fairly important that the PyCapsule have a finalizer that calls the release() callback (if non-null), and to have that documented.

Yes, that's certainly the idea, and discussed in #34031

Most ways that I know about to create an ArrowArray (pa.array(), pa.record_batch(), arrow::as_arrow_array(), etc.) also accept a type or schema. ..

That's a good question. Other protocols like __array__ or __arrow_array__ (which are library-specific: numpy and pyarrow, respectively) also support this. But more similar protocols like __array_interface__ or __dlpack__ (which both also share pointers to buffers, library-independently) don't.

I think one aspect here is that this second set of methods assumes you "just" give access to the actual, underlying data, without doing any conversion (and so are always zero-copy), or otherwise raise an error if the type is not supported through the protocol (so there is never a question of what the "correct" type would be for the C interface).
In the DataFrame world there is more variation in data types, though, so this might be less straightforward. You will much more easily end up in a situation where the DataFrame object has a column of data that is not exactly / natively in Arrow memory, and in those cases there might indeed be some ambiguity about which type should be used. Or whether such a conversion should be supported at all, or rather an error raised.

@pitrou (Member) commented May 24, 2023

But if there were two protocol methods (e.g. __arrow_c_array__ and __arrow_c_batch__), it could check only for objects that define the second method (and thereby declare themselves as tabular data)

That doesn't sound particularly useful? Especially as Pandas will probably have to check for other things (such as the datatypes it supports).

@paleolimbot (Member)

Or whether such a conversion should be supported at all, or rather an error raised.

I think you could do both via schema=None as the default? That's the case most of the time anyway ("just get me this array").

class Integerish:

    def __arrow_c_array__(self, schema=None):
        # Minimal producer: only the default (integer) export is supported,
        # so any requested schema is rejected.
        if schema is not None:
            raise ValueError("Only default export supported")

        return make_array_capsule(self), make_int_schema_capsule()

Of course, if schema happens to be the correct type this will still error. In R/nanoarrow I also have infer_nanoarrow_schema() (the type the array would be if it were requested), so a consumer that, say, only supports signed integer types can choose the appropriate signed type if __arrow_c_schema__() returns an unsigned one. A producer like pyarrow can cast no problem (more minimal producers would probably just error).
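On the consumer side, the negotiation could look like this (a sketch; supports and pick_supported_schema are hypothetical consumer helpers, and the keyword and tuple order follow the Integerish example above):

def import_with_negotiation(obj):
    schema = obj.__arrow_c_schema__()  # the type the export would have by default
    requested = None if supports(schema) else pick_supported_schema(schema)
    # A minimal producer may raise for any non-None request;
    # a richer producer (e.g. pyarrow) could cast instead.
    return obj.__arrow_c_array__(schema=requested)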

@wjones127 (Member)

I like Dewey's suggestion of having an API to request casting of types you don't know how to use, given there is also an API to get the zero-copy schema. This seems especially useful for Arrow "types" that are just different encodings of the same logical type, such as Utf8 / LargeUtf8 (and soon StringView).

@pitrou (Member) commented Sep 6, 2023

Is this issue a duplicate of #34031?

@paleolimbot (Member)

I think so!

@paleolimbot (Member)

I suppose the other title is more specific to the PyCapsule than the interface, but there seems to be general agreement that the protocol methods should return a PyCapsule?

@wjones127 (Member)

To move this forward, I tried to capture what's been discussed so far into a coherent draft specification.

https://docs.google.com/document/d/1Xjh-N7tQTfc2xowxsEv1Pymx4grvsiSEBQizDUXAeNs/edit?usp=sharing

@paleolimbot I'm particularly curious if I understood your idea of passing a requested schema in correctly. LMK what you think.

If the folks on this thread so far are pleased with this, I'm happy to make a PR and share on the mailing list.

@paleolimbot (Member)

Thank you for writing this up! You nailed what I had in mind with respect to schema negotiation. What you have here strikes (in my opinion) the perfect balance of easy-to-use (because any complexity of usage only comes up if the caller chooses to use the requested_schema) and easy-to-implement (because the producer is free to ignore the schema argument).

FWIW, the order I usually see is array then schema (e.g., for the return type of __arrow_c_array__), which is also the order you currently pass to _export_to_c(array_addr, schema_addr).

I don't know if it's worth noting that you can differentiate between a "schema-like object" and an "array-like object" by checking whether it implements __arrow_c_array__? If you have a function that wants to accept a "schema" in the most generic way possible (e.g., there's a function in geoarrow-c that initializes a compute kernel based on the input arguments' data types), you probably want to error if somebody tries to pass an Array. In R/nanoarrow I use two generics to deal with this (infer_nanoarrow_schema() and as_nanoarrow_schema()), but I don't think you need to/should do that here.
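A sketch of such a check (as_schema_capsule is a hypothetical helper):

def as_schema_capsule(obj):
    # An array-like object would also implement __arrow_c_array__;
    # a purely schema-like object (DataType, Field, Schema) would not.
    if hasattr(obj, "__arrow_c_array__"):
        raise TypeError("expected a type/schema-like object, got array-like data")
    return obj.__arrow_c_schema__()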

@pitrou (Member) commented Sep 19, 2023

@wjones127 That looks basically good to me. I would make the requested_schema argument mandatory (potentially None).

Relatedly, I've just opened python/cpython#109562

@jorisvandenbossche (Member, Author)

Thanks a lot @wjones127 for picking this up! The proposal looks solid to me.

One open item from the top post that I want to mention again: what to do with the Device support that has in the meantime landed? Do we directly use those structs instead? Probably not, because they are not yet widely supported, so exclusively using them would hinder adoption of this protocol. But it would be good to at least think about it, so we can ensure we can later expand the protocol in a compatible way if there is demand for handling different device types as well.
(I have to look in more detail when I have time, but from a quick look, dlpack seems to have a __dlpack_device__ to inspect that.)

@wjones127 (Member)

Do we directly use those structs instead? Probably not, because they are not yet widely supported, so exclusively using them would hinder adoption of this protocol.

I suppose one could argue their low adoption is partly because we haven't done enough to push non-CPU data as a first class thing in Arrow. I'm open to making the protocol implementation always return the device variants instead. It shouldn't be much harder to construct the CPU variants than using the normal C Data Interface.

Although for the device interface, I'm not sure if there is expected to be additional negotiation for devices. For example, could a caller request the buffers be moved to CPU if they aren't already there? Or does that not make sense?

@paleolimbot (Member)

I would be ever-so-slightly in favour of not using the device version for the initial protocol but I also agree that it's worth ensuring that it is implemented in such a way that it doesn't block a future protocol that supports the device-enabled version. I don't think that it would be all that difficult for producers where this matters to implement a (hypothetical, future) __arrow_c_device_array__ and/or __arrow_c_device_array_stream__; however, converting those structures to the equivalent non-device versions in C is fairly difficult (would require something like nanoarrow or much wider support for working with those structs in the ecosystem).

If a future consumer wants a device array but is given an object that only implements __arrow_c_array__ or __arrow_c_array_stream__, chances are they have an implementation at their disposal that can make that conversion (whereas the reverse is less likely to be true).

@pitrou (Member) commented Sep 19, 2023

The device interface is still experimental at this point while the C Data Interface is stable. Besides, most developers do not have any experience with non-CPU devices, making the device interface more difficult to implement for them.

@wjones127 (Member)

If there's no enthusiasm for the device interface here right now, I'm fine with deferring that to an extension of this API once it's established.

@jorisvandenbossche (Member, Author)

For reference, @wjones127 opened a PR describing what is discussed above in a spec document, with an implementation for pyarrow -> #37797

@jorisvandenbossche (Member, Author)

Pinging some people from libraries that already use the Arrow C Data Interface to consume (or produce) Arrow data, typically via the _export_to_c method to get the C struct pointers. Since this proposal targets exactly those use cases (and long term, ideally, people move to this protocol instead of relying on the pyarrow-specific _export_to_c), I'm letting you know in case you have feedback on the general proposal (see the PyCapsuleInterface.rst file in PR #37797 for the most up-to-date description), or if you see any potential problem in adopting it.

cc @Mytherin @pdet for duckdb: you currently call _export_to_c on pyarrow objects in your C++ code; that could be replaced by this protocol (and then also wouldn't be limited to pyarrow objects)

@wjones127 it seems you have been recently committing to the arrow-rs code that does this conversion (https://github.com/apache/arrow-rs/blob/master/arrow/src/pyarrow.rs)

@xwu99 for xgboost using the C interface to support Arrow data (dmlc/xgboost@613ec36)

@ritchie46 for polars (https://github.com/pola-rs/polars/blob/main/py-polars/src/arrow_interop/to_rust.rs and https://github.com/pola-rs/pyo3-polars/blob/main/pyo3-polars/src/ffi/to_rust.rs)

@amunra for py-questdb-client (https://github.com/questdb/py-questdb-client/blob/4584366f6afafcdac4f860354c48b78da8589eb4/src/questdb/dataframe.pxi#L808)

Some other places in the Arrow project itself where we also want to adopt this are nanoarrow, adbc, the R package, and arrow-rs.

@paleolimbot (Member)

Just pointing to the nanoarrow/Python example where I am excited to replace the existing _export_to_c()! https://github.com/apache/arrow-nanoarrow/blob/main/python/src/nanoarrow/lib.py#L21-L70

@amunra commented Sep 28, 2023

Good idea! This API should be public and documented, and dunder methods are a great way to do it. There should also be an equivalent API to do the mirror opposite: C ptr to pyarrow.

py-questdb-client uses the C API to iterate through pandas DataFrames quickly. We want to support a wide range of versions for maximum compatibility. In other words, we rely on duck-typing rather than on specific dependency versions.

That said:

There ought to be documentation on how to support both APIs (via duck typing) and any differences between them. E.g. what is a PyCapsule?

Nice efforts!

@pitrou (Member) commented Sep 28, 2023

There ought to be documentation on how to support both APIs (via duck typing) and any differences between them. E.g. what is a PyCapsule?

The APIs based on raw C pointers (_export_to_c and _import_from_c) are internal APIs, and their use is unsafe: first, because they are entirely untyped (the C pointer is passed as a Python integer); second, because the exported pointer does not release its pointee when it goes out of scope. In other words, a call to _export_to_c that is not followed by _import_from_c (for example because an exception happened in between) leaks the exported schema/array.

The goal of the PyCapsule-based protocols is to 1) be reasonably type-safe, and 2) ensure proper memory deallocation when the PyCapsule goes out of scope.

The documentation should probably provide examples of how to deal with the PyCapsule objects: 1) in Cython, 2) in pure C.
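For illustration, a pure-Python sketch of unwrapping a stream capsule via ctypes (not the Cython/C examples meant above; the "arrow_array_stream" capsule name follows the spec in #37797):

import ctypes

ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

def stream_pointer(obj):
    capsule = obj.__arrow_c_stream__()
    ptr = ctypes.pythonapi.PyCapsule_GetPointer(capsule, b"arrow_array_stream")
    # Keep the capsule alive while the pointer is in use: its destructor
    # is what calls the struct's release() callback.
    return capsule, ptr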

@wjones127 (Member)

FYI I've created rendered docs for the proposed protocol here: http://crossbow.voltrondata.com/pr_docs/37797/format/CDataInterface/PyCapsuleInterface.html

jorisvandenbossche added this to the 14.0.0 milestone Oct 11, 2023
raulcd removed this from the 14.0.0 milestone Oct 18, 2023
@raulcd (Member) commented Oct 18, 2023

I am creating RC0 at the moment; if there are further release candidates we can try to include it.

pitrou pushed a commit that referenced this issue Oct 18, 2023
### Rationale for this change

### What changes are included in this PR?

* A new specification for Arrow PyCapsules and related dunder methods
* Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`.

### Are these changes tested?

Yes, I've added various roundtrip tests for each of the types.

### Are there any user-facing changes?

This introduces some new APIs and documents them.

* Closes: #34031
* Closes: #35531

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou added this to the 15.0.0 milestone Oct 18, 2023
jorisvandenbossche modified the milestones: 15.0.0, 14.0.0 Oct 18, 2023
raulcd pushed a commit that referenced this issue Oct 18, 2023
@jorisvandenbossche (Member, Author)

I opened two follow-up issues now that the experimental spec and part of the pyarrow implementation are merged:

JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
jorisvandenbossche added a commit that referenced this issue Jun 26, 2024
… Data support (#40708)

### Rationale for this change

We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797).

We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

This expands the Python exposure of the interface with support for the newer Device structs.

### What changes are included in this PR?

Update the specification to define two additional dunders:

* `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name
* `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" 

### Are these changes tested?

Spec-only change

* GitHub Issue: #38325

Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Jul 9, 2024