-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35531: [Python] C Data Interface PyCapsule Protocol #37797
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot Will for putting this up!
I already took a look at the spec description, will try to review the implementation tomorrow as well.
One general question on that: do we generally expect to only handle with "data" objects that expose their schema through __arrow_c_schema__
? Or do we also want to enable / encourage libraries to pass actual schema objects around that just implement this interface? (here you implemented it for pyarrow.DataType/Field/Schema)
And question for ArrowArrayExportable
: it might be useful for generalization that an Array object (like our pyarrow.Array/RecordBatch
) also implements the stream interface, with just returning a single batch? That way, consumers that typically work with a stream of data (eg duckdb) could automatically support Array/RecordBatch as well, without needing specific code for this subset of objects.
|
6ec61ca
to
446606c
Compare
02fe895
to
6fff484
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some small mostly doc comments. And nice new examples section!
Can be for a follow-up as well, but should we add __arrow_c_stream__
to ChunkedArray?
@github-actions crossbow submit -g python -g wheel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, really great! Let's hope we can get it in 14.0.0, if only for the spec definition.
Revision: 51657bb Submitted crossbow builds: ursacomputing/crossbow @ actions-7f39314d16
|
This comment was marked as outdated.
This comment was marked as outdated.
…cating C Data Interface pointers
51657bb
to
87e5fc5
Compare
@github-actions crossbow submit -g python -g wheel |
Revision: 87e5fc5 Submitted crossbow builds: ursacomputing/crossbow @ actions-477302fc6e |
### Rationale for this change ### What changes are included in this PR? * A new specification for Arrow PyCapsules and related dunder methods * Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`. ### Are these changes tested? Yes, I've added various roundtrip tests for each of the types. ### Are there any user-facing changes? This introduces some new APIs and documents them. * Closes: #34031 * Closes: #35531 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d68f8e2. There were 2 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…37797) ### Rationale for this change ### What changes are included in this PR? * A new specification for Arrow PyCapsules and related dunder methods * Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`. ### Are these changes tested? Yes, I've added various roundtrip tests for each of the types. ### Are there any user-facing changes? This introduces some new APIs and documents them. * Closes: apache#34031 * Closes: apache#35531 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…37797) ### Rationale for this change ### What changes are included in this PR? * A new specification for Arrow PyCapsules and related dunder methods * Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`. ### Are these changes tested? Yes, I've added various roundtrip tests for each of the types. ### Are there any user-facing changes? This introduces some new APIs and documents them. * Closes: apache#34031 * Closes: apache#35531 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…37797) ### Rationale for this change ### What changes are included in this PR? * A new specification for Arrow PyCapsules and related dunder methods * Implementing the dunder methods for `DataType`, `Field`, `Schema`, `Array`, `RecordBatch`, `Table`, and `RecordBatchReader`. ### Are these changes tested? Yes, I've added various roundtrip tests for each of the types. ### Are there any user-facing changes? This introduces some new APIs and documents them. * Closes: apache#34031 * Closes: apache#35531 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
…psule (#41538) ### Rationale for this change Fixes the dropped `pa.schema` metadata reported in #38575, which was introduced in #37797. ### What changes are included in this PR? Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`. ### Are these changes tested? Yes - added `metadata` to the existing test. ### Are there any user-facing changes? I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update): **This PR contains a "Critical Fix".** * GitHub Issue: #38575 Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…m PyCapsule (apache#41538) ### Rationale for this change Fixes the dropped `pa.schema` metadata reported in apache#38575, which was introduced in apache#37797. ### What changes are included in this PR? Passes through the `metadata` to the short-circuited `Schema` created with `_import_from_c_capsule`. ### Are these changes tested? Yes - added `metadata` to the existing test. ### Are there any user-facing changes? I'm not sure this quite rises to the `(b) a bug that caused incorrect or invalid data to be produced,` condition, but I added that note to be safe since the resulting schema is "incorrect" (and broke some round-trip tests on my end after a pyarrow update): **This PR contains a "Critical Fix".** * GitHub Issue: apache#38575 Lead-authored-by: Jacob Hayes <jacob.r.hayes@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
… Data support (#40708) ### Rationale for this change We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (#35531 / #37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972). This expands the Python exposure of the interface with support for the newer Device structs. ### What changes are included in this PR? Update the specification to defined two additional dunders: * `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name * `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" ### Are these changes tested? Spec-only change * GitHub Issue: #38325 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Dewey Dunnington <dewey@dunnington.ca> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…Device Data support (apache#40708) ### Rationale for this change We defined a protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and dunder methods `__arrow_c_schema/array/stream__` (apache#35531 / apache#37797). We also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (apache#34972). This expands the Python exposure of the interface with support for the newer Device structs. ### What changes are included in this PR? Update the specification to defined two additional dunders: * `__arrow_c_device_array__` returns a pair of PyCapsules containing a C ArrowSchema and ArrowDeviceArray, where the latter uses "arrow_device_array" for the capsule name * `__arrow_c_device_stream__` returns a PyCapsule containing a C ArrowDeviceArrayStream, where the capsule must have a name of "arrow_device_array_stream" ### Are these changes tested? Spec-only change * GitHub Issue: apache#38325 Lead-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Co-authored-by: Dewey Dunnington <dewey@dunnington.ca> Co-authored-by: Antoine Pitrou <pitrou@free.fr> Co-authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Rationale for this change
What changes are included in this PR?
DataType
,Field
,Schema
,Array
,RecordBatch
,Table
, andRecordBatchReader
.Are these changes tested?
Yes, I've added various roundtrip tests for each of the types.
Are there any user-facing changes?
This introduces some new APIs and documents them.