feat(python): Support PyCapsule Interface in DataFrame & Series constructors #17693
Conversation
Wouldn't it be simpler to import it as a Series and then convert it to a DataFrame with …?
Thanks for the advice! That is indeed easier, because we only have to touch the capsules from the Series impl.
Thanks @kylebarron. Can you fix the tests?
Codecov Report — Attention: Patch coverage is …

Additional details and impacted files:

@@ Coverage Diff @@
##             main    #17693      +/-  ##
==========================================
+ Coverage   80.40%    80.49%   +0.09%
==========================================
  Files        1502      1504       +2
  Lines      197041    197139      +98
  Branches     2794      2810      +16
==========================================
+ Hits       158439    158696     +257
+ Misses      38088     37921     -167
- Partials      514       522       +8
I had to implement a workaround for empty streams because … It would be ideal if that impl were fixed; I tried to fix it, but there are quite a few assumptions that the vec is non-empty. For now the import code just calls …
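The empty-stream workaround described above could be sketched as follows. This is a hedged, pure-Python illustration, not the actual Rust implementation: `import_stream`, `make_empty`, and `concat` are hypothetical names standing in for the real construction logic.

```python
def import_stream(chunks_iter, make_empty, concat):
    """Collect chunks from an Arrow-style stream, falling back to an
    explicitly constructed empty container when no chunks arrive.

    `make_empty` and `concat` are hypothetical callbacks; in the real
    code the concat impl assumes a non-empty vec of chunks, hence the
    separate empty-stream branch.
    """
    chunks = list(chunks_iter)
    if not chunks:
        # Workaround: build the empty case separately instead of
        # handing an empty chunk list to the concat impl.
        return make_empty()
    return concat(chunks)
```

The design point is simply that the empty case never reaches the concatenation path, which is where the non-empty assumption lives.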
The checks for PyCapsule objects were also moved after the pyarrow- and pandas-specific checks. Since those checks are already in place, pyarrow and pandas objects will always be imported the same way (via the existing pyarrow- and pandas-specific APIs) regardless of their versions, while objects from any other library will go through the new PyCapsule API.
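The dispatch order described above can be sketched in pure Python. This is an assumption-laden illustration, not the actual constructor code: the module-name string checks and return labels are hypothetical stand-ins for the real library-specific branches.

```python
def select_import_path(obj):
    """Sketch of the constructor's dispatch order: pyarrow- and
    pandas-specific handling runs first, and the generic PyCapsule
    protocol check runs last, so only non-pyarrow/non-pandas objects
    take the new path."""
    mod = type(obj).__module__.split(".")[0]
    if mod == "pyarrow":
        return "pyarrow-specific"
    if mod == "pandas":
        return "pandas-specific"
    if hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule"
    return "fallback"
```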
) -> PyResult<(arrow::datatypes::Field, Box<dyn Array>)> {
    validate_pycapsule_name(schema_capsule, "arrow_schema")?;
    validate_pycapsule_name(array_capsule, "arrow_array")?;
Can you add a `// SAFETY` comment explaining which invariants must hold here?
Take a look and see if those safety comments are OK.
Progress towards the import side of #12530.
This adds a check in the constructors of `DataFrame` and `Series` for input objects that have an `__arrow_c_array__` or `__arrow_c_stream__` method. This means that polars can import a variety of Arrow-based objects via the Arrow PyCapsule Interface.

For reference, the table below shows the pyarrow objects that implement each method. The pyarrow objects are only examples; crucially, this also works with any other Python Arrow implementation, such as `ibis.Table`, `pandas.DataFrame` (v2.2 and later), `nanoarrow` objects, etc.

| Object | Method |
| --- | --- |
| `pyarrow.Array` | `__arrow_c_array__` |
| `pyarrow.RecordBatch` | `__arrow_c_array__` |
| `pyarrow.ChunkedArray` | `__arrow_c_stream__` |
| `pyarrow.Table` | `__arrow_c_stream__` |
| `pyarrow.RecordBatchReader` | `__arrow_c_stream__` |
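To illustrate the protocol, any object exposing one of these dunder methods qualifies for import, whichever library it comes from. The following is a minimal, dependency-free sketch: `FakeStream` and `supports_pycapsule_import` are hypothetical names modeling only the method's presence, since a real `__arrow_c_stream__` must return a PyCapsule wrapping an `ArrowArrayStream`.

```python
class FakeStream:
    """Hypothetical object exposing the Arrow C stream protocol.
    A real implementation returns a PyCapsule named "arrow_array_stream";
    here we only model the method's existence."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("would return an arrow_array_stream capsule")


def supports_pycapsule_import(obj):
    # The constructor check described above: either the single-array or
    # the stream flavor of the interface is enough to qualify.
    return hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
```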
Note that this short-circuits pyarrow-specific handling. If desired, this check could run after the checks for known pyarrow objects.

The code can be cleaned up a bit (and some `unwrap`s removed/fixed), but it's working, so I figure it's worth putting this up for feedback on the overall approach.