Add PyCapsule support for Arrow import and export #825
Conversation
You can see my casting implementation here. But note that's slightly different as I'm using an …
Ok, added the requested schema and unit tests for export. All that I think is left is unit tests on the import. I'm not sure if I should keep nanoarrow in the unit test; if so, I need to add it to the requirements files.
docs/source/user-guide/io/arrow.rst (outdated)
important to note that this will cause the DataFrame execution to happen, which may be a time-consuming task. That is, you will cause a :py:func:`datafusion.dataframe.DataFrame.collect` operation call to occur.
I'd suggest putting this into an "admonition" box with a warning color to make this clearer. I'm not sure how to do that in sphinx, but this is what I'm referring to in mkdocs-material: https://squidfunk.github.io/mkdocs-material/reference/admonitions/#supported-types
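For reference, reStructuredText/Sphinx has a built-in directive for exactly this; a sketch of how the note above could be wrapped (the wording here is illustrative, not taken from the PR):

```rst
.. warning::

   Exporting a DataFrame through the Arrow PyCapsule interface causes the
   DataFrame to execute, i.e. it triggers a
   :py:func:`datafusion.dataframe.DataFrame.collect` call, which may be
   time consuming.
```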
``__arrow_c_stream__`` or ``__arrow_c_array__``. For the latter, it must return a struct array. Common examples of sources from pyarrow include
For both they must emit a struct array? Any Arrow array can be passed through an `__arrow_c_stream__`. Canonically, to transfer a DataFrame you have a stream of struct arrays, where each one is unpacked to become the columns of a RecordBatch. But it doesn't have to be a struct array: you can also transfer a Series through an `__arrow_c_stream__`, where each batch in the stream iterator is just a primitive array.
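A minimal pyarrow illustration of the two shapes (assuming pyarrow 14 or newer, where these objects implement the PyCapsule protocol):

```python
import pyarrow as pa

# A Table exports a stream of struct-shaped batches: each item in the
# stream unpacks into the columns of a RecordBatch.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
table_stream = table.__arrow_c_stream__()

# A ChunkedArray (a "Series") also exports __arrow_c_stream__, but each
# item in its stream is a plain primitive array, not a struct array.
chunked = pa.chunked_array([[1.0, 2.0], [3.0]])
series_stream = chunked.__arrow_c_stream__()
```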
merged_schema.project(&project_indices)
}

fn record_batch_into_schema(
I am surprised that neither arrow-rs nor datafusion has such a utility for converting a record batch, but I did take a quick look around and didn't find anything.
Well, there is `cast`. Cast works on struct arrays, so you could make a simple wrapper around `cast` that works on `RecordBatch` by creating a struct array from the record batch. This is what I do in pyo3-arrow.
The main difference is that `cast` doesn't also project. It's not clear to me whether the PyCapsule Interface intends to support projection or not. I don't think anyone has asked.
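For illustration, the same trick sketched in Python with pyarrow (a hypothetical helper, assuming a pyarrow version that has `RecordBatch.to_struct_array` and supports struct-to-struct casts for compatible schemas):

```python
import pyarrow as pa

def cast_record_batch(batch: pa.RecordBatch, target: pa.Schema) -> pa.RecordBatch:
    """Cast a RecordBatch by round-tripping through a struct array."""
    # Pack the batch's columns into a single struct array...
    struct_array = batch.to_struct_array()
    # ...cast it to a struct type built from the target schema's fields...
    casted = struct_array.cast(pa.struct(list(target)))
    # ...and unpack the result back into a RecordBatch.
    return pa.RecordBatch.from_struct_array(casted)
```

As noted above, this only casts; it does not also project columns away.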
Since the user isn't calling the PyCapsule interface directly, it's also not clear what the user API would look like for requesting a projection via PyCapsules.
I applied your suggestions. Do either of you want a re-review before we ask to merge?
Excellent.
…g much over pyarrow
Co-authored-by: Michael J Ward <Michael-J-Ward@users.noreply.github.com>
(branch force-pushed from f1ef382 to 955d7d5)
Which issue does this PR close?
Closes #752
Rationale for this change
User requested.
What changes are included in this PR?
With this change you can import any Arrow table that implements the PyCapsule Interface using the `SessionContext.from_arrow_table` function. Additionally, PyCapsule export of `DataFrame` is added. Now any Python-based project that uses Arrow with the PyCapsule interface can directly consume a DataFusion DataFrame. The DataFrame will be executed at the point of export.
You can see a minimal example in the issue ticket.
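A rough sketch of the round trip described above (illustrative only: the table contents are made up, and the export side assumes a pyarrow version whose `pa.table` accepts PyCapsule stream producers, i.e. pyarrow 14+):

```python
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()

# Import: any object exposing __arrow_c_stream__ / __arrow_c_array__ can
# be consumed, not just a pyarrow Table.
table = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = ctx.from_arrow(table)

# Export: the DataFrame itself implements __arrow_c_stream__, so a
# PyCapsule-aware consumer such as pyarrow can read it back. This is the
# point at which the DataFrame is executed.
round_tripped = pa.table(df)
```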
Are there any user-facing changes?
This PR adds `SessionContext.from_arrow`, which serves the same purpose as `from_arrow_table` except that it takes any object that implements the required PyCapsule functions. `from_arrow_table` is now an alias for `from_arrow`.

Still to do: