feat: Support Arrow PyCapsule Interface for export #786
Yeah I think that makes sense! There's not much you can do if the underlying dataframe doesn't support it. If you had some way to access an arrow table from the underlying dataframe you could do something more, but I think this is good
cool, thanks!
we do have |
In that case I'd suggest

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        try:
            return self._compliant_frame._native_frame.__arrow_c_stream__(
                requested_schema=requested_schema
            )
        except AttributeError:
            pyarrow_table = self.to_arrow()
            return pyarrow_table.__arrow_c_stream__(requested_schema=requested_schema)

So in the first case, we can use the source's implementation of the pycapsule interface and pyarrow doesn't need to be installed, while in the second case we can still ensure the method never raises an exception.
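For illustration only, here is a self-contained sketch of that delegation pattern using plain-Python stand-ins, so it runs without pyarrow installed. All class names (`NativeWithCapsule`, `NativeWithoutCapsule`, `FakeArrowTable`, `Frame`) are hypothetical and the returned strings merely stand in for real capsules; this is not the Narwhals implementation. It looks the method up with `getattr` instead of catching `AttributeError`, which is one possible variation on the suggestion above.

```python
from __future__ import annotations


class NativeWithCapsule:
    """Hypothetical native frame that implements the PyCapsule export method."""

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        return "capsule-from-native"  # placeholder for a real PyCapsule


class NativeWithoutCapsule:
    """Hypothetical native frame that lacks `__arrow_c_stream__`."""


class FakeArrowTable:
    """Hypothetical stand-in for the table returned by `to_arrow()`."""

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        return "capsule-from-pyarrow"  # placeholder for a real PyCapsule


class Frame:
    """Hypothetical wrapper that delegates to the native frame when it can."""

    def __init__(self, native: object) -> None:
        self._native = native

    def to_arrow(self) -> FakeArrowTable:
        return FakeArrowTable()

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        # Prefer the native implementation if the wrapped object exposes one.
        native_export = getattr(self._native, "__arrow_c_stream__", None)
        if native_export is not None:
            return native_export(requested_schema=requested_schema)
        # Otherwise fall back to converting through an Arrow table.
        return self.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)
```

Looking the attribute up first means an `AttributeError` raised inside a native implementation is not silently turned into the fallback path; whether that trade-off matters here is a judgment call.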
thanks! though I think the user could call this themselves? e.g.

    df: nw.DataFrame
    try:
        result = df.__arrow_c_stream__(requested_schema=requested_schema)
    except AttributeError:
        result = df.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)
Well, one of my primary arguments for the pycapsule interface is that it allows an ecosystem of data producers and consumers to interoperate without any knowledge of the other, solely by looking for an |
I was thinking more of the Vegafusion case - I think it's better for them if they choose to explicitly call I think any library developer using Narwhals (e.g. vegafusion) would and should know about Narwhals, whereas lower-level libraries like PyArrow shouldn't:
In any case, as they often say, "in open source, 'no' is temporary but 'yes' is forever" - doubly so with our stable api policy. So, as the current implementation looks good to you, I'd say: let's start with that, and we can always loosen it later if necessary. Thanks for your review and input, much appreciated!
As a general note, consumers can't do this because consumers don't have a way to know whether the source is a table that already exists in memory or whether it's a stream that can only be called once. E.g. a pyarrow |
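The difference between a source that already sits in memory and a stream that can be read only once can be sketched with plain-Python stand-ins. Both classes below are hypothetical, and they return ordinary lists in place of real capsules; the point is only that a consumer calling the same method cannot tell from the outside which kind of source it has.

```python
from __future__ import annotations


class InMemoryTable:
    """Hypothetical source whose data already exists in memory."""

    def __init__(self, rows: list[int]) -> None:
        self._rows = list(rows)

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> list[int]:
        # Exporting does not consume anything, so repeated calls are fine.
        return list(self._rows)


class OneShotStream:
    """Hypothetical source backed by an iterator that can be drained once."""

    def __init__(self, rows: list[int]) -> None:
        self._iterator = iter(rows)
        self._consumed = False

    def __arrow_c_stream__(self, requested_schema: object | None = None) -> list[int]:
        # A second export has nothing left to hand over.
        if self._consumed:
            raise RuntimeError("stream already consumed")
        self._consumed = True
        return list(self._iterator)
```

Because both objects expose the same method, any retry or fallback logic is safer on the producer side, which knows what it is wrapping.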
Ah nice, thanks for explaining! In that case, I'm leaning more towards your suggestion - it would also mean being able to support this for versions of pandas prior to 2.2 but which support converting to pyarrow table |
        raise ModuleNotFoundError(msg) from exc
    if parse_version(pa.__version__) < (14, 0):  # pragma: no cover
        msg = f"PyArrow>=14.0.0 is required for `__arrow_c_stream__` for object of type {type(native_series)}"
        raise ModuleNotFoundError(msg)
    ca = pa.chunked_array([self.to_arrow()])
I think this might require pyarrow 15
in pandas the requirement is PyArrow 14+ (I also just ran the tests with pyarrow 13 and 14 - the former fails, the latter passes)
ah sorry, that's for DataFrame. looks like it's even PyArrow 16+ for chunkedarray?
It was added to pa.chunked_array in a later release, yes. I think it was here: apache/arrow#40818
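Since the minimum version apparently differs between the table and chunked-array paths, one way to keep the check reusable is to pass the minimum in as a parameter. The sketch below is an assumption-laden illustration: `parse_version` here is a simplified local helper (not necessarily the one used in the PR), `require_min_version` is a hypothetical name, and the actual minimum for chunked arrays should be confirmed against the PyArrow release notes rather than taken from this example.

```python
from __future__ import annotations

import re


def parse_version(version: str) -> tuple[int, ...]:
    """Simplified helper: keep the leading numeric parts, e.g. '16.0.0.dev1' -> (16, 0, 0)."""
    parts: list[int] = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev1'
        parts.append(int(match.group()))
    return tuple(parts)


def require_min_version(installed: str, minimum: tuple[int, ...], feature: str) -> None:
    """Hypothetical guard: raise if `installed` is older than `minimum`."""
    if parse_version(installed) < minimum:
        required = ".".join(str(part) for part in minimum)
        msg = f"PyArrow>={required} is required for {feature}, found {installed}"
        raise ModuleNotFoundError(msg)
```

A caller could then use one minimum for the DataFrame export and a different one for the Series export, once the right numbers are confirmed.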
Wow I am learning a bunch from this PR! @MarcoGorelli probably worth adding these methods in the api docs as well?!
thanks Kyle for your help! cool, let's ship this
closes #784
@kylebarron fancy taking a look to see if this is what needs doing / if I've understood the assignment?