feat: Support Arrow PyCapsule Interface for export #786

MarcoGorelli · 2024-08-13T15:23:30Z

closes #784

@kylebarron fancy taking a look to see if this is what needs doing / if I've understood the assignment?

kylebarron

Yeah I think that makes sense! There's not much you can do if the underlying dataframe doesn't support it. If you had some way to access an arrow table from the underlying dataframe you could do something more, but I think this is good

MarcoGorelli · 2024-08-13T16:02:51Z

cool, thanks!

> If you had some way to access an arrow table from the underlying dataframe

we do have narwhals.DataFrame.to_arrow, which returns a pyarrow table - is that what you meant? If so, what else would you suggest adding?

kylebarron · 2024-08-13T16:11:40Z

In that case I'd suggest

def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
    try:
        return self._compliant_frame._native_frame.__arrow_c_stream__(
            requested_schema=requested_schema
        )
    except AttributeError:
        pyarrow_table = self.to_arrow()
        return pyarrow_table.__arrow_c_stream__(requested_schema=requested_schema)

So in the first case, we can use the source's implementation of the pycapsule interface and pyarrow doesn't need to be installed, while in the second case we can still ensure the method never raises an exception.
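As a rough illustration of the two paths described above, here is a minimal, self-contained sketch. It uses neither Narwhals nor PyArrow: `ArrowTableStub`, `NativeWithCapsule`, `NativeWithoutCapsule` and `Frame` are hypothetical stand-ins, and plain strings take the place of real PyCapsule objects.

```python
class ArrowTableStub:
    """Hypothetical stand-in for a pyarrow Table."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real table would return a PyCapsule; a string is used for illustration.
        return "capsule-from-pyarrow-fallback"


class NativeWithCapsule:
    """Hypothetical native frame that implements the interface itself."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "capsule-from-native"


class NativeWithoutCapsule:
    """Hypothetical native frame with no __arrow_c_stream__ method."""


class Frame:
    """Hypothetical wrapper playing the role of the Narwhals DataFrame."""

    def __init__(self, native):
        self._native = native

    def to_arrow(self):
        # Assumption: conversion to an Arrow table is always possible here.
        return ArrowTableStub()

    def __arrow_c_stream__(self, requested_schema=None):
        native_export = getattr(self._native, "__arrow_c_stream__", None)
        if native_export is not None:
            # First case: delegate to the source's own implementation.
            return native_export(requested_schema=requested_schema)
        # Second case: go through an Arrow table so the call still succeeds.
        return self.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)


direct = Frame(NativeWithCapsule()).__arrow_c_stream__()
fallback = Frame(NativeWithoutCapsule()).__arrow_c_stream__()
```

Looking the method up with `getattr` rather than catching `AttributeError` is a choice made for this sketch only: it avoids masking an `AttributeError` raised from inside the native implementation.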

MarcoGorelli · 2024-08-13T16:27:06Z

thanks! though I think the user could call this themselves? e.g.

df: nw.DataFrame
try:
    result = df.__arrow_c_stream__(requested_schema=requested_schema)
except AttributeError:
    result = df.to_arrow().__arrow_c_stream__(requested_schema=requested_schema)

?
I think that'd be more explicit; I'm not totally sure we should be calling to_arrow on behalf of the user. By 'user', I mean the library developer using Narwhals. I'd be inclined to leave it up to them whether or not to fall back to PyArrow

kylebarron · 2024-08-13T16:32:57Z

Well, one of my primary arguments for the pycapsule interface is that it allows an ecosystem of data producers and consumers to interoperate without any knowledge of the other, solely by looking for an __arrow_c_stream__ dunder method. Calling .to_arrow() would indeed be more explicit, but it would require library consumers to know about narwhals, which I'd argue in the general case is not true. E.g. pyarrow.table() only knows to check for __arrow_c_stream__.
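The consumer side of that argument can be sketched roughly as follows. This is an illustration only: `consume`, `ProducerA`, `ProducerB` and `NotAProducer` are hypothetical names, and strings stand in for the capsules a real producer would return.

```python
def consume(obj):
    """Hypothetical consumer that knows nothing about the producer's library."""
    export = getattr(obj, "__arrow_c_stream__", None)
    if export is None:
        msg = f"{type(obj).__name__} does not expose __arrow_c_stream__"
        raise TypeError(msg)
    # The only contract relied upon is the dunder method itself.
    return export()


class ProducerA:
    """Hypothetical producer from one library."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "stream-from-A"


class ProducerB:
    """Hypothetical producer from an unrelated library."""

    def __arrow_c_stream__(self, requested_schema=None):
        return "stream-from-B"


class NotAProducer:
    """Hypothetical object without the dunder method."""


results = [consume(ProducerA()), consume(ProducerB())]
```

Neither producer is imported or type-checked by name in `consume`, which is the interoperability property being argued for.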

MarcoGorelli · 2024-08-13T16:40:01Z

I was thinking more of the Vegafusion case - I think it's better for them if they choose to explicitly call to_arrow. Otherwise they might be calling __arrow_c_stream__ all over the place, thinking it's cheap, whereas it would've been better to do a single to_arrow upfront 😇

I think any library developer using Narwhals (e.g. vegafusion) would and should know about Narwhals, whereas lower-level libraries like PyArrow shouldn't:

- PyArrow can just check for __arrow_c_stream__ (as it currently does)
- Vegafusion can choose whether and when to call to_arrow before accessing __arrow_c_stream__

MarcoGorelli · 2024-08-13T16:43:28Z

In any case, as they often say, "in open source, 'no' is temporary but 'yes' is forever" - doubly so with our stable api policy 😆

So, as the current implementation looks good to you, I'd say - let's start with that, we can always loosen it later if necessary

Thanks for your review and input, much appreciated 🙏 !

kylebarron · 2024-08-13T16:57:39Z

> Otherwise they might be calling __arrow_c_stream__ all over the place, thinking it's cheap, whereas it would've been better to do a single to_arrow upfront

As a general note, consumers can't do this because consumers don't have a way to know whether the source is a table that already exists in memory or whether it's a stream that can only be called once. E.g. a pyarrow RecordBatchReader is a stream and you can only call __arrow_c_stream__ once. So for a consumer like vegafusion, it would be important for it to import all the data once and then operate as it needs to on it.
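To make the table-versus-stream distinction concrete, here is a small sketch under stated assumptions. `InMemoryTable`, `OneShotReader` and `load_once` are hypothetical, and Python iterators stand in for the PyCapsule a real source would hand out; nothing here is PyArrow's actual behaviour beyond the "export once" property described above.

```python
class InMemoryTable:
    """Hypothetical table-like source: can be exported any number of times."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __arrow_c_stream__(self, requested_schema=None):
        return iter(self._rows)  # stand-in for a fresh capsule on each call


class OneShotReader:
    """Hypothetical stream-like source: export works only once."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self._exported = False

    def __arrow_c_stream__(self, requested_schema=None):
        if self._exported:
            raise RuntimeError("stream was already consumed")
        self._exported = True
        return self._rows


def load_once(source):
    """Import all the data a single time, then work on the local copy."""
    return list(source.__arrow_c_stream__())


# The consumer cannot tell which kind of source it received,
# so it materializes once and reuses the result.
data = load_once(OneShotReader([1, 2, 3]))
total = sum(data)
largest = max(data)
```

Both computations run against `data`, so the source's `__arrow_c_stream__` is called exactly once regardless of which kind of source was passed in.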

MarcoGorelli · 2024-08-13T17:19:30Z

Ah nice, thanks for explaining!

In that case, I'm leaning more towards your suggestion - it would also mean being able to support this for versions of pandas prior to 2.2 that can still be converted to a PyArrow table

kylebarron · 2024-08-13T18:08:05Z

narwhals/series.py

+            raise ModuleNotFoundError(msg) from exc
+        if parse_version(pa.__version__) < (14, 0):  # pragma: no cover
+            msg = f"PyArrow>=14.0.0 is required for `__arrow_c_stream__` for object of type {type(native_series)}"
+            raise ModuleNotFoundError(msg)
        ca = pa.chunked_array([self.to_arrow()])


kylebarron: I think this might require pyarrow 15

MarcoGorelli: in pandas the requirement is PyArrow 14+ (I also just ran the tests with pyarrow 13 and 14 - the former fails, the latter passes)

MarcoGorelli: ah sorry, that's for DataFrame. looks like it's even PyArrow 16+ for chunkedarray?

kylebarron: It was added to pa.chunked_array in a later release, yes. I think it was here: apache/arrow#40818
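The kind of version gate shown in the diff can be sketched with the standard library alone. This is not the `parse_version` helper used in the PR: `parse_version_tuple` and `meets_minimum` are hypothetical, and the two minimums are simply the numbers mentioned in this thread, which should be treated as unverified.

```python
from __future__ import annotations

import re


def parse_version_tuple(version: str) -> tuple[int, ...]:
    """Turn '16.1.0' or '17.0.0.dev5' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev5'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple[int, ...]) -> bool:
    """Compare versions after zero-padding both sides to the same length."""
    have = parse_version_tuple(installed)
    width = max(len(have), len(minimum))
    have += (0,) * (width - len(have))
    want = minimum + (0,) * (width - len(minimum))
    return have >= want


# Minimums as discussed in the thread; assumptions, not verified facts.
MIN_FOR_TABLE = (14, 0)
MIN_FOR_CHUNKED_ARRAY = (16, 0)

table_ok = meets_minimum("14.0.2", MIN_FOR_TABLE)
chunked_ok = meets_minimum("14.0.2", MIN_FOR_CHUNKED_ARRAY)
```

Padding before comparing avoids a subtle tuple-ordering issue: without it, `(14,) < (14, 0)` is true in Python, so a bare "14" would be rejected against a minimum of `(14, 0)`.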

FBruzzesi · 2024-08-13T19:00:43Z

Wow I am learning a bunch from this PR 🙌🏼

@MarcoGorelli probably worth adding these methods in the api docs as well 😇?!

MarcoGorelli · 2024-08-14T07:17:30Z

thanks Kyle for your help!

cool, let's ship this 🚢

feat: Support Arrow PyCapsule

256860e

github-actions bot added the enhancement New feature or request label Aug 13, 2024

kylebarron approved these changes Aug 13, 2024

kylebarron mentioned this pull request Aug 13, 2024

[Python] Promote usage of the Arrow PyCapsule Protocol (for the C Data Inteface) apache/arrow#39195

Open

8 tasks

MarcoGorelli added 2 commits August 13, 2024 18:34

fallback to pyarrow

9f19a73

set minimum pyarrow version

c03ac8c

kylebarron reviewed Aug 13, 2024

MarcoGorelli added 3 commits August 13, 2024 19:20

Merge remote-tracking branch 'upstream/main' into py-capsule

ec970c6

fixup

b937b5f

correct min version

086d45a

MarcoGorelli added 2 commits August 13, 2024 20:55

add to reference

f14f7ea

fixup

c738f7f

MarcoGorelli merged commit 350fe7d into narwhals-dev:main Aug 14, 2024
21 checks passed

This was referenced Aug 14, 2024

feat(python!): Use Altair in DataFrame.plot pola-rs/polars#17995

Merged

[python-package] Adding support for polars for input data microsoft/LightGBM#6204

Open

kylebarron mentioned this pull request Sep 4, 2024

Support Arrow PyCapsule Interface vega/altair#3568

Open
