DEP: make pyarrow required dependency #50285

jbrockmendel · 2022-12-15T20:18:34Z

Doing this in 2.0 would open up options in 2.x. Off the top of my head:

Poking at the cython pyarrow API
Transitioning some of our internal EAs to be pyarrow-backed

lithomas1 · 2022-12-16T00:46:58Z

I am -1 on this for now. pyarrow is quite a large dependency, and I would wait until its more integrated before doing this.

Poking at the cython pyarrow API

I think this would be pretty cool. However, it would make build/infra work harder(ABI compatibility is tricky). I'm also not sure how I feel about C++(I guess we already use C++ stdlib in rolling/window Cython code?).
Ideally, we should be able to use the Python buffer protocol(Cython memoryviews) to operate on pyarrow.

This also might be something we can do without making pyarrrow mandatory. IIUC, we just need to compile against the minimum version of pyarrow(if it has ABI compat like numpy), but we can make importing the Cython extension module lazy. (So it would be a build dep, but not a runtime dep)

Transitioning some of our internal EAs to be pyarrow-backed

It's probably better to live with two implementations(numpy and arrow) for now.

(Maybe off-topic, but it might be nice to have nullable by default before then, since all of pyarrows dtypes are nullable I think)

Also cc @jorisvandenbossche

jorisvandenbossche · 2022-12-16T14:54:26Z

Yes, I also don't think that makes sense at this point. For example, I think that would certainly make sense if we would consider using arrow memory by default, but as long as that work is experimental, I think the dependency can also be optional?

(and I am all for using pyarrow more and more in pandas, but I think all that work can happen while pyarrow being optional?)

In addition, I think pyarrow should also first figure out how it can be distributed in separate smaller packages (to counteract the "pyarrow is quite a large dependency") before we take it as a required dependency.

Poking at the cython pyarrow API

There are different aspects here, I think, and that is 1) using arrow data in our cython code (like we interact with numpy arrays) and 2) using actual pyarrow cython APIs (cimport from pyarrow).

The second is quite a complication to packaging (wheels). For example, we don't guarantee stability like numpy, and so typically you have to pin to a specific version (I think this is required, I have never been involved in a package that does this myself). Depending on whether you link with arrow-cpp parts, you might also need to include those libs in the wheel as well (eg the snowflake python connector currently does this).

However, depending on what the goal would be, one doesn't necessarily need the cython APIs. To be able to work with arrow data in cython, there are other options:

Use the python buffer protocol to view those as memory views, and this pass the raw data from python to our cython algorithms (currently it exposes the buffers as generic bytes array, so that means you need to reinterpret it as the correct data type, eg arr = pa.array([1, 2, 3], type="int32"); view = np.array(arr.buffers()[1]).view(np.int32)). Of course, that means we have to take care of handling missing values ourselves (but I think that is generally expected for our cython algorithms, that's the same for our own masked arrays)
Use the Arrow C data interface to access those data at the C / cython level (https://arrow.apache.org/docs/dev/format/CDataInterface.html). There are helpers in development to work with this more easily (https://github.com/apache/arrow-nanoarrow), and this allows you to consume and produce arrow data without depending on the arrow C++ / pyarrow library.

jbrockmendel · 2022-12-16T16:50:41Z

These seem like good reasons to hold off, particularly the packaging/ABI thing.

WillAyd · 2023-03-03T16:16:12Z

Couldn't we just use arrow as a build time dependency then do runtime checks if it is installed before calling its Cython functions?

jorisvandenbossche · 2023-03-03T16:22:26Z

As mentioned above, (py)arrow doesn't guarantee a stable C ABI, so I am not sure if that's feasible without pinning to an exact version (which I don't think we want to do). To be clear, I never tried this, so I am honestly not sure how it exactly would work for the pyarrow cython APIs.

Further, it's not fully clear to me what we would exactly want to do with pyarrow in cython. If it's "just" a matter of accessing the data (the buffers), you can unpack those without requiring pyarrow in cython (either unpacking before using pyarrow, or using the c data interface, see the two bullet points at the end of my message above)

WillAyd · 2023-03-03T16:29:49Z

Yea understood on the stability, though that's going to be a bit of a chicken or the egg thing too. We can wait for arrow to decide what should be stable, or we can start building against some things knowing the risks and just pinning versions / providing compat as needed.

Definitely agree to your larger point of needing refinement on the things we expect to use, just don't want that research to be dissuaded by some of the other issues listed that I think, while not ideal, also aren't technically impossible

jbrockmendel · 2023-03-03T16:31:38Z

In case I was unclear in the OP, I said "poking" at cython usage because I have no idea what is actually there and if there is anything we would actually use. I would hope there are cool things available/coming, but have nothing in mind.

jbrockmendel · 2023-03-03T16:36:19Z

I guess if we do start implementing e.g. ArrowIndexingEngine it would be nice to have tight type declarations

jbrockmendel · 2023-03-03T16:49:42Z

The main reason I am looking forward to making pyarrow required is that there are a bunch of issues made much easier to solve with pyarrow dtypes. e.g. #22720

jbrockmendel · 2023-04-04T14:27:45Z

Closing in favor of a more targeted issue to be opened by @phofl

jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 15, 2022

lithomas1 added Dependencies Required and optional dependencies Arrow pyarrow functionality Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 16, 2022

simonjayhawkins changed the title ~~DEP: make pyarrow required in 2.0~~ DEP: make pyarrow required dependency Feb 21, 2023

jbrockmendel mentioned this issue Mar 3, 2023

Provide a way to convert Arrow tables to Arrow-backed dataframes #51760

Closed

mroeschke mentioned this issue Mar 17, 2023

BUG: Pyarrow scalars inferred as object dtype with arrow backend enabled whwn assigning a new column #52056

Open

3 tasks

jbrockmendel closed this as completed Apr 4, 2023

datapythonista mentioned this issue Apr 7, 2023

Make pyarrow a required dependency #52509

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEP: make pyarrow required dependency #50285

DEP: make pyarrow required dependency #50285

jbrockmendel commented Dec 15, 2022 •

edited

Loading

lithomas1 commented Dec 16, 2022 •

edited

Loading

jorisvandenbossche commented Dec 16, 2022

jbrockmendel commented Dec 16, 2022

WillAyd commented Mar 3, 2023

jorisvandenbossche commented Mar 3, 2023

WillAyd commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Apr 4, 2023

DEP: make pyarrow required dependency #50285

DEP: make pyarrow required dependency #50285

Comments

jbrockmendel commented Dec 15, 2022 • edited Loading

lithomas1 commented Dec 16, 2022 • edited Loading

jorisvandenbossche commented Dec 16, 2022

jbrockmendel commented Dec 16, 2022

WillAyd commented Mar 3, 2023

jorisvandenbossche commented Mar 3, 2023

WillAyd commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Mar 3, 2023

jbrockmendel commented Apr 4, 2023

jbrockmendel commented Dec 15, 2022 •

edited

Loading

lithomas1 commented Dec 16, 2022 •

edited

Loading