Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEP: make pyarrow required dependency #50285

Closed
jbrockmendel opened this issue Dec 15, 2022 · 10 comments
Closed

DEP: make pyarrow required dependency #50285

jbrockmendel opened this issue Dec 15, 2022 · 10 comments
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Dec 15, 2022

Doing this in 2.0 would open up options in 2.x. Off the top of my head:

  1. Poking at the cython pyarrow API
  2. Transitioning some of our internal EAs to be pyarrow-backed
@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 15, 2022
@lithomas1
Copy link
Member

lithomas1 commented Dec 16, 2022

I am -1 on this for now. pyarrow is quite a large dependency, and I would wait until its more integrated before doing this.

  1. Poking at the cython pyarrow API

I think this would be pretty cool. However, it would make build/infra work harder(ABI compatibility is tricky). I'm also not sure how I feel about C++(I guess we already use C++ stdlib in rolling/window Cython code?).
Ideally, we should be able to use the Python buffer protocol(Cython memoryviews) to operate on pyarrow.

This also might be something we can do without making pyarrrow mandatory. IIUC, we just need to compile against the minimum version of pyarrow(if it has ABI compat like numpy), but we can make importing the Cython extension module lazy. (So it would be a build dep, but not a runtime dep)

  1. Transitioning some of our internal EAs to be pyarrow-backed

It's probably better to live with two implementations(numpy and arrow) for now.

(Maybe off-topic, but it might be nice to have nullable by default before then, since all of pyarrows dtypes are nullable I think)

Also cc @jorisvandenbossche

@lithomas1 lithomas1 added Dependencies Required and optional dependencies Arrow pyarrow functionality Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 16, 2022
@jorisvandenbossche
Copy link
Member

Yes, I also don't think that makes sense at this point. For example, I think that would certainly make sense if we would consider using arrow memory by default, but as long as that work is experimental, I think the dependency can also be optional?

(and I am all for using pyarrow more and more in pandas, but I think all that work can happen while pyarrow being optional?)

In addition, I think pyarrow should also first figure out how it can be distributed in separate smaller packages (to counteract the "pyarrow is quite a large dependency") before we take it as a required dependency.

Poking at the cython pyarrow API

There are different aspects here, I think, and that is 1) using arrow data in our cython code (like we interact with numpy arrays) and 2) using actual pyarrow cython APIs (cimport from pyarrow).

The second is quite a complication to packaging (wheels). For example, we don't guarantee stability like numpy, and so typically you have to pin to a specific version (I think this is required, I have never been involved in a package that does this myself). Depending on whether you link with arrow-cpp parts, you might also need to include those libs in the wheel as well (eg the snowflake python connector currently does this).

However, depending on what the goal would be, one doesn't necessarily need the cython APIs. To be able to work with arrow data in cython, there are other options:

  • Use the python buffer protocol to view those as memory views, and this pass the raw data from python to our cython algorithms (currently it exposes the buffers as generic bytes array, so that means you need to reinterpret it as the correct data type, eg arr = pa.array([1, 2, 3], type="int32"); view = np.array(arr.buffers()[1]).view(np.int32)). Of course, that means we have to take care of handling missing values ourselves (but I think that is generally expected for our cython algorithms, that's the same for our own masked arrays)
  • Use the Arrow C data interface to access those data at the C / cython level (https://arrow.apache.org/docs/dev/format/CDataInterface.html). There are helpers in development to work with this more easily (https://github.com/apache/arrow-nanoarrow), and this allows you to consume and produce arrow data without depending on the arrow C++ / pyarrow library.

@jbrockmendel
Copy link
Member Author

These seem like good reasons to hold off, particularly the packaging/ABI thing.

@simonjayhawkins simonjayhawkins changed the title DEP: make pyarrow required in 2.0 DEP: make pyarrow required dependency Feb 21, 2023
@WillAyd
Copy link
Member

WillAyd commented Mar 3, 2023

Couldn't we just use arrow as a build time dependency then do runtime checks if it is installed before calling its Cython functions?

@jorisvandenbossche
Copy link
Member

As mentioned above, (py)arrow doesn't guarantee a stable C ABI, so I am not sure if that's feasible without pinning to an exact version (which I don't think we want to do). To be clear, I never tried this, so I am honestly not sure how it exactly would work for the pyarrow cython APIs.

Further, it's not fully clear to me what we would exactly want to do with pyarrow in cython. If it's "just" a matter of accessing the data (the buffers), you can unpack those without requiring pyarrow in cython (either unpacking before using pyarrow, or using the c data interface, see the two bullet points at the end of my message above)

@WillAyd
Copy link
Member

WillAyd commented Mar 3, 2023

Yea understood on the stability, though that's going to be a bit of a chicken or the egg thing too. We can wait for arrow to decide what should be stable, or we can start building against some things knowing the risks and just pinning versions / providing compat as needed.

Definitely agree to your larger point of needing refinement on the things we expect to use, just don't want that research to be dissuaded by some of the other issues listed that I think, while not ideal, also aren't technically impossible

@jbrockmendel
Copy link
Member Author

In case I was unclear in the OP, I said "poking" at cython usage because I have no idea what is actually there and if there is anything we would actually use. I would hope there are cool things available/coming, but have nothing in mind.

@jbrockmendel
Copy link
Member Author

I guess if we do start implementing e.g. ArrowIndexingEngine it would be nice to have tight type declarations

@jbrockmendel
Copy link
Member Author

The main reason I am looking forward to making pyarrow required is that there are a bunch of issues made much easier to solve with pyarrow dtypes. e.g. #22720

@jbrockmendel
Copy link
Member Author

Closing in favor of a more targeted issue to be opened by @phofl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Dependencies Required and optional dependencies Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

4 participants