DEP: make pyarrow required dependency #50285
Comments
I am -1 on this for now. pyarrow is quite a large dependency, and I would wait until it's more integrated before doing this.
I think this would be pretty cool. However, it would make build/infra work harder (ABI compatibility is tricky). I'm also not sure how I feel about C++ (I guess we already use the C++ stdlib in rolling/window Cython code?). This also might be something we can do without making pyarrow mandatory. IIUC, we just need to compile against the minimum version of pyarrow (if it has ABI compat like numpy), but we can make importing the Cython extension module lazy. (So it would be a build dep, but not a runtime dep.)
It's probably better to live with two implementations (numpy and arrow) for now. (Maybe off-topic, but it might be nice to have nullable by default before then, since all of pyarrow's dtypes are nullable, I think.) Also cc @jorisvandenbossche
Yes, I also don't think that makes sense at this point. For example, I think that would certainly make sense if we were to consider using arrow memory by default, but as long as that work is experimental, I think the dependency can also be optional? (And I am all for using pyarrow more and more in pandas, but I think all that work can happen while pyarrow remains optional?) In addition, I think pyarrow should also first figure out how it can be distributed in separate smaller packages (to counteract the "pyarrow is quite a large dependency" concern) before we take it as a required dependency.
There are different aspects here, I think, and that is 1) using arrow data in our cython code (like we interact with numpy arrays) and 2) using actual pyarrow cython APIs ( The second is quite a complication to packaging (wheels). For example, we don't guarantee stability like numpy, and so typically you have to pin to a specific version (I think this is required, I have never been involved in a package that does this myself). Depending on whether you link with arrow-cpp parts, you might also need to include those libs in the wheel as well (eg the snowflake python connector currently does this). However, depending on what the goal would be, one doesn't necessarily need the cython APIs. To be able to work with arrow data in cython, there are other options:
These seem like good reasons to hold off, particularly the packaging/ABI thing.
Couldn't we just use arrow as a build-time dependency, then do runtime checks to see if it is installed before calling its Cython functions?
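For illustration, the runtime-check half of this suggestion could look roughly like the sketch below. It only covers detecting whether an optional package is importable and picking a code path; it says nothing about the ABI question. The helper names `import_optional` and `choose_backend` are hypothetical and are not pandas APIs.

```python
import importlib
import importlib.util


def import_optional(name):
    """Return the top-level module `name` if it is installed, else None.

    Hypothetical helper: the module is only imported when it is found,
    so a missing optional dependency never raises at import time.
    """
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


def choose_backend():
    """Pick a code path depending on whether pyarrow is importable."""
    # Assumption: an arrow-specific extension module would be imported
    # lazily here, only on the "arrow" branch.
    return "arrow" if import_optional("pyarrow") is not None else "numpy"


if __name__ == "__main__":
    print(choose_backend())
```

The check runs at call time rather than at module import, which is what keeps the dependency optional at runtime.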
As mentioned above, (py)arrow doesn't guarantee a stable C ABI, so I am not sure if that's feasible without pinning to an exact version (which I don't think we want to do). To be clear, I never tried this, so I am honestly not sure how exactly it would work for the pyarrow cython APIs. Further, it's not fully clear to me what exactly we would want to do with pyarrow in cython. If it's "just" a matter of accessing the data (the buffers), you can unpack those without requiring pyarrow in cython (either unpacking beforehand using pyarrow, or using the C Data Interface; see the two bullet points at the end of my message above).
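As a rough illustration of "just accessing the buffers": an Arrow primitive array is a validity bitmap plus a contiguous data buffer, so once the raw bytes are in hand they can be decoded without any pyarrow API. The sketch below is plain Python, assumes an int64 array with zero offset and little-endian data, and uses a hypothetical function name; real code would do this in Cython over the buffer pointers.

```python
import struct


def decode_int64_buffers(validity, data, length):
    """Decode an Arrow-style int64 array from its two raw buffers.

    validity: bytes or None; bit i (least-significant bit first within
              each byte) is 1 when element i is valid. None means no nulls.
    data:     bytes holding `length` little-endian 64-bit integers.
    Assumes the array offset is zero.
    """
    values = struct.unpack(f"<{length}q", data[: length * 8])
    out = []
    for i, value in enumerate(values):
        is_valid = validity is None or (validity[i // 8] >> (i % 8)) & 1
        out.append(value if is_valid else None)
    return out


if __name__ == "__main__":
    # Buffers for the logical array [1, null, 3]; the slot under the
    # null holds an arbitrary value (0 here).
    bitmap = bytes([0b00000101])
    payload = struct.pack("<3q", 1, 0, 3)
    print(decode_int64_buffers(bitmap, payload, 3))
```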
Yea, understood on the stability, though that's going to be a bit of a chicken-or-the-egg thing too. We can wait for arrow to decide what should be stable, or we can start building against some things knowing the risks and just pinning versions / providing compat as needed. Definitely agree with your larger point about needing refinement on the things we expect to use; I just don't want that research to be discouraged by some of the other issues listed, which, while not ideal, also aren't technically impossible.
In case I was unclear in the OP, I said "poking" at cython usage because I have no idea what is actually there and if there is anything we would actually use. I would hope there are cool things available/coming, but have nothing in mind.
I guess if we do start implementing e.g. ArrowIndexingEngine it would be nice to have tight type declarations.
The main reason I am looking forward to making pyarrow required is that there are a bunch of issues made much easier to solve with pyarrow dtypes, e.g. #22720.
Closing in favor of a more targeted issue to be opened by @phofl.
Doing this in 2.0 would open up options in 2.x. Off the top of my head: