Consider moving pyarrow's pandas compatibility and conversion code to the pandas project? #59780

jorisvandenbossche · 2024-09-11T13:45:51Z

This issue is to discuss the idea of moving a significant part of the pandas conversion and compatibility code that currently lives in pyarrow to the pandas project itself. Of course we would keep all low-level conversions (e.g. everything that lives in pyarrow C++) at the array-level in pyarrow itself (i.e. what pandas would use), but I think that a large part of what lives in pyarrow/pandas_compat.py could live in pandas.

Some reasons to do this:

- It's a lot of pandas-specific code that might "fit" better in pandas itself
- It would allow pandas to control the conversion more tightly
  - Example: with the upcoming pandas 3.0 and the new string dtype, pandas could ensure that the new dtype is used in any conversion, whereas currently to_pandas() in older versions of pyarrow will still give object dtype ([Python] Support pandas future default string dtype apache/arrow#43683)
- The required low-level functionality in pyarrow should by now be stable enough to allow this code to live in pandas itself (which might not have been the case at the inception of pyarrow)

A potential downside is that it makes the dependency structure even more complex (pyarrow's to_pandas() relying on pandas relying on pyarrow), although pyarrow already has infrastructure set up to lazily import pandas today.
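To illustrate what lazily importing a dependency means here, below is a minimal, generic sketch of the pattern. It is not pyarrow's actual implementation: the `LazyModule` class is a hypothetical name introduced for illustration, and the stdlib `json` module stands in for pandas so that the snippet runs without either library installed.

```python
import importlib


class LazyModule:
    """Proxy that defers importing a module until it is first used.

    Hypothetical helper for illustration only, not pyarrow's mechanism.
    """

    def __init__(self, name):
        self._name = name
        self._module = None

    def _load(self):
        # Import on first use and cache the module object afterwards.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return self._module

    @property
    def loaded(self):
        return self._module is not None

    def __getattr__(self, attr):
        # Only reached for attributes not found on the proxy itself,
        # so lookups are forwarded to the real module.
        return getattr(self._load(), attr)


# "json" is a stand-in for "pandas" in this sketch.
lazy_dep = LazyModule("json")
print(lazy_dep.loaded)          # nothing imported yet
print(lazy_dep.dumps([1, 2]))   # first attribute access triggers the import
print(lazy_dep.loaded)
```

With this kind of indirection, a circular dependency such as the one described above only becomes an actual import once a conversion function is called, not when the package itself is imported.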

The idea is not that we would change any public pyarrow API that supports pandas (ingesting pandas in various pyarrow constructors, to_pandas() methods on objects, etc), but that at least for the DataFrame and Series level, pyarrow would under the hood rely on a method from pandas to do that conversion.
For example, I think that most of the handling of the "pandas metadata" (to guarantee a better pandas <-> arrow roundtrip) could live in pandas itself, as could the code to convert column labels to strings and reconstruct an Index in the other direction, the logic determining which columns should be converted to an extension dtype vs a numpy dtype, etc.
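As a rough illustration of the kind of pure-Python bookkeeping meant here, the sketch below stringifies column labels (Arrow field names are strings) and records enough information in a JSON blob to restore the original labels afterwards. It is a simplified stand-in, not the actual "pandas metadata" format or the code in pyarrow/pandas_compat.py. The function names `encode_labels` and `decode_labels` and the `label_type` key are assumptions made up for this example.

```python
import json

# Hypothetical mapping from a recorded type name back to a constructor.
_RESTORERS = {"int": int, "float": float, "str": str}


def encode_labels(labels):
    """Stringify column labels and record how to restore them."""
    metadata = {
        "columns": [
            {"name": str(label), "label_type": type(label).__name__}
            for label in labels
        ]
    }
    names = [entry["name"] for entry in metadata["columns"]]
    # Serialize to bytes, similar in spirit to schema-level metadata.
    return names, json.dumps(metadata).encode("utf-8")


def decode_labels(names, raw_metadata):
    """Rebuild the original labels from string names plus metadata."""
    metadata = json.loads(raw_metadata.decode("utf-8"))
    types_by_name = {
        entry["name"]: entry["label_type"] for entry in metadata["columns"]
    }
    restored = []
    for name in names:
        # Names without a metadata entry (or with an unknown type) stay strings.
        label_type = types_by_name.get(name, "str")
        restored.append(_RESTORERS.get(label_type, str)(name))
    return restored


names, raw = encode_labels([0, 1, "total"])
print(names)
print(decode_labels(names, raw))
```

Note that this toy version cannot distinguish the labels 0 and "0" once both are stringified. Edge cases like that are part of what the real conversion code has to handle.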

There are of course a lot of details to figure out, but I wanted to open the issue already to get a general idea of what people think about this, and whether we want to maintain this in pandas.

Equivalent issue on the pyarrow side: apache/arrow#44068

jorisvandenbossche · 2024-09-11T13:52:07Z

Two additional notes:

- If this idea is welcomed, then as a pyarrow maintainer I would be happy to work on the actual migration and refactoring of the code
- Short term, I would essentially just migrate the code (cleaning it up / refactoring it along the way), i.e. keep using pyarrow to do this conversion. Depending on the discussions in PDEP-15, we could later also consider having a version of this that does not depend on pyarrow (but in any case, I think the bulk of the code will be Python code dealing with metadata, index conversion, etc., and not the actual conversion of arrays, and that part is independent of whether pyarrow or something else is used to construct the Arrow memory)

jorisvandenbossche added the Needs Discussion (Requires discussion from core team before further action) and Arrow (pyarrow functionality) labels Sep 11, 2024

jorisvandenbossche mentioned this issue Sep 11, 2024

[Python] Move pandas compatibility and conversion code to the pandas project? apache/arrow#44068

Open

jorisvandenbossche mentioned this issue Nov 14, 2024

BUG: read_parquet converts pyarrow list type to numpy dtype #53011

Open

3 tasks

jorisvandenbossche mentioned this issue Jan 5, 2025

API (string dtype): comparisons between different string classes #60639

Closed
