Support PyArrow arrays and dataframes #2800

Open
weiji14 opened this issue Nov 7, 2023 · 3 comments
Labels: feature request (New feature wanted), help wanted (Helping hands are appreciated), longterm (Long standing issues that need to be resolved)

Member

weiji14 commented Nov 7, 2023

Description of the desired feature

Apache Arrow is an in-memory format that is starting to become a common exchange format between different libraries in Python and other programming languages. For example:

This issue is to track compatibility and support of different PyArrow data types in PyGMT:

| Dtype | Implementation PR | Status | Notes |
|---|---|---|---|
| Numerical (uint/int/float) | #2774 | | |
| String | #2933 | 🚧 | May require modifying the put_strings method that currently uses np.char.encode |
| Date/Time | #2845 (pandas) and TODO (raw pyarrow) | 🚧 | May require modifying array_to_datetime that expects Python datetime or numpy-backed arrays, xref #242 and #3507 |
| Duration | TODO | | https://arrow.apache.org/docs/13.0/python/generated/pyarrow.duration.html, wait for #2884 also |
| Special case: geopandas.GeoDataFrame with PyArrow dtype columns | TODO | | See #2774 (comment) |
| GeoArrow geometry | TODO | | https://github.com/geoarrow/geoarrow-python |

The simplest way of integrating would be to just handle PyArrow-backed pandas.DataFrame objects as above.

Alternatively, we could discuss using PyArrow as the internal array representation (which would make pyarrow a hard dependency), since it may allow better interoperability with other Python libraries that use Arrow; this might be relevant for #1318 and #2731. My thought is to do this through the __dataframe__ protocol, see https://arrow.apache.org/docs/python/interchange_protocol.html

Further reading:

Are you willing to help implement and maintain this feature?

Yes, but help is welcome too!

weiji14 added the help wanted and feature request labels Nov 7, 2023
weiji14 added this to the 0.12.0 milestone Nov 7, 2023
weiji14 self-assigned this Nov 7, 2023
weiji14 changed the title from "Support pyarrow arrays and dataframes" to "Support PyArrow arrays and dataframes" Nov 7, 2023
Member

seisman commented Dec 7, 2023

Before supporting pyarrow-backed pandas objects like what you're doing in PR #2774 and #2845, maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

Member Author

weiji14 commented Dec 7, 2023

> maybe we should check/support passing pyarrow arrays directly to PyGMT? If all/most pyarrow dtypes work, then we can go on with pyarrow-backed pandas objects. Then if interested, we may support polars.

Sure, I'd love to have direct support for PyArrow arrays too. I started with pyarrow-backed pandas objects because pandas 3.0 will eventually use PyArrow for string columns by default, but there's no reason we can't support passing a pyarrow.array object directly into PyGMT.

Member Author

weiji14 commented Dec 9, 2023

> check/support passing pyarrow arrays directly to PyGMT

Just opened a PR for this at #2864. Surprisingly, most PyGMT functions already work with pyarrow.array or pyarrow.table without any modification (I've tested blockm, info, nearneighbor, project, triangulate, xyz2grd so far), possibly because PyGMT can convert them internally to numpy.array (see e.g. https://arrow.apache.org/docs/python/numpy.html#arrow-to-numpy). Will need to test out more complicated dtypes and check for edge cases, but it's looking promising!
