Skip to content

ENH: Incorproate ArrowDtype into ArrowExtensionArray #47034

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 48 commits into from
Jun 9, 2022

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented May 16, 2022

Not fully user facing yet.

Supersedes #46972

cc @jbrockmendel let me know if this is what you had in mind

@mroeschke mroeschke mentioned this pull request May 16, 2022
3 tasks
@jbrockmendel
Copy link
Member

yah this is what i had in mind

@mroeschke mroeschke added Enhancement Arrow pyarrow functionality labels May 17, 2022
@mroeschke mroeschke added this to the 1.5 milestone May 17, 2022
@mroeschke
Copy link
Member Author

Have this in a pretty good state now.

  • Not user facing yet
  • BaseConstructorsTests and BaseGetitemTests are running. Will plan to add more in subsequent PRs

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments on the tests.

I also noticed that the equality of the dtype is not working properly (which I assume also has some effect on other things):

In [25]: from pandas.core.arrays.arrow import ArrowExtensionArray

In [26]: arr1 = ArrowExtensionArray(pa.array([1, 2, 3], pa.int64()))

In [27]: arr2 = ArrowExtensionArray(pa.array([1, 2, 3], pa.float64()))

In [28]: arr1.dtype
Out[28]: int64[pyarrow]

In [29]: arr2.dtype
Out[29]: double[pyarrow]

In [30]: arr1.dtype == arr2.dtype
Out[30]: True

@@ -91,12 +140,9 @@ def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None:
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to note somewhere (rather for a follow-up) that the above will only work for pyarrow types that have a matching numpy dtype, which is not the case for all types (eg decimal, dictionary, date, nested types, etc)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defaulted the numpy_dtype to np.dtype(object) to those pyarrow types without a corresponding numpy type.

https://github.com/pandas-dev/pandas/pull/47034/files#diff-251a2430af30e9f685a07dbee6d5c024e832fc7e10569c7d43ee46bc115b422dR56

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that will help (but without having tests that cover this it is hard to say). If .numpy_dtype returns object dtype, then the common dtype here will also be object dtype, and then pa.from_numpy_dtype below will fail (so this function will return None). That means that there can never be a proper ArrowDtype common dtype (that is not object dtype) for such arrow types.

For example, if you have one array of decimal and one of float, the common dtype could be float (not sure we want to do this, but let's assume for the example). With the current implementation, the numpy_dtype for those extension dtypes will be np.dtype(object) and np.dtype(float). The common dtype for that will always be np.dtype(object), which means that concatting such columns will result in a cast to object dtype, instead of casting to/preserving the float dtype.

So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.

Agreed, and makes sense that this shouldn't be object dtype in the long term. Would be great if this eventually follows pyarrow's type coercion rules if there is one :)

@mroeschke
Copy link
Member Author

In [30]: arr1.dtype == arr2.dtype

Awesome thanks for the second review.

I was able to address the dtype __eq__ comparison by adding pyarrow_dtype to _metadata. I am hoping this will be tested once I add the other base extension tests in subsequent PRs.

@mroeschke
Copy link
Member Author

Updated & all green

@jreback jreback merged commit f30c7d7 into pandas-dev:main Jun 9, 2022
@mroeschke mroeschke deleted the enh/arrowdtype_support branch June 9, 2022 18:57
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
self._dtype = ArrowDtype(self._data.type)

@classmethod
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke i just tried the following and got an ArrowInvalid exception

arr = pa.array([1, 2, 3])
ea = pd.core.arrays.ArrowExtensionArray._from_sequence(arr)

should this work?

update: looks like just __init__ works fine here. still surprising that from_sequence doesnt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to #48238, I hadn't really anticipated users passing pyarrow arrays but I suppose this should be supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants