ENH: Incorporate ArrowDtype into ArrowExtensionArray #47034
Conversation
Yeah, this is what I had in mind.
Have this in a pretty good state now.
A few more comments on the tests.
I also noticed that the equality of the dtype is not working properly (which I assume also has some effect on other things):
In [24]: import pyarrow as pa
In [25]: from pandas.core.arrays.arrow import ArrowExtensionArray
In [26]: arr1 = ArrowExtensionArray(pa.array([1, 2, 3], pa.int64()))
In [27]: arr2 = ArrowExtensionArray(pa.array([1, 2, 3], pa.float64()))
In [28]: arr1.dtype
Out[28]: int64[pyarrow]
In [29]: arr2.dtype
Out[29]: double[pyarrow]
In [30]: arr1.dtype == arr2.dtype
Out[30]: True
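The session above suggests the dtype's equality is falling back to a comparison that ignores the wrapped pyarrow type. A minimal sketch (toy class, not pandas' actual implementation) of how ArrowDtype equality could compare the underlying pyarrow type, so that int64[pyarrow] and double[pyarrow] no longer compare equal:

```python
class ToyArrowDtype:
    """Toy stand-in for ArrowDtype; `pyarrow_type` is an assumed attribute name."""

    def __init__(self, pyarrow_type: str) -> None:
        self.pyarrow_type = pyarrow_type  # e.g. "int64" or "double"

    @property
    def name(self) -> str:
        return f"{self.pyarrow_type}[pyarrow]"

    def __eq__(self, other) -> bool:
        if isinstance(other, str):
            # ExtensionDtype convention: allow comparison against the string alias
            return other == self.name
        if isinstance(other, ToyArrowDtype):
            # Compare the underlying pyarrow types, not object identity
            return self.pyarrow_type == other.pyarrow_type
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.name)
```

With this, `ToyArrowDtype("int64") == ToyArrowDtype("double")` is False, matching the expectation in the session above.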
@@ -91,12 +140,9 @@ def _get_common_dtype(self, dtypes: list[DtypeObj]) -> DtypeObj | None:
]
Something to note somewhere (rather for a follow-up): the above will only work for pyarrow types that have a matching numpy dtype, which is not the case for all types (e.g. decimal, dictionary, date, nested types, etc.).
I defaulted the numpy_dtype to np.dtype(object) for those pyarrow types without a corresponding numpy type.
I am not sure that will help (but without tests that cover this it is hard to say). If .numpy_dtype returns object dtype, then the common dtype here will also be object dtype, and then pa.from_numpy_dtype below will fail (so this function will return None). That means there can never be a proper ArrowDtype common dtype (that is not object dtype) for such arrow types.
For example, if you have one array of decimal and one of float, the common dtype could be float (not sure we want to do this, but let's assume so for the example). With the current implementation, the numpy_dtype for those extension dtypes will be np.dtype(object) and np.dtype(float). The common dtype for that pair will always be np.dtype(object), which means that concatenating such columns will result in a cast to object dtype, instead of casting to/preserving the float dtype.
So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.
So at some point, we should probably include pyarrow-specific logic in here that doesn't rely on converting to a numpy dtype and numpy's notion of a common type.
Agreed, and it makes sense that this shouldn't be object dtype in the long term. It would be great if this eventually follows pyarrow's type coercion rules, if it has any :)
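The pyarrow-specific logic discussed above could be sketched roughly as follows. This is a toy illustration only: the promotion table and type names are assumptions made up for the decimal/float example, not pyarrow's (or pandas') actual coercion rules.

```python
# Hypothetical promotion table keyed by the set of input type names.
# Unknown combinations are absent, which maps to "no common ArrowDtype".
PROMOTIONS = {
    frozenset({"int64", "float64"}): "float64",
    frozenset({"decimal128", "float64"}): "float64",  # the example above
    frozenset({"decimal128", "int64"}): "decimal128",
}


def common_arrow_type(type_names):
    """Return a common pyarrow type name for the inputs, or None if unknown."""
    unique = frozenset(type_names)
    if len(unique) == 1:
        # All inputs share one type: that type is trivially the common one
        return next(iter(unique))
    # No matching rule yields None, signalling a fallback (currently: object)
    return PROMOTIONS.get(unique)
```

The point of the sketch is that the lookup operates on arrow types directly, so decimal + float can resolve to float without ever round-tripping through np.dtype(object).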
Awesome, thanks for the second review. I was able to address the dtype
Updated & all green
self._dtype = ArrowDtype(self._data.type)

@classmethod
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
@mroeschke I just tried the following and got an ArrowInvalid exception:
arr = pa.array([1, 2, 3])
ea = pd.core.arrays.ArrowExtensionArray._from_sequence(arr)
should this work?
Update: looks like just __init__ works fine here. Still surprising that _from_sequence doesn't.
Related to #48238. I hadn't really anticipated users passing pyarrow arrays, but I suppose this should be supported.
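One plausible shape for that support, shown with toy stand-in classes (FakeArrowArray and fake_pa_array are invented here; this is not the actual pandas/pyarrow code): have _from_sequence pass through inputs that are already arrow arrays instead of re-converting them, which is what lets __init__ succeed where _from_sequence raised.

```python
class FakeArrowArray:
    """Toy stand-in for pyarrow.Array."""

    def __init__(self, values):
        self.values = list(values)


def fake_pa_array(obj):
    """Toy stand-in for pa.array(): only accepts plain Python sequences."""
    if isinstance(obj, FakeArrowArray):
        # Mimics the ArrowInvalid seen above when re-converting an arrow array
        raise TypeError("ArrowInvalid: input is already an arrow array")
    return FakeArrowArray(obj)


class ToyArrowExtensionArray:
    def __init__(self, data):
        self._data = data  # __init__ stores the arrow array directly

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        if isinstance(scalars, FakeArrowArray):
            # Pass arrow arrays straight through instead of re-converting
            return cls(scalars)
        return cls(fake_pa_array(scalars))
```

With the isinstance check in place, `_from_sequence(arr)` and `ToyArrowExtensionArray(arr)` behave the same for an arrow-array input, matching the expectation in the comment above.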
Not fully user facing yet.
Supersedes #46972
cc @jbrockmendel, let me know if this is what you had in mind.