ENH: general concat with ExtensionArrays through find_common_type #33607
Conversation
pandas/core/dtypes/concat.py
Outdated
if (
    is_categorical_dtype(arr.dtype)
    and isinstance(dtype, np.dtype)
    and np.issubdtype(dtype, np.integer)
):
    # problem case: categorical of int -> gives int as result dtype,
    # but categorical can contain NAs -> fall back to object dtype
    try:
        return arr.astype(dtype, copy=False)
    except ValueError:
        return arr.astype(object, copy=False)
This complication is to try to preserve some of the value-dependent behaviour of Categorical (in the case of integer categories: are missing values present or not?).
E.g. when concatenating an integer categorical with an integer series:
pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3, 4])])
-> results in int dtype
pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3, 4])])
-> results in object dtype
Currently, when concatenating, a Categorical with integer categories gets converted to an int numpy array if no missing values are present, but to an object numpy array if missing values are present (to preserve the integers).
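Spelled out as a runnable snippet (this only restates the behaviour described above; exact output can differ between pandas versions):

import pandas as pd

res1 = pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3, 4])])
print(res1.dtype)  # int64 -> no missing values in the categorical

res2 = pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3, 4])])
print(res2.dtype)  # object -> NA present, so integers are preserved as objects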
I needed to make one change to the tests related to categorical (another value-dependent behaviour), in the examples involving an integer categorical and float arrays. So I would say that the new behaviour (always returning float in those examples) is better.
pandas/core/dtypes/base.py
Outdated
@@ -322,3 +323,33 @@ def _is_boolean(self) -> bool:
        bool
        """
        return False

    def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Mmm, can we keep the return type as ExtensionDtype? Do you envision cases where we'd like to return a plain NumPy dtype?
Oh... I suppose tz-naive DatetimeArray might break this, since it wants to return a NumPy dtype...
Yes, that was my first thought as well. But right now, e.g. Categorical can end up with any kind of numpy dtype (depending on the dtype of its categories).
As long as not all dtypes have an EA version yet, I don't think it is feasible to require ExtensionDtype here.
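A quick illustration of that Categorical point, using only public API (behaviour as of this discussion):

import numpy as np
import pandas as pd

cat = pd.Categorical([1, 2, 3])
print(np.asarray(cat).dtype)  # int64 -> a plain numpy dtype, driven by the categories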
I ran into one other hairy question:
Right now I added some special cases in …
pandas/core/arrays/integer.py
Outdated
@@ -95,6 +95,15 @@ def construct_array_type(cls) -> Type["IntegerArray"]:
        """
        return IntegerArray

    def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
Should this be common_type or common_dtype? We've been loose about this distinction so far, and I think it has caused ambiguity.
I don't care that much. I mainly used "type" because it is meant to be used in find_common_type.
(The find_common_type name is inspired by the numpy function, and that one actually handles both dtypes and scalar types, which I assume is the reason for the name. The pandas version, though, doesn't really make the distinction, so it could have been named "find_common_dtype".)
Renamed to "common_dtype" instead of "common_type". The internal function that uses this is still find_common_type, but that name from numpy is actually a misnomer here, since we are only dealing with dtypes, not scalar types.
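For illustration, a hypothetical sketch of what such a hook could compute for the nullable integer dtypes (this is not the PR's actual implementation; the helper name is made up, and it relies on the numpy_dtype attribute of the masked dtypes plus numpy's promotion rules):

from typing import List, Optional

import numpy as np
import pandas as pd


def common_integer_dtype_sketch(dtypes: List[object]) -> Optional[object]:
    """Promote a mix of nullable and plain numpy signed-integer dtypes."""
    np_dtypes = []
    for d in dtypes:
        if isinstance(d, np.dtype) and d.kind == "i":
            np_dtypes.append(d)
        elif getattr(d, "numpy_dtype", None) is not None and d.numpy_dtype.kind == "i":
            # e.g. pandas' nullable Int64Dtype exposes .numpy_dtype
            np_dtypes.append(d.numpy_dtype)
        else:
            return None  # signal "no common dtype"; the caller falls back to object
    common = np.result_type(*np_dtypes)
    # wrap the numpy result back into a nullable dtype, e.g. int64 -> "Int64"
    return pd.api.types.pandas_dtype(common.name.capitalize())


# e.g. common_integer_dtype_sketch([pd.Int8Dtype(), np.dtype("int64")]) -> Int64Dtype()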
Thanks for indulging me on this nitpick
Generally good concept, will look in more detail later.
    and arr.dtype.kind in ["m", "M"]
    and dtype is np.dtype("object")
):
    # wrap datetime-likes in EA to ensure astype(object) gives Timestamp/Timedelta
+1 quality comment
-        elif _contains_datetime or "timedelta" in typs or _contains_period:
+        elif _contains_datetime or "timedelta" in typs:
             return concat_datetime(to_concat, axis=axis, typs=typs)
If we do the DTA/TDA casting above, and do isinstance(obj, ExtensionArray) checks, can all of the dt64/td64 cases be handled by the EA code above?
I don't think so, because they are not using ExtensionDtype.
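A quick check with public API of why tz-naive datetime64/timedelta64 data doesn't go through the ExtensionDtype path (it carries a plain numpy dtype):

import numpy as np
import pandas as pd

ser = pd.Series(pd.date_range("2020-01-01", periods=3))
print(ser.dtype)                                                # datetime64[ns]
print(isinstance(ser.dtype, np.dtype))                          # True
print(isinstance(ser.dtype, pd.api.extensions.ExtensionDtype))  # False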
I like this idea, will want to give it another look after more caffeine. Haven't looked at the tests or Categorical-specific nuances yet.
-        return concat_categorical(to_concat)
+        return union_categoricals(to_concat)
IIRC only a relatively small part of the logic of concat_categorical/union_categoricals is actually needed here. I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around (since union_categoricals handles a lot of cases). Could be considered orthogonally to this PR.
I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around
Yes, that's indeed a good idea (union_categoricals does way more, most of which is not needed here).
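For context, a minimal sketch of the small piece of logic referred to here, assuming all inputs already share identical categories and ordered flags (the function name is made up; the general case that union_categoricals covers is much broader):

import numpy as np
import pandas as pd


def concat_same_dtype_categoricals(to_concat):
    """Concatenate Categoricals that all share the exact same dtype."""
    first = to_concat[0]
    # with identical categories, concatenating the integer codes is enough;
    # -1 codes (missing values) carry through unchanged
    codes = np.concatenate([cat.codes for cat in to_concat])
    return pd.Categorical.from_codes(
        codes, categories=first.categories, ordered=first.ordered
    )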
Are you planning to update this, or is that a topic for a separate PR?
Right now, this is the case, yes. What do you mean by making that explicit? (Just better documenting it?)
It could be, but personally I don't see a reason not to make it an instance method.
This is the only sensible return value in such a case (an EA doing anything different will behave very strangely). But what do you mean by "require"? Better document it? Or we could have a base extension test that asserts it? But see also my question above (#33607 (comment)): in principle we don't even need to call this.
Can you explain this a bit more? I am not aware of any troubles regarding that.
I have also been reading up on the numpy dtype proposals regarding this. Specifically, this part of the draft NEP relates to "common dtypes": https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst#common-dtype-operations It's actually quite similar in idea, but with some notable differences.
The first is an interesting idea. The second might only complicate things for current pandas (although I understand the desire to split it up into steps to avoid duplicate logic in different parts).
"trouble" in that we haven't been consistent about where some logic belongs. e.g. DTA._concat_same_type requires a single dtype, not just same-type. |
With this PR, … Does that answer your concern? Otherwise, you will need to clarify further what you mean.
Close enough.
I made a bunch of updates (renamed to use "dtype" instead of "type", documented the public API change, updated the EA interface docs, and added a base extension test).
Thanks @jorisvandenbossche, very nice.
And thanks all for the review!
Exploring what is being discussed in #22994. @TomAugspurger your idea seems to be working nicely! (It almost removes as much code as it adds (apart from tests/docs), and fixes the EA concat bugs ;))
A few notes compared to the proposal in #22994:
- find_common_type function: decided to use this as the "get_concat_dtype", since it seems this does what is needed for concat
- ExtensionDtype._get_common_type: method that is used in pandas' find_common_type function to dispatch the logic to the extension type (see the rough sketch below)

What I already handled:
- Categorical (replacing the concat_categorical helper). This turned up a few cases where we have value-dependent behaviour right now, which we can't easily preserve (mainly regarding NaNs and int dtype, see below)

Still need to handle sparse (those have failing tests now) and maybe datetime, and check other failures.