ENH: general concat with ExtensionArrays through find_common_type #33607

jorisvandenbossche · 2020-04-17T10:05:51Z

Exploring what is being discussed in #22994. @TomAugspurger your idea seems to be working nicely! (it almost removes as much code than it adds (apart from tests/docs), ànd fixes the EA concat bugs ;))

Few notes compared to proposal in #22994:

Since we already have a find_common_type function, decided to use this as the "get_concat_dtype", since it seems this does what is needed for concat
Extending the EA interface with a ExensionDtype._get_common_type method that is used in pandas' find_common_type function to dispatch the logic to the extension type

What I already handled:

general protocol, using this for concat with axis=0 (eg concatting series) and using it as example for IntegerDtype
handling categoricals this way (which removes the concat_categorical helper). This turned up a few cases where we have value-dependent behaviour right now, which we can't easily preserve (mainly regarding NaNs and int dtype, see below)

Still need to handle sparse (those has failing tests now) and maybe datetime, and check other failures.

pandas/core/arrays/integer.py

jorisvandenbossche · 2020-04-17T10:17:38Z

pandas/core/dtypes/concat.py

+                if (
+                    is_categorical_dtype(arr.dtype)
+                    and isinstance(dtype, np.dtype)
+                    and np.issubdtype(dtype, np.integer)
+                ):
+                    # problem case: categorical of int -> gives int as result dtype,
+                    # but categorical can contain NAs -> fall back to object dtype
+                    try:
+                        return arr.astype(dtype, copy=False)
+                    except ValueError:
+                        return arr.astype(object, copy=False)


This complication is to try to preserve some of the value-dependent behaviour of Categorical (in case of integer categories: missing values present or not?)

Eg when concatting integer categorical with integer series:

pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3, 4])]) -> results in int dtype pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3, 4])]) -> results in object dtype

Currently, when concatting, a Categorical with integer categories gets converted to int numpy array if no missing values are present, but object numpy array if missing values are present (to preserve the integers)

jorisvandenbossche · 2020-04-17T10:22:43Z

I needed to make one change to the tests related to categorical (another value-dependent behaviour). Assume those examples involving integer categorical and float arrays:

In [13]: pd.concat([pd.Series([1, None], dtype="category"), pd.Series([3.5, 4.5])])    
Out[13]: 
0      1
1    NaN
0    3.5
1    4.5
dtype: object

In [14]: pd.concat([pd.Series([1, 2], dtype="category"), pd.Series([3.5, 4.5])])    
Out[14]: 
0    1.0
1    2.0
0    3.5
1    4.5
dtype: float64

With a find_common_type, we will always find float. But the old behaviour is that the integer categorical becomes object if there are missing values present.
But I would say: if you are concatting with a float anyway, it doesn't make sense to try to provide the "integer categories" by keeping it object instead of casting to float.

So I would say that the new behaviour (always returning float in the above two examples) is better.

pandas/core/arrays/integer.py

TomAugspurger · 2020-04-17T10:57:47Z

pandas/core/dtypes/base.py

@@ -322,3 +323,33 @@ def _is_boolean(self) -> bool:
        bool
        """
        return False
+
+    def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:


Mmm can we keep the return type as ExtensionDtype? Do you envision cases where we'd like to return a plain NumPy dtype?

Oh... I suppose tz-naive DatetimeArray might break this, since it wants to return a NumPy dtype...

Yes, that was my first thought as well. But, right now, eg Categorical can end up with any kind of numpy dtype (depending on the dtype of its categories).

As long as not yet all dtypes have a EA version, I don't think it is feasible to require ExtensionDtype here

jorisvandenbossche · 2020-04-17T12:29:38Z

I ran into one other hairy question:

Should _get_common_type be able to handle "not fuly initialized" dtype objects? Like a CategoricalDtype without categories? (which we have to handle "category" string).
And if so, what is the expected behaviour?

Right now I added some special cases in Categorical._get_common_type to have the find_common_type tests passing, but not sure it is necessarily doing the correct thing in a general way.

jbrockmendel · 2020-04-17T14:34:07Z

pandas/core/arrays/integer.py

@@ -95,6 +95,15 @@ def construct_array_type(cls) -> Type["IntegerArray"]:
        """
        return IntegerArray

+    def _get_common_type(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:


should this be common_type or common_dtype? we've been loose about this distinction so far and i think it has caused amibiguity

I don't care that much. I mainly used "type", because it is meant to be used in find_common_type.

(that find_common_type name is inspired on the numpy function, and that one actually handles both dtypes and scalar types, which I assume is the reason for the name. The pandas version, though, doesn't really make the distinction, so could have been named "find_common_dtype")

Renamed to "common_dtype" instead of "common_type". The internal function that uses this is still find_common_type, but that name from numpy is actually a misnomer here, since we are only dealing with dtypes, and not scalar types.

Thanks for indulging me on this nitpick

pandas/core/internals/concat.py

jreback

generally good concept will look in more detail later

pandas/core/arrays/categorical.py

pandas/core/arrays/integer.py

pandas/core/dtypes/base.py

pandas/core/dtypes/cast.py

pandas/core/dtypes/concat.py

jbrockmendel · 2020-04-17T14:47:16Z

pandas/core/dtypes/concat.py

+        and arr.dtype.kind in ["m", "M"]
+        and dtype is np.dtype("object")
+    ):
+        # wrap datetime-likes in EA to ensure astype(object) gives Timestamp/Timedelta


+1 quality comment

pandas/core/dtypes/concat.py

jbrockmendel · 2020-04-17T14:51:30Z

pandas/core/dtypes/concat.py


-    elif _contains_datetime or "timedelta" in typs or _contains_period:
+    elif _contains_datetime or "timedelta" in typs:
        return concat_datetime(to_concat, axis=axis, typs=typs)


if we do the DTA/TDA casting above, and do isinstance(obj, ExtensionArray) checks, can all of the dt64/td64 cases be handled by the EA code above?

I don't think so, because they are not using ExtensionDtype.

jbrockmendel · 2020-04-17T14:53:21Z

I like this idea, will want to give it another look after more caffeine. Haven't looked at the tests or Categorical-specific nuances yet.

jbrockmendel · 2020-04-17T15:01:36Z

pandas/core/arrays/categorical.py


-        return concat_categorical(to_concat)
+        return union_categoricals(to_concat)


IIRC only a relatively small part of the logic of concat_categorical/union_categoricals is actually needed here. I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around (since union_categoricals handles a lot of cases). Could be considered orthogonally to this PR.

I'd prefer for that to live here and for union_categoricals to call it, rather than the other way around

Yes, that's indeed a good idea (union_categoricals does way much more that is all not needed here)

are you planning to update this, or is that a topic for a separate PR?

jorisvandenbossche · 2020-04-24T13:49:37Z

Are we implicitly assuming that self.dtype is a member of dtypes?

Right now, this is the case yes. What do you mean with making that explicit? (just better documenting it?)

Should _get_common_type be a classmethod?

It could be, but personally I don't really see to not make it an instance method. Yes, self will always be in the list of dtypes (see question above), but I can assume that for the implementation it can be useful to know self (instead of needing to infer it from the list of dtypes). Not that this should impact the result, I only think it can be handy in the implementation (but this is rather hypothetical feeling at the moment).
The main thing is: we always call this from the instance, so I don't see much value in making it a class method instead of instance method.

In the case (mentioned in a comment on the base class) where dtypes is a singleton, can we require that the return value be that dtype?

This is the only sensible return value in such a case (an EA doing anything different will behave very strange). But what do you mean with "require"? Better document it? Or we could have a base extension test that asserts it?

But see also my question above (#33607 (comment)): in principle we don't even need to call this _get_common_type method if there is only one dtype in the list. In the concat code / find_common_type code, we can simply use this dtype and shortcut calling _get_common_type.

Can we nail down the common_type vs common_dtype semantics before pushing this through? I feel like that has been a major cause of trouble with the existing concat code.

Can you explain this a bit more? I am not aware of any troubles regarding that.

jorisvandenbossche · 2020-04-24T14:22:36Z

I have also been reading up on numpy dtype proposals regarding this. Specifically this part of the draft NEP relates to "common dtypes": https://github.com/numpy/numpy/blob/f07e25cdff3967a19c4cc45c6e1a94a38f53cee3/doc/neps/nep-0042-new-dtypes.rst#common-dtype-operations

It's actually quite similar in idea, but some notable differences:

The dtype special method is limited to a single other dtype (they have Dtype.__common_dtype__(self, other) where other is a single dtype), instead of passing the list of all dtypes
They split the "common dtype" special method in two: determine the common class (from 2 different dtype classes), and determine the common instance (from 2 instances of the same class). And it is actually split in 3, because those 2 steps are then combined with as cast-lookup.

The first is an interesting idea. The second might only complicate it more for current pandas (although I understand its desire to split it up on steps to avoid duplicate logic in different parts).
And I also don't know if first doing it on the class level will work for pandas' current promotion logic. Eg the common dtype of Categorical + int dtype depends on the dtype of the categories of the Categorical.

jbrockmendel · 2020-04-24T16:14:26Z

Can we nail down the common_type vs common_dtype semantics before pushing this through? I feel like that has been a major cause of trouble with the existing concat code.

Can you explain this a bit more? I am not aware of any troubles regarding that.

"trouble" in that we haven't been consistent about where some logic belongs. e.g. DTA._concat_same_type requires a single dtype, not just same-type.

jorisvandenbossche · 2020-04-24T16:21:24Z

With this PR, EA._concat_same_type will only be called with arrays of all the same dtype.
(and which needs to be better documented now, as that was not clear from its docstring what the expectations were before)

Does that answer your concern? Otherwise you will still need to clarify further what you mean.

jbrockmendel · 2020-04-24T17:14:08Z

Does that answer your concern? Otherwise you will still need to clarify further what you mean.

Close enough.

jorisvandenbossche · 2020-05-01T13:47:47Z

I made a bunch of updates (renamed to use "dtype" instead of "type", documented the public API change and updated the EA interface docs, added a base extension test).
Ready for another review (or approval ;)) round

pandas/core/dtypes/cast.py

jreback · 2020-05-02T16:29:49Z

thanks @jorisvandenbossche very nice

jorisvandenbossche · 2020-05-02T17:26:21Z

And thanks all for the review!

…ndas-dev#33607)

jorisvandenbossche added 2 commits April 17, 2020 11:53

initial find_common_type/_get_common_type + tests for IntegerDtype

3464e95

handle categoricals

b1d9d68

jorisvandenbossche mentioned this pull request Apr 17, 2020

REF: put EA concat logic in _concat_arrays #33535

Closed

jorisvandenbossche requested review from jbrockmendel and TomAugspurger April 17, 2020 10:08

jorisvandenbossche commented Apr 17, 2020

View reviewed changes

pandas/core/arrays/integer.py Show resolved Hide resolved

jorisvandenbossche commented Apr 17, 2020

View reviewed changes

TomAugspurger reviewed Apr 17, 2020

View reviewed changes

jorisvandenbossche added 2 commits April 17, 2020 14:12

handle sparse

bb398e7

handle non-initialized CategoricalDtype in find_common_type

83fdc91

handle datetimelike special case

7f2ac2a

jorisvandenbossche added API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Apr 17, 2020

jorisvandenbossche added this to the 1.1 milestone Apr 17, 2020