Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: put EA concat logic in _concat_arrays #33535

Closed
wants to merge 8 commits into from

Conversation

jbrockmendel
Copy link
Member

cc @jorisvandenbossche @TomAugspurger per discussion in #32586 (among others) about _concat_same_type, this is a proof of concept for a 3-method solution:

  • EA._concat_same_dtype --> require same type and dtype; DTA/PA do this now
  • EA._concat_same_type --> require same type but not necessarily same type; we could do without this, but since its already in the API...
  • EA._concat_arrays --> any ndarray/EAs

For example, dtypes.concat._concat_sparse naturally becomes SparseArray._concat_arrays. The middle chunk of union_categoricals becomes Categorical._concat_same_dtype.

Everything described above is just a refactor, putting logic in more reasonable places. The benefit interface-wise is that the dispatching in concat_compat looks like

if we_have_any_categoricals:
    return Categorical._concat_arrays(to_concat)
elif we_have_any_datetimelike:
    return DatetimeLike._concat_arrays(to_concat)
elif we_have_any_sparse:
    return SparseArray._concat_arrays(to_concat)
[...]

ATM the order Categorical -> DTA/TDA/PA -> Sparse is hard-coded, but we could generalize this either with a negotiation logic like Tom described or with something simpler like defining EA.__concat_priority__ = 1, Categorical.__concat_priority__ = 1000, [...] and the dispatch becomes:

eas = {type(x) for x in to_concat if isinstance(x, ExtensionArray)}
eas = list(eas)
eas.sort(lambda x: x.__concat_priority__)

if eas:
    return eas[-1]._concat_arrays(to_concat)

@jbrockmendel jbrockmendel added the Refactor Internal refactoring of code label Apr 14, 2020
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 14, 2020

Thanks for looking at this!

If refactoring this, I am wondering if we should not directly try to do something more general such as Tom's negotation proposal.
Although a __concat_priority__ might be simpler, I think it is very hard to get right as EA author: you don't necessarily know which numbers others EAs use, so which number to set? Additionally, it might also not be one "linear" order.

this is a proof of concept for a 3-method solution:

Do we need all three from pandas' point of view? I know you already mentioned we could do without the existing _concat_same_type if the others are introduced. But do we actually need _concat_same_dtype?
In principle, the EA could be responsible of handling this as a special case in EA._concat_arrays if it wants ? Or are there cases in pandas where it would be benefical to be able to call EA._concat_same_dtype instead of EA._concat_arrays? (maybe knowing in advance the resulting dtype will be the same? I don't know if that is needed)

Thinking out loud here: what if we have a ExtensionDtype.__concat__(self, arrays, dtypes) that either returns a concatenated array if it decides it knows how to concat those arrays, or otherwise returns NotImplemented.
So this is kind of a hybrid between EA._concat_arrays and the ExtensionDtype.get_concat_dtype from #22994 (comment)

Concatenating a list of arrays of any types could then be something like this in pseudo-code:

def concat_array(arrays):
    types = {x.dtype for x in arrays}

    for typ in types:
        res = typ.__concat_arrays__(arrays, types)
        if res is not NotImplemented:
            break
    else:
        # no dtype knew how to handle it, fallback to object dtype
        res = np.concatenate([arr.astype(object) for arr in arrays])
    return res

The logic of which types to coerce on concatenation and which not can then be fully in the EA (and it uses the order of the arrays in case multiple EAs both can "handle" each other, but I don't know if we have this case with our own dtypes). And the special case of "same dtype" can easily be detected by the EA if the passed set of dtypes has length 1.

So the big difference with get_concat_dtypes in #22994 (comment) is that this directly gives you the array, and not the dtype.
Do we think it is important to be able to know the resulting dtype in advance, before actually doing the concatenation?

@jbrockmendel
Copy link
Member Author

If refactoring this, I am wondering if we should not directly try to do something more general such as Tom's negotation proposal.

I agree we will likely land on something like this long-term.

Do we need all three from padas' point of view? I know you already mentioned we could do without the existing _concat_same_type if the others are introduced. But do we actually need _concat_same_dtype?

I expect we could get away with just _concat_arrays. For our internal EAs i think the 3-method layout is nicer than having the somewhat scattered logic we have now (and mis-nomered logic for Categorical and DTA/PA._concat_same_type).

(maybe knowing in advance the resulting dtype will be the same? I don't know if that is needed)

The first thing that comes to mind is in internals.concat where I think knowing the return type would allow us to do something like make_block_same_class instead of make_block. That's pretty minor.

Concatenating a list of arrays of any types could then be something like this in pseudo-code:

I like the general idea. It needs a way of choosing what order to iterate over types in. e.g. ATM in concat_compat things depend on Categorical being checked before other EAs.

Back at the level of this PR, are we in agreement that the logic currently in e.g. dtypes.concat._concat_sparse makes more sense as a SparseArray method, and similar for the other pandas-internal EAs that have their logic implemented in dtypes.concat?

@jbrockmendel jbrockmendel added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 14, 2020
@jorisvandenbossche
Copy link
Member

I expect we could get away with just _concat_arrays. For our internal EAs i think the 3-method layout is nicer than having the somewhat scattered logic we have now

Yes, but that is the "internal" organization in different functions within the array-specific module (so eg internal to arrays/categorical.py). For sure there it can make sense to split it in different functions. But I am mainly thinking in "interface" terms, eg for the purpose of implementing pd.concat / internals.concat

Back at the level of this PR, are we in agreement that the logic currently in e.g. dtypes.concat._concat_sparse makes more sense as a SparseArray method, and similar for the other pandas-internal EAs that have their logic implemented in dtypes.concat?

Yes, I think that's a good change to move those to the array modules. I would personally not put it in methods though, but just in functions. For code organisation in those files, I think it would be better to just use functions, certainly if they are not part of an EA interface (which would be a reason to have them as methods)

@jbrockmendel
Copy link
Member Author

would personally not put it in methods though, but just in functions

Having _concat_arrays defined on SparseArray/DTA/PA/Categorical will allow for a nice simplification+clarification in concat_compat. How strong is your preference for functions over (private) methods?

@jorisvandenbossche
Copy link
Member

Having _concat_arrays defined on SparseArray/DTA/PA/Categorical will allow for a nice simplification+clarification in concat_compat

But the same simplification can be done as function vs method, no? The only difference is calling it from the class, or needing to import it from the module and calling the function? (for sure, that's one line extra to do the import).

The main reason that I suggested that, is that right now, it are methods that are completely independent from the class (eg _concat_arrays is a class method, but it's actually a static method as it is not using the class). And I mean code-wise (technically they can be functions), of course organisation-wise they can belong in that file.
So no very strong preference right now, but long term I think it will be easier to put them in functions separate from the class definition (it also keeps the class body shorter). Because also long term we will probably not call those _concat_arrays/_concat_same_dtype/_concat_same_type from the concat code (at least not all of them), we will rather use the EA interface and those three functions will be purely private details in the array module. Which even gives less reason to have them as class methods vs plain functions. So therefore, I would personally already put them in functions right now (they can still be wrapped in a class method for short term usage if you prefer that over importing them as functions).

But anyway, since this might all be relatively short-lived depending on the general concat-EA-interface discussion, it also doesn't matter too much ;)

for other in to_union[1:]
]
new_codes = np.concatenate(codes)
return Categorical._concat_same_dtype(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also move the full of union_categoricals do the categorical array module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be my preference too, but trying to keep the already-broad scope/diff limited

@@ -95,17 +95,23 @@ def is_nonempty(x) -> bool:
_contains_datetime = any(typ.startswith("datetime") for typ in typs)
_contains_period = any(typ.startswith("period") for typ in typs)

from pandas.core.arrays import Categorical, SparseArray, datetimelike as dtl
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to make my suggestion more concrete: instead of import Categorical here, it would be from pandas.core.arrays.categorical import _concat_arrays as concat_categorical (or whathever name we give it)

@jorisvandenbossche
Copy link
Member

To put it another way: if we didn't have the existing _concat_same_type class method, you might not have thought to put the other methods as classmethods on the class. And since we might want to get rid of _concat_same_type, we don't necessarily need to model the new functions after that one.

I will move my general comments about the concat protocol to #22994

@jbrockmendel
Copy link
Member Author

(eg _concat_arrays is a class method, but it's actually a static method as it is not using the class)

On SparseArray it is a legitimate classmethod.

@jorisvandenbossche
Copy link
Member

You could still easily write that as a function that calls SparseArray (there are no subclasses for which this needs to be generic).

But OK, as mentioned, we are going to refactor this anyhow later, so I won't further push for it if you prefer the class methods.

@jbrockmendel
Copy link
Member Author

But OK, as mentioned, we are going to refactor this anyhow later, so I won't further push for it if you prefer the class methods.

I appreciate it. Given the amount of disagreement we're having in other threads, a little bit of compromise goes a long way towards keeping spirits up. (same idea behind being encouraging in #33561)

@jorisvandenbossche
Copy link
Member

Yeah, I thought I have to choose my fights ;)

@jorisvandenbossche
Copy link
Member

@jbrockmendel Tom and I have been further discussing the general protocol in #22994, and I did a prototype for that in #33607 now. Can you have a look at that? As it can potentially make (part of) this PR obsolete (eg it removes concat_categorical alltogether, not just moves it to Categorical)

@jbrockmendel
Copy link
Member Author

Closing in favor of #33607

@jbrockmendel jbrockmendel deleted the ea-concat branch April 17, 2020 14:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Internal refactoring of code Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants