REF: put EA concat logic in _concat_arrays #33535
Thanks for looking at this! If we are refactoring this, I am wondering whether we should directly try to do something more general, such as Tom's negotiation proposal.
Do we need all three from pandas' point of view? I know you already mentioned we could do without the existing
Thinking out loud here: what if we have a
Concatenating a list of arrays of any types could then be something like this in pseudo-code:

    def concat_array(arrays):
        types = {x.dtype for x in arrays}
        for typ in types:
            res = typ.__concat_arrays__(arrays, types)
            if res is not NotImplemented:
                break
        else:
            # no dtype knew how to handle it, fallback to object dtype
            res = np.concatenate([arr.astype(object) for arr in arrays])
        return res

The logic of which types to coerce on concatenation and which not can then be fully in the EA (and it uses the order of the arrays in case multiple EAs both can "handle" each other, but I don't know if we have this case with our own dtypes). And the special case of "same dtype" can easily be detected by the EA if the passed set of dtypes has length 1. So the big difference with
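The pseudo-code above can be made runnable with toy stand-ins. This is only a sketch under stated assumptions: `ToyArray` and `SelfOnlyDtype` are hypothetical classes, not pandas API, and `__concat_arrays__` is the hook name taken from the pseudo-code, not an existing method. The dtypes are collected in first-seen order (rather than in a set) so that the order of the arrays decides which dtype is asked first.

```python
import numpy as np


class ToyArray:
    """Hypothetical stand-in for an ExtensionArray: values plus a dtype."""

    def __init__(self, values, dtype):
        self.values = np.asarray(values)
        self.dtype = dtype

    def astype(self, dtype):
        # Return a plain ndarray, enough for the object-dtype fallback.
        return self.values.astype(dtype)


class SelfOnlyDtype:
    """Hypothetical dtype that only knows how to concatenate with itself."""

    def __concat_arrays__(self, arrays, dtypes):
        if len(dtypes) == 1:
            # The "same dtype" special case: exactly one dtype was passed.
            return ToyArray(np.concatenate([a.values for a in arrays]), self)
        return NotImplemented


def concat_array(arrays):
    # Unique dtypes in order of first appearance in `arrays`.
    dtypes = list(dict.fromkeys(arr.dtype for arr in arrays))
    for dtype in dtypes:
        res = dtype.__concat_arrays__(arrays, dtypes)
        if res is not NotImplemented:
            return res
    # No dtype knew how to handle the mix: fall back to object dtype.
    return np.concatenate([arr.astype(object) for arr in arrays])
```

With two arrays sharing one dtype instance the dtype handles the concatenation itself; with two different dtype instances both decline and the result is an object-dtype ndarray.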
I agree we will likely land on something like this long-term.
I expect we could get away with just
The first thing that comes to mind is in
I like the general idea. It needs a way of choosing what order to iterate over
Back at the level of this PR, are we in agreement that the logic currently in e.g.
Yes, but that is the "internal" organization in different functions within the array-specific module (so eg internal to
Yes, I think it's a good change to move those to the array modules. I would personally not put them in methods though, but just in functions. For code organisation in those files, I think it would be better to just use functions, certainly if they are not part of an EA interface (which would be a reason to have them as methods).
Having
But the same simplification can be done with a function instead of a method, no? The only difference is calling it from the class versus importing it from the module and calling the function (for sure, that's one extra line for the import). The main reason I suggested that is that right now these are methods that are completely independent of the class (eg But anyway, since this might all be relatively short-lived depending on the general concat-EA-interface discussion, it also doesn't matter too much ;)
for other in to_union[1:]
]
new_codes = np.concatenate(codes)
return Categorical._concat_same_dtype(
We could also move the whole of union_categoricals to the categorical array module?
that would be my preference too, but trying to keep the already-broad scope/diff limited
@@ -95,17 +95,23 @@ def is_nonempty(x) -> bool:
_contains_datetime = any(typ.startswith("datetime") for typ in typs)
_contains_period = any(typ.startswith("period") for typ in typs)

from pandas.core.arrays import Categorical, SparseArray, datetimelike as dtl
And to make my suggestion more concrete: instead of importing Categorical here, it would be from pandas.core.arrays.categorical import _concat_arrays as concat_categorical
(or whatever name we give it)
To put it another way: if we didn't have the existing I will move my general comments about the concat protocol to #22994
On SparseArray it is a legitimate classmethod.
You could still easily write that as a function that calls SparseArray (there are no subclasses for which this needs to be generic). But OK, as mentioned, we are going to refactor this anyhow later, so I won't further push for it if you prefer the class methods.
I appreciate it. Given the amount of disagreement we're having in other threads, a little bit of compromise goes a long way towards keeping spirits up. (same idea behind being encouraging in #33561)
Yeah, I thought I have to choose my fights ;)
@jbrockmendel Tom and I have been further discussing the general protocol in #22994, and I did a prototype for that in #33607 now. Can you have a look at that? As it can potentially make (part of) this PR obsolete (eg it removes
Closing in favor of #33607
cc @jorisvandenbossche @TomAugspurger per discussion in #32586 (among others) about _concat_same_type, this is a proof of concept for a 3-method solution:
EA._concat_same_dtype --> require same type and dtype; DTA/PA do this now
EA._concat_same_type --> require same type but not necessarily same dtype; we could do without this, but since it's already in the API...
EA._concat_arrays --> any ndarray/EAs
For example, dtypes.concat._concat_sparse naturally becomes SparseArray._concat_arrays. The middle chunk of union_categoricals becomes Categorical._concat_same_dtype.
Everything described above is just a refactor, putting logic in more reasonable places. The benefit interface-wise is that the dispatching in concat_compat looks like
ATM the order Categorical -> DTA/TDA/PA -> Sparse is hard-coded, but we could generalize this either with a negotiation logic like Tom described or with something simpler like defining
EA.__concat_priority__ = 1, Categorical.__concat_priority__ = 1000, [...]
and the dispatch becomes:
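A minimal sketch of how such priority-based dispatch could work, using toy stand-in classes rather than pandas' real arrays. Only `__concat_priority__` and `_concat_arrays` come from the description above; `ToyEA`, `ToyCategorical`, and `concat_by_priority` are hypothetical names, and this is an illustration under those assumptions, not the code from the PR.

```python
class ToyEA:
    """Hypothetical base array with the lowest concat priority."""

    __concat_priority__ = 1

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _concat_arrays(cls, arrays):
        # Toy behaviour: flatten all values and wrap them in the winning class.
        return cls([v for arr in arrays for v in arr.values])


class ToyCategorical(ToyEA):
    """Hypothetical array type that should win over the base class."""

    __concat_priority__ = 1000


def concat_by_priority(arrays):
    # The array type with the highest priority handles the whole list.
    # On ties, max() keeps the first type encountered.
    winner = max(
        (type(arr) for arr in arrays),
        key=lambda klass: klass.__concat_priority__,
    )
    return winner._concat_arrays(arrays)
```

Compared with the hard-coded Categorical -> DTA/TDA/PA -> Sparse order, the ordering here lives on the classes themselves, so a new array type would only need to declare its own priority.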