Incorrect behavior when concatenating multiple ExtensionBlocks with different dtypes #22994
Do we need more fine-grained control? I could imagine that in some cases an ExtensionArray (e.g. one with a parametrized dtype) would want a smarter way to concat arrays with different dtypes than just converting to object. Also, e.g. IntegerArray with int64 and int32 would not need to be converted to object?
Absolutely. We're even getting there with Sparse, since it would like to take non-sparse arrays and make them sparse, rather than going to object. Right now I think we just special-case sparse before getting to concat. I think we're coming up on the need for a general ...
Looking at the array function protocol as inspiration for the design seems like a good idea to me.
Are we OK with pushing this to 0.25?

As a proposal, we can have something like the following: iterate through the dtypes, calling a hook on each one to determine the result dtype.

```python
def get_concat_dtype(arrays):  # internal to pandas
    """
    Get the result dtype for concatenating many arrays.

    Parameters
    ----------
    arrays : Sequence[Union[numpy.ndarray, ExtensionArray]]

    Returns
    -------
    dtype : Union[ExtensionDtype, numpy.dtype]
        The NumPy dtype or ExtensionDtype to use for the concatenated
        array.
    """
    dtypes = {x.dtype for x in arrays}
    if len(dtypes) == 1:
        return next(iter(dtypes))
    seen = set()
    # iterate in order of `arrays`
    for arr in arrays:
        dtype = arr.dtype
        if dtype not in seen:
            seen.add(dtype)
            # this assumes it's an extension dtype, which isn't correct
            result_dtype = dtype.get_concat_dtype(dtypes)
            if result_dtype is not None:
                return result_dtype
    return np.dtype('object')


class ExtensionDtype:
    ...

    @classmethod
    def get_concat_dtype(cls, dtypes):
        # part of the extension array API
        return None
```

So for SparseDtype, we would return a ... Some questions:
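To make the proposed hook concrete, here is a minimal sketch. Note that `get_concat_dtype` is not an existing pandas API, and `ToyDtype` / `ToyIntDtype` are hypothetical stand-ins for extension dtypes; the point is only to show how a dtype could claim a common result dtype (e.g. widening integers) instead of letting everything fall back to object:

```python
class ToyDtype:
    """Hypothetical stand-in for an ExtensionDtype base class."""

    @classmethod
    def get_concat_dtype(cls, dtypes):
        # default: this dtype doesn't know how to combine -> signal with None
        return None


class ToyIntDtype(ToyDtype):
    """Hypothetical parametrized integer dtype (itemsize in bits)."""

    def __init__(self, itemsize):
        self.itemsize = itemsize

    # hashable/comparable so instances can live in the `dtypes` set
    def __eq__(self, other):
        return isinstance(other, ToyIntDtype) and self.itemsize == other.itemsize

    def __hash__(self):
        return hash(("ToyIntDtype", self.itemsize))

    @classmethod
    def get_concat_dtype(cls, dtypes):
        # combine integer dtypes of different widths into the widest one,
        # instead of falling back to object
        if all(isinstance(d, ToyIntDtype) for d in dtypes):
            return ToyIntDtype(max(d.itemsize for d in dtypes))
        return None


common = ToyIntDtype.get_concat_dtype({ToyIntDtype(32), ToyIntDtype(64)})
print(common.itemsize)  # 64
```

This mirrors the int64/int32 IntegerArray case mentioned earlier in the thread: same-family dtypes combine to a common dtype, anything else returns `None` so the driver can try the next dtype or fall back to object.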
Example from a duplicate issue: consider concatting two dataframes that both have a column with an extension dtype, but a different one (here string and nullable int):

This errors because in the concatenation, we have the following code:

pandas/pandas/core/internals/managers.py Lines 2021 to 2024 in 5c36aa1

and the ... The easy fix is to make ...
Moving some discussion from #33535 here.

Thinking out loud here: what if we have a ... Concatenating a list of arrays of any types could then be something like this in pseudo-code:

```python
def concat_array(arrays):
    types = {x.dtype for x in arrays}
    for typ in types:
        res = typ.__concat_arrays__(arrays, types)
        if res is not NotImplemented:
            break
    else:
        # no dtype knew how to handle it, fall back to object dtype
        res = np.concatenate([arr.astype(object) for arr in arrays])
    return res
```

The logic of which types to coerce on concatenation and which not can then be fully in the EA (and it uses the order of the arrays in case multiple EAs can both "handle" each other, but I don't know if we have this case with our own dtypes). And the special case of "same dtype" can easily be detected by the EA if the passed set of dtypes has length 1. So the big difference with ... It would also eliminate the existing ...
Just stepping back a bit: it seems like we have everything we need for the actual concatenation part. It's just the casting that's missing:

```python
dtype = get_result_dtype({x.dtype for x in arrays})
arrays = [x.astype(dtype) for x in arrays]
result = arrays[0]._concat_same_type(arrays)
```

i.e. do all the casting ahead of time, and then call `_concat_same_type`. This puts the majority of the complexity in ...
That's a good point (it still needs the ...). So that brings the question: would there be cases where it would be beneficial / required to only do the casting while concatting, instead of casting beforehand? Or, would there be cases where you might cast differently knowing the arrays and not just the dtypes? I suppose we want casting behaviour to be only dtype-dependent, and not value-dependent? (and similarly for concatting) I am trying to think of cases where we don't follow that right now with our own dtypes, but I can't directly think of something (e.g. categoricals only result in categoricals if the dtypes are equal; even things like int dtype / object consisting of ints result in object dtype and do not try to infer the objects).
The only potential use case I could come up with right now is if, as an EA, you want to be able to infer object dtype while concatting, for which you need to inspect the values to know whether it is possible to preserve the dtype. A potential behaviour like this (e.g. for external EAs) would not be possible with the proposal of getting the result dtype and casting before concatting. But, I suppose we are perfectly fine with "disabling" such potential behaviour? (If we want the behaviour to be dtype-dependent and not value-dependent, that's the logical consequence.)
Yeah, we don't really have anything :) We don't have a function to take a ...

I'd be fine with losing the behavior you posted. I really, really like having dtype stability :) That partially comes from Dask, where we aren't able to replicate that behavior. But I also just think it's a generally good thing to strive for.
Yes, I agree this is a good thing to strive for (and to be clear, I also don't think we actually have a case that conflicts with it right now internally; I think we already cleaned up concat-related dtype things quite a bit over the last few years).

We for sure need to fix the astype situation as well, but I think we don't need to wait on it to handle the concat situation. In many cases, the astype will either be to object dtype (which all EAs should already handle fine), or to a very "close" dtype (e.g. int8 to int64), which will mostly already work with the current astype, I think. So, that brings us to the "get_concat_dtype" method. Above you put a prototype like:

which I think is supposed to return None if it cannot concat, or the result dtype if it knows how to concat. Correct? So some questions related to this:
This we were just discussing, and I think we agree that we want type stability.
I think that's fine, and it will probably be the case in any solution we come up with. I am also not sure we actually have cases where this would happen right now with our own dtypes?
I suppose in principle it shouldn't matter (since "self" will also be in the list of dtypes). I think it still wouldn't hurt to make it an instance method (in case an EA implementation wants that), since we will always have the instance available, right? One potential use case might, for example, be fletcher, where they have a single dtype class and all the different dtypes are instances of that same class. Implementation-wise it might be helpful for them to have the instance, instead of needing to infer it from the list of dtypes.
Sorry for all the text :) Additional question: are we happy with ...

That last point might be too much "future thinking" though, if we don't yet have any specific use cases.
Agreed, doesn't matter which we use.
Agreed, let's not worry about that right now.
This feels like we're inviting order-dependent behavior, but yes, I think that's fine :)

Agreed, it should just be for EA authors. I'd prefer the ...

Ideally, I'd like to have similar logic / code backing ... That said, I recognize that I'm asking for a lot more than an easy way to concat.
Can you explain this a bit more? I understand that for concat, we just need to know the dtype to pass to astype and then use ...

You mean like: when doing ... Also, for astype, we want to be more liberal in which casting is allowed, while for concat we want to be more strict?
Sorry, I may not have been clear. I think I was thinking of ...
Yes, that seems correct.
@TomAugspurger I think this can be closed now (with #33607 for the actual protocol method, and some of the follow-up issues to ensure it is used).

Thanks!
In
pandas/pandas/core/internals/managers.py
Line 1638 in d430195
we check that we have one type of block.
For ExtensionBlocks, that's insufficient. If you try to concatenate two series with different EA dtypes, it'll call the first EA's `_concat_same_type` with incorrect types. For EA blocks, we need to ensure that they're the same dtype. When they differ, we should fall back to object.
Checking the dtypes actually solves a secondary problem. On master, we allow `concat([Series[Period[D]], Series[Period[M]]])`, i.e. concatenating series of periods with different frequencies. If we still want to allow that, we need to bail out before we get down to `PeriodArray._concat_same_type`.
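In a recent pandas release (after the protocol method from #33607 landed), the scenarios from this issue resolve to object dtype rather than erroring or mis-calling `_concat_same_type`. A quick check, assuming a current pandas version:

```python
import pandas as pd

# two series with different extension dtypes (nullable int and string)
a = pd.Series(pd.array([1, 2], dtype="Int64"))
b = pd.Series(pd.array(["x"], dtype="string"))

out = pd.concat([a, b], ignore_index=True)
# differing EA dtypes have no common dtype, so the result falls back to object
print(out.dtype)
```

The same fallback applies to the Period example: series with `Period[D]` and `Period[M]` dtypes concatenate to an object-dtype series of Period scalars instead of reaching `PeriodArray._concat_same_type` with mismatched frequencies.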