Incorrect behavior when concatenating multiple ExtensionBlocks with different dtypes #22994

Closed
TomAugspurger opened this issue Oct 4, 2018 · 18 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@TomAugspurger (Contributor)

In

if all(type(b) is type(blocks[0]) for b in blocks[1:]): # noqa

we check that we have one type of block.

For ExtensionBlocks, that's insufficient. If you try to concatenate two Series with different EA dtypes, it'll call the first EA's _concat_same_type with incorrect types.

In [13]: from pandas.tests.extension.decimal.test_decimal import *

In [14]: import pandas as pd

In [15]: a = pd.Series(pd.core.arrays.integer_array([1, 2]))

In [16]: b = pd.Series(DecimalArray([decimal.Decimal(1), decimal.Decimal(2)]))

In [17]: pd.concat([a, b])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-714da278d09e> in <module>
----> 1 pd.concat([a, b])

~/sandbox/pandas/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    225                        verify_integrity=verify_integrity,
    226                        copy=copy, sort=sort)
--> 227     return op.get_result()
    228
    229

~/sandbox/pandas/pandas/core/reshape/concat.py in get_result(self)
    389
    390                 mgr = self.objs[0]._data.concat([x._data for x in self.objs],
--> 391                                                 self.new_axes)
    392                 cons = _concat._get_series_result_type(mgr, self.objs)
    393                 return cons(mgr, name=name).__finalize__(self, method='concat')

~/sandbox/pandas/pandas/core/internals/managers.py in concat(self, to_concat, new_axis)
   1637
   1638             if all(type(b) is type(blocks[0]) for b in blocks[1:]):  # noqa
-> 1639                 new_block = blocks[0].concat_same_type(blocks)
   1640             else:
   1641                 values = [x.values for x in blocks]

~/sandbox/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
   2047         """
   2048         values = self._holder._concat_same_type(
-> 2049             [blk.values for blk in to_concat])
   2050         placement = placement or slice(0, len(values), 1)
   2051         return self.make_block_same_class(values, ndim=self.ndim,

~/sandbox/pandas/pandas/core/arrays/integer.py in _concat_same_type(cls, to_concat)
    386     def _concat_same_type(cls, to_concat):
    387         data = np.concatenate([x._data for x in to_concat])
--> 388         mask = np.concatenate([x._mask for x in to_concat])
    389         return cls(data, mask)
    390

~/sandbox/pandas/pandas/core/arrays/integer.py in <listcomp>(.0)
    386     def _concat_same_type(cls, to_concat):
    387         data = np.concatenate([x._data for x in to_concat])
--> 388         mask = np.concatenate([x._mask for x in to_concat])
    389         return cls(data, mask)
    390

AttributeError: 'DecimalArray' object has no attribute '_mask'

For EA blocks, we need to ensure that they're the same dtype. When they differ, we should fall back to object.

Checking the dtypes actually solves a secondary problem. On master, we allow concat([ Series[Period[D]], Series[Period[M]] ]), i.e. concatenating series of periods with different frequencies. If we still want to allow that, we need to bail out before we get down to PeriodArray._concat_same_type.
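A minimal sketch of that fallback rule (the helper name concat_extension_arrays is hypothetical, not pandas API; assumes a recent pandas with the nullable Int64/Float64 dtypes): call _concat_same_type only when every array shares one dtype, otherwise drop to object-dtype ndarrays.

```python
import numpy as np
import pandas as pd

def concat_extension_arrays(arrays):
    # Hypothetical helper illustrating the proposed rule: only call
    # _concat_same_type when every array has the same dtype.
    dtypes = {arr.dtype for arr in arrays}
    if len(dtypes) == 1:
        return type(arrays[0])._concat_same_type(arrays)
    # Mixed extension dtypes: fall back to object.
    return np.concatenate([np.asarray(arr, dtype=object) for arr in arrays])

a = pd.array([1, 2], dtype="Int64")
b = pd.array([1.5, 2.5], dtype="Float64")
mixed = concat_extension_arrays([a, b])  # object-dtype ndarray, no AttributeError
same = concat_extension_arrays([a, a])   # stays Int64
```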

@TomAugspurger TomAugspurger added Reshaping Concat, Merge/Join, Stack/Unstack, Explode ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 4, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 4, 2018
@jorisvandenbossche (Member)

Do we need more fine-grained control?

Because I could imagine that in some cases an ExtensionArray (e.g. with a parametrized dtype) would like a smarter way to concat arrays with different dtypes than just converting to object?

Also, e.g. IntegerArray with int64 and int32 would not need to be converted to object?

@TomAugspurger (Contributor Author)

Do we need more fine-grained control?

Absolutely. We're even getting there with Sparse, since it would like to take non-sparse arrays and make them sparse, rather than going to object. Right now I think we just special case sparse before getting to concat.

I think we're coming up on the need for a general concat_array mechanism: we scan the list of types and try those. I wonder how much we should piggyback on https://www.numpy.org/neps/nep-0018-array-function-protocol.html. I think it would be nice if a pd.api.extensions.concatenate_array simply became np.concatenate some day.

@jorisvandenbossche (Member)

Looking at the array function protocol as inspiration for the design seems like a good idea to me

@TomAugspurger (Contributor Author) commented Nov 6, 2018

Are we OK with pushing this to 0.25?

As a proposal, we can have something like the following:

Iterate through the dtypes, calling ExtensionDtype.get_concat_dtype(dtypes). As soon as an array type says "I know how to handle all these dtypes" by returning a non-None dtype, we stop looking.

def get_concat_dtype(arrays):  # internal to pandas
    """
    Get the result dtype for concatenating many arrays.

    Parameters
    ----------
    arrays : Sequence[Union[numpy.ndarray, ExtensionArray]]

    Returns
    -------
    dtype : Union[ExtensionDtype, numpy.dtype]
        The NumPy dtype or ExtensionDtype to use for the concatenated
        array.
    """
    dtypes = {x.dtype for x in arrays}
    if len(dtypes) == 1:
        return next(iter(dtypes))

    seen = set()
    # iterate in order of `arrays`
    for arr in arrays:
        dtype = arr.dtype
        if dtype in seen:
            continue
        seen.add(dtype)

        # this assumes it's an extension dtype, which isn't correct
        result_dtype = dtype.get_concat_dtype(dtypes)
        if result_dtype is not None:
            return result_dtype

    return np.dtype('object')

class ExtensionDtype:
    ...
    @classmethod
    def get_concat_dtype(cls, dtypes):
        # part of the extension array API
        return None

So for SparseDtype, we would return a SparseDtype(object), or a SparseDtype(dtype) if the subtypes are all coercible. For IntegerArray, we would return the highest precision necessary (follow the rules of the underlying dtypes). Categorical could do something like union_categoricals.

Some questions:

  1. Should this be "type stable", i.e. ExtensionDtype.get_concat_dtype only sees dtypes and not values?
  2. Should we special-case self in ExtensionDtype.get_concat_dtype? Right now I made it a classmethod, but do we want to know the actual instance?
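To make the proposal concrete, here is a hypothetical get_concat_dtype implementation for the nullable integer dtypes (the function name and promotion rule are illustrative, not pandas API): it claims the concat only when every dtype in the list is an integer dtype, returning the promoted dtype, and defers by returning None otherwise.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

def integer_get_concat_dtype(dtypes):
    # Hypothetical hook: handle the concat only if all dtypes are
    # integer dtypes; otherwise defer (return None).
    if not all(is_integer_dtype(dt) for dt in dtypes):
        return None
    # Promote via the underlying numpy dtypes ("Int8" -> "int8", ...).
    return np.result_type(*[np.dtype(str(dt).lower()) for dt in dtypes])

integer_get_concat_dtype([pd.Int8Dtype(), pd.Int32Dtype()])   # int32
integer_get_concat_dtype([pd.Int8Dtype(), pd.StringDtype()])  # None -> fall back
```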

@jorisvandenbossche (Member)

Example from duplicate issue:

Consider concatenating two DataFrames that both have a column with an extension dtype, but a different one (here string and nullable int):

In [9]: df1 = pd.DataFrame({'A': ['a', 'b']}).astype("string")   

In [10]: df2 = pd.DataFrame({'A': [1, 2]}).astype('Int64')  

In [11]: pd.concat([df1, df2])   
...
~/scipy/pandas/pandas/core/internals/blocks.py in concat_same_type(self, to_concat, placement)
  1838         Concatenate list of single blocks of the same type.
  1839         """
-> 1840         values = self._holder._concat_same_type([blk.values for blk in to_concat])
  1841         placement = placement or slice(0, len(values), 1)
  1842         return self.make_block_same_class(values, ndim=self.ndim, placement=placement)

~/scipy/pandas/pandas/core/arrays/numpy_.py in _concat_same_type(cls, to_concat)
   157     @classmethod
   158     def _concat_same_type(cls, to_concat):
--> 159         return cls(np.concatenate(to_concat))
   160 
   161     # ------------------------------------------------------------------------

~/scipy/pandas/pandas/core/arrays/string_.py in __init__(self, values, copy)
   154         self._dtype = StringDtype()
   155         if not skip_validation:
--> 156             self._validate()
   157 
   158     def _validate(self):

~/scipy/pandas/pandas/core/arrays/string_.py in _validate(self)
   160         if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
   161             raise ValueError(
--> 162                 "StringArray requires a sequence of strings or missing values."
   163             )
   164         if self._ndarray.dtype != "object":

ValueError: StringArray requires a sequence of strings or missing values.

This errors because in the concatenation, we have the following code:

elif is_uniform_join_units(join_units):
    b = join_units[0].block.concat_same_type(
        [ju.block for ju in join_units], placement=placement
    )

and is_uniform_join_units only checks for ExtensionBlock, not for the dtype of the block. Therefore, ExtensionBlock.concat_same_type -> ExtensionArray._concat_same_type gets called assuming that all values are of the same dtype (which is not the case here, leading to the error above).

The easy fix is to make is_uniform_join_units do a stricter check. But, we also need to decide on the alternative handling: how can some blocks still coerce? (eg Int64 and Int32 should result in Int64 ?)
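The stricter check could look something like this sketch over plain arrays (is_uniform is a hypothetical stand-in for the internal is_uniform_join_units, which operates on join units rather than arrays):

```python
import pandas as pd

def is_uniform(arrays):
    # Hypothetical stricter check: same array class *and* same dtype,
    # not merely "all ExtensionArray".
    return len({(type(arr), arr.dtype) for arr in arrays}) == 1

a = pd.array(["a", "b"], dtype="string")
b = pd.array([1, 2], dtype="Int64")
is_uniform([a, a])  # True -> safe to use _concat_same_type
is_uniform([a, b])  # False -> must coerce or fall back instead
```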

@jorisvandenbossche (Member)

Moving some discussion from #33535 here

Thinking out loud here: what if we have an ExtensionDtype.__concat_arrays__(self, arrays, dtypes) that either returns a concatenated array, if it decides it knows how to concat those arrays, or otherwise returns NotImplemented.
So this is kind of a hybrid between EA._concat_arrays (from #33535) and the ExtensionDtype.get_concat_dtype from above.

Concatenating a list of arrays of any types could then be something like this in pseudo-code:

def concat_array(arrays):
    types = {x.dtype for x in arrays}

    for typ in types:
        res = typ.__concat_arrays__(arrays, types)
        if res is not NotImplemented:
            break
    else:
        # no dtype knew how to handle it, fallback to object dtype
        res = np.concatenate([arr.astype(object) for arr in arrays])
    return res

The logic of which types to coerce on concatenation and which not can then be fully in the EA (and it uses the order of the arrays in case multiple EAs both can "handle" each other, but I don't know if we have this case with our own dtypes). And the special case of "same dtype" can easily be detected by the EA if the passed set of dtypes has length 1.

So the big difference with get_concat_dtype from above is that this directly gives you the array, and not the dtype.
Do we think it is important to be able to know the resulting dtype in advance, before actually doing the concatenation?

It would also eliminate the existing _concat_same_type method, where in principle we know that the result will be an array of the same class / dtype as the calling dtype. Again, is that an important feature to have? Normally we can easily infer from the resulting array what kind of block it needs to be placed in, so that also seems minor.
I suppose that we can also still assume that if we only pass arrays of its own dtype, you get an array of that dtype back.

@TomAugspurger (Contributor Author) commented Apr 15, 2020

Just stepping back a bit, it seems like we have everything we need for the actual concatenation part. It's just the casting / astype part that's causing issues. I'd love to be able to write concat as

dtype = get_result_dtype({x.dtype for x in arrays})
arrays = [x.astype(dtype) for x in arrays]
result = arrays[0]._concat_same_type(arrays)

i.e. do all the casting ahead of time, and then call _concat_same_type to get the output array.

This puts the majority of the complexity in astype, which we need to solve anyway. I worry that in the __concat_arrays__ protocol, the implementations are going to look a lot like .astype anyway.
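The three-line sketch above already works with today's astype and _concat_same_type once the target dtype is known; here the get_result_dtype step is hard-coded to Int64 for illustration (assumes a recent pandas with the nullable integer dtypes):

```python
import pandas as pd

a = pd.array([1, 2], dtype="Int8")
b = pd.array([3, 4], dtype="Int32")

dtype = pd.Int64Dtype()  # stand-in for the missing get_result_dtype({Int8, Int32})
# cast everything ahead of time, then concat arrays of a single dtype
arrays = [x.astype(dtype) for x in (a, b)]
result = arrays[0]._concat_same_type(arrays)
# result is an IntegerArray with dtype Int64: [1, 2, 3, 4]
```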

@jorisvandenbossche (Member) commented Apr 15, 2020

That's a good point (it still needs the get_concat_dtype, right? so we don't have everything yet?)

So that brings the question: would there be cases where it would be beneficial / required to only do the casting while concatting, instead of casting beforehand? Or cases where you might cast differently knowing the arrays, and not just the dtypes?

I suppose we want casting behaviour to be only dtype-dependent, and not value-dependent? (and similarly for concatting)
Which, if that holds true, might indicate that it should not matter whether we cast beforehand or during the concat operation.

I am trying to think of cases where we wouldn't follow that right now with our own dtypes, but can't directly think of something (e.g. categoricals only result in categoricals if the dtypes are equal, and even things like int dtype / object consisting of ints result in object dtype without trying to infer the objects).

@jorisvandenbossche (Member) commented Apr 15, 2020

The only potential use case I could come up with right now is an EA that wants to infer the dtype from object-dtyped values while concatting.
That is something we currently don't do ourselves, I think (at least for the few cases I tested); but assume that you want to preserve the datetime64 dtype in a case like this:

pd.concat([
    pd.Series([pd.Timestamp("2012-01-01")]),
    pd.Series([pd.Timestamp("2012-01-02")], dtype=object)
])  

for which you need to inspect the values to know whether preserving the dtype is possible. A potential behaviour like this (e.g. for external EAs) would not be possible with the proposal of getting the result dtype and casting before concatting.

But, I suppose we are perfectly fine with "disabling" such potential behaviour? (if we want to have the behaviour dtype dependent and not value dependent, that's the logical consequence).

@TomAugspurger (Contributor Author)

it still needs the get_concat_dtype, right? so we don't have everything yet?

Yeah, we don't really have anything :) We don't have a function to take a List[Dtype] -> dtype, and we don't have the astype protocol down yet.

I'd be fine with losing the behavior you posted. I really, really like having dtype stability :) That partially comes from Dask, where we aren't able to replicate that behavior. But I also just think it's a generally good thing to strive for.

@jorisvandenbossche (Member)

Yes, I agree this is a good thing to strive for (and to be clear, I also don't think we actually have a case that conflicts with it right now internally; I think we already cleaned up concat-related dtype things quite a bit over the last years).

Yeah, we don't really have anything :) We don't have a function to take a List[Dtype] -> dtype, and we don't have the astype protocol down yet.

We for sure need to fix the astype situation as well, but I think we don't need to wait on it to handle the concat situation. In many cases, the astype will either be to object dtype (which all EAs should already handle fine) or to a very "close" dtype (e.g. int8 to int64), which will mostly already work with the current astype, I think.

So, that brings us to the "get_concat_dtype" method.

Above you put a prototype like:

class ExtensionDtype:
    ...
    @classmethod
    def get_concat_dtype(cls, dtypes):
        # part of the extension array API
        return None

which I think is supposed to return None if it cannot concat, or the result dtype if it knows how to concat. Correct?
(I proposed to return NotImplemented instead of None, but both are equivalent and only an implementation detail I think).

So some questions related to this:

Should this be "type stable", i.e. the ExtensionDtype.get_concat_dtype only sees types and not values?

This we were just discussing, and I think agree that we want type stability.

I worry a bit about concat([a, b]) giving a different result dtype than concat([b, a]). I'm not sure how problematic that is in practice...

I think that's fine, and will probably be the case in any solution we come up with. I am also not sure we actually have cases where this would happen right now with our own dtypes?
If we want to avoid this, the solution would be, while determining the result dtype, to not "stop" when a result dtype is returned, but to ask all dtypes in the list for potential result dtypes, and raise an error if multiple different dtypes are returned. So I think this is quite possible to do if we want (but I am not sure it will actually be helpful in practice).

Should we special case self in ExtensionDtype.get_concat_dtype. Right now I made it a class method, but do we want to know that the actual instance?

I suppose in principle it shouldn't matter (since "self" will also be in the list of dtypes). I think it still wouldn't hurt to make it an instance method (in case an EA implementation wants that), since we will always have the instance available, right?

One potential use case might, for example, be fletcher, where they have a single dtype class and all different dtypes are instances of that same class. Implementation-wise it might be helpful for them to have the instance, instead of needing to infer it from the list of dtypes.

@jorisvandenbossche (Member) commented Apr 15, 2020

Sorry for all the text :)

Additional question: are we happy with ExtensionDtype.get_concat_dtype, or:

  • do we want to make this private (_get_concat_dtype, or in dunder form __concat_dtype__)? I have the feeling it is not needed to have this public for the general user.
  • do we want to have this concat-specific, or might there be use cases for keeping this more general? E.g. a get_result_dtype(dtypes, method="concat"), which would allow adding other operations that want to know a result dtype without needing a method per operation.

That last point might be too much "future thinking" though, if we don't yet have any specific use cases

@TomAugspurger (Contributor Author)

(I proposed to return NotImplemented instead of None, but both are equivalent and only an implementation detail I think).

Agreed, doesn't matter which we use.

I worry a bit about concat([a, b]) giving a different result dtype than concat([b, a]).

I think that's fine, and will probably ...

Agreed, let's not worry about that right now.

I think it still wouldn't hurt to make it an instance method (in case an EA implementation wants that), since we will always have the instance available, right?

This feels like we're inviting order dependent behavior, but yes I think that's fine :)

do we want to make this private (_get_concat_dtype, or in dunder form __concat_dtype__). I have the feeling it is not needed to have this public for the general user.

Agreed, it should just be for EA authors. I'd prefer the _get_concat_dtype form rather than a dunder, since this isn't a protocol in the sense that there's a top-level function associated with it. It'll just be pandas calling it as part of pd.concat([...]), not users.

do we want to have this concat-specific, or might there be use cases for keeping this more general

Ideally, I'd like to have similar logic / code backing concat and astype -- really just astype in my ideal world, so concat doesn't have to worry about it. I don't see it beyond that though (e.g. I don't think get_result_type(dtypes, method="add") makes sense).

That said, I recognize that I'm asking for a lot more than an easy way to concat [Int8, Int32] -> Int64.

@jorisvandenbossche (Member)

Ideally, I'd like to have similar logic / code backing concat and astype -- really just astype in my ideal world, so concat doesn't have to worry about it.

Can you explain this a bit more? I understand that for concat, we just need to know the dtype to pass to astype and then use _concat_same_type. But how would something like get_result_dtype be used in astype?

You mean like: when doing arr1.astype(dtype2), we can do arr1.dtype.get_result_dtype([arr1.dtype, dtype2]) to know the resulting dtype? But we already know this, right? (it's dtype2, or an error).
For astype, we need to know which of the two dtypes implements the conversion, not what the resulting dtype will be (and which is not necessarily the same as which dtype implements the conversion).

Also, for astype, we want to be more liberal in which casting is allowed, while for concat we want to be more strict?

@TomAugspurger (Contributor Author)

Sorry, I may not have been clear. I was thinking of get_result_dtype producing dtype2, the dtype that .astype should go to. Finding the right implementation is indeed another matter.

@TomAugspurger (Contributor Author)

Also, for astype, we want to be more liberal in which casting is allowed, while for concat we want to be more strict?

Yes, that seems correct.

@jorisvandenbossche (Member)

@TomAugspurger I think this can be closed now (with #33607 for the actual protocol method, and some of the follow-up issue to ensure it is used)
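For reference, on a recent pandas where the dtype-resolution protocol landed (ExtensionDtype._get_common_dtype in current versions), the reproducer from the duplicate issue no longer raises; the mixed column falls back to object:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["a", "b"]}).astype("string")
df2 = pd.DataFrame({"A": [1, 2]}).astype("Int64")

# Previously raised "StringArray requires a sequence of strings or
# missing values."; now the common dtype is resolved before concatenating.
out = pd.concat([df1, df2], ignore_index=True)
# out["A"] has object dtype
```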

@TomAugspurger (Contributor Author)

Thanks!


3 participants