ENH: concat of nullable int + bool preserves int dtype #34985

Merged 4 commits on Jul 8, 2020 (changes shown from all commits).
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.1.0.rst
@@ -321,7 +321,7 @@ Other enhancements
- :meth:`DataFrame.hist`, :meth:`Series.hist`, :meth:`core.groupby.DataFrameGroupBy.hist`, and :meth:`core.groupby.SeriesGroupBy.hist` have gained the ``legend`` argument. Set to True to show a legend in the histogram. (:issue:`6279`)
- :func:`concat` and :meth:`~DataFrame.append` now preserve extension dtypes, for example
combining a nullable integer column with a numpy integer column will no longer
-  result in object dtype but preserve the integer dtype (:issue:`33607`, :issue:`34339`).
+  result in object dtype but preserve the integer dtype (:issue:`33607`, :issue:`34339`, :issue:`34095`).
- :meth:`~pandas.io.gbq.read_gbq` now allows to disable progress bar (:issue:`33360`).
- :meth:`~pandas.io.gbq.read_gbq` now supports the ``max_results`` kwarg from ``pandas-gbq`` (:issue:`34639`).
- :meth:`DataFrame.cov` and :meth:`Series.cov` now support a new parameter ddof to support delta degrees of freedom as in the corresponding numpy methods (:issue:`34611`).
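For illustration (not part of the diff), a minimal sketch of the behaviour this whatsnew entry describes, assuming pandas 1.1 or later:

```python
import pandas as pd

# Concatenating a nullable Int64 Series with a plain numpy int64 Series
# now keeps the nullable integer dtype instead of falling back to object.
nullable = pd.Series([1, 2, pd.NA], dtype="Int64")
plain = pd.Series([3, 4], dtype="int64")

result = pd.concat([nullable, plain], ignore_index=True)
print(result.dtype)  # Int64
```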
9 changes: 6 additions & 3 deletions pandas/core/arrays/integer.py
@@ -92,10 +92,13 @@ def construct_array_type(cls) -> Type["IntegerArray"]:
        return IntegerArray

    def _get_common_dtype(self, dtypes: List[DtypeObj]) -> Optional[DtypeObj]:
-        # for now only handle other integer types
+        # we only handle nullable EA dtypes and numeric numpy dtypes
        if not all(
-            isinstance(t, _IntegerDtype)
-            or (isinstance(t, np.dtype) and np.issubdtype(t, np.integer))
+            isinstance(t, BaseMaskedDtype)
Contributor:

I worry a bit about this. I could easily see other arrays using the BaseMasked stuff (e.g. StringArray) that shouldn't necessarily be considered "integer-like" for this concat.

So I'd be more comfortable with (_IntegerDtype, BooleanDtype) even though those are synonymous today.

Contributor:

agree with @TomAugspurger here

Member Author (@jorisvandenbossche), Jun 25, 2020:

The check for BaseMaskedDtype ensures the dtype has a numpy_dtype, and below we still do a np.find_common_type on the result. So assuming you have int + string, numpy will return object dtype for that, in which case we still return None from this function (which is equivalent to making the check here stricter and returning None immediately).

So even when we make StringArray a masked array, this method should already work as expected.

And doing it this way, I don't have to add FloatingDtype to the list of (_IntegerDtype, BooleanDtype) in the floating PR.
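As a rough sketch of the argument above (not code from the PR): if a hypothetical masked string dtype reported object as its numpy_dtype, the common numpy dtype with an integer would be object, so the method would still bail out:

```python
import numpy as np

# object (what a masked string dtype would report as numpy_dtype) combined
# with int64 gives object, which is not an integer kind.
common = np.find_common_type([np.dtype("int64"), np.dtype("object")], [])
print(common)       # object
print(common.kind)  # 'O' -> not 'i' or 'u', so _get_common_dtype returns None
```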

Contributor (@TomAugspurger), Jun 25, 2020:

> So even when we make StringArray a masked array, this method should already work as expected.

Good to know.

> I don't have to add FloatingDtype to the list of (_IntegerDtype, BooleanDtype) in the floating PR.

Will we want to return None here for float? Or I suppose the find_common_type stuff will handle that as well, just like string?

Contributor:

@jreback does #34985 (comment) make sense?

Contributor:

Makes sense, though I still feel there is no downside to being explicit and listing the two dtypes directly; it's more obvious. Agreed that if we add more integer-convertible dtypes in the future they won't automatically be picked up, but I think that is of lesser benefit than better readability here (I mean, you could add a comment, but I think listing the classes is better).
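For reference, a hypothetical sketch of the more explicit check being suggested; the import paths are pandas internals at the time of this PR and the helper name is made up for illustration:

```python
from pandas.core.arrays.boolean import BooleanDtype
from pandas.core.arrays.integer import _IntegerDtype


def _is_allowed_masked_dtype(t) -> bool:
    # Explicit allow-list of the two masked dtypes that exist today; any
    # future masked dtype (e.g. FloatingDtype) would have to be added here.
    return isinstance(t, (_IntegerDtype, BooleanDtype))
```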

Member Author:

I think how it is written now also gives a good message: all BaseMaskedDtype subclasses are supported by this method (which is the case, even though some might still make it return None later on).

> Will we want to return None here for float? Or I suppose the find_common_type stuff will handle that as well, just like string?

Yes, right now we would still return None for float (an EA dtype is only returned if the common numpy dtype is an integer dtype). Once we have FloatingArray, we would add an additional check for the case where the common numpy dtype is a float dtype, and then return an EA floating dtype.

+            or (
+                isinstance(t, np.dtype)
+                and (np.issubdtype(t, np.number) or np.issubdtype(t, np.bool_))
+            )
            for t in dtypes
        ):
            return None
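To make the behaviour discussed in the review thread concrete, a small illustration (not the PR code itself) of how the common numpy dtype drives the result:

```python
import numpy as np

# int64 + bool promotes to int64, an integer kind, so the nullable Int64
# dtype is preserved.
print(np.find_common_type([np.dtype("int64"), np.dtype("bool")], []))    # int64

# int64 + uint64 promotes to float64, not an integer kind, so the method
# returns None and concat currently falls back to object dtype.
print(np.find_common_type([np.dtype("int64"), np.dtype("uint64")], []))  # float64
```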
45 changes: 43 additions & 2 deletions pandas/tests/arrays/integer/test_concat.py
@@ -1,3 +1,4 @@
+import numpy as np
import pytest

import pandas as pd
@@ -15,12 +16,52 @@
        (["Int32", "UInt32"], "Int64"),
        # this still gives object (awaiting float extension dtype)
        (["Int64", "UInt64"], "object"),
+        (["Int64", "boolean"], "Int64"),
+        (["UInt8", "boolean"], "UInt8"),
    ],
)
def test_concat_series(to_concat_dtypes, result_dtype):

-    result = pd.concat([pd.Series([1, 2, pd.NA], dtype=t) for t in to_concat_dtypes])
-    expected = pd.concat([pd.Series([1, 2, pd.NA], dtype=object)] * 2).astype(
+    result = pd.concat([pd.Series([0, 1, pd.NA], dtype=t) for t in to_concat_dtypes])
+    expected = pd.concat([pd.Series([0, 1, pd.NA], dtype=object)] * 2).astype(
        result_dtype
    )
    tm.assert_series_equal(result, expected)

+    # order doesn't matter for result
+    result = pd.concat(
+        [pd.Series([0, 1, pd.NA], dtype=t) for t in to_concat_dtypes[::-1]]
+    )
+    expected = pd.concat([pd.Series([0, 1, pd.NA], dtype=object)] * 2).astype(
+        result_dtype
+    )
+    tm.assert_series_equal(result, expected)


+@pytest.mark.parametrize(
+    "to_concat_dtypes, result_dtype",
+    [
+        (["Int64", "int64"], "Int64"),
+        (["UInt64", "uint64"], "UInt64"),
+        (["Int8", "int8"], "Int8"),
+        (["Int8", "int16"], "Int16"),
+        (["UInt8", "int8"], "Int16"),
+        (["Int32", "uint32"], "Int64"),
+        # this still gives object (awaiting float extension dtype)
+        (["Int64", "uint64"], "object"),
+        (["Int64", "bool"], "Int64"),
+        (["UInt8", "bool"], "UInt8"),
+    ],
+)
+def test_concat_series_with_numpy(to_concat_dtypes, result_dtype):
+
+    s1 = pd.Series([0, 1, pd.NA], dtype=to_concat_dtypes[0])
+    s2 = pd.Series(np.array([0, 1], dtype=to_concat_dtypes[1]))
+    result = pd.concat([s1, s2], ignore_index=True)
+    expected = pd.Series([0, 1, pd.NA, 0, 1], dtype=object).astype(result_dtype)
+    tm.assert_series_equal(result, expected)
+
+    # order doesn't matter for result
+    result = pd.concat([s2, s1], ignore_index=True)
+    expected = pd.Series([0, 1, 0, 1, pd.NA], dtype=object).astype(result_dtype)
+    tm.assert_series_equal(result, expected)