Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/TST: Fix infer_dtype for Period array-likes and general ExtensionArrays #37367

Merged
merged 15 commits into from
Feb 12, 2021
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 15 additions & 7 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -1073,6 +1073,7 @@ _TYPE_MAP = {
"M": "datetime64",
"timedelta64[ns]": "timedelta64",
"m": "timedelta64",
"period": "period",
"interval": "interval",
}

Expand Down Expand Up @@ -1226,9 +1227,14 @@ cdef object _try_infer_map(object dtype):
object val
str attr
for attr in ["name", "kind", "base"]:
val = getattr(dtype, attr)
val = getattr(dtype, attr, None)
jbrockmendel marked this conversation as resolved.
Show resolved Hide resolved
if val in _TYPE_MAP:
return _TYPE_MAP[val]
# also check base name for parametrized dtypes (eg period[D])
if isinstance(val, str):
val = val.split("[")[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we explicity add the Period[D] types to the type_map (there aren't that many)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going from the PeriodDtypeCode enum, there are actually quite some?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we were to get that specific, at some point it makes more sense to just return a dtype object

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right here why can't we try to convert to an EA type using registry.find()?

Copy link
Member Author

@jorisvandenbossche jorisvandenbossche Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we already have a dtype here, there is no need to look anything up in the registry. We don't want to infer a type, we have an EA dtype that we want to find in the TYPE_MAP to infer the category returned by infer_dtype, by checking the name, kind or base attributes of the dtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to import those in lib.pyx ? (might be fine, eg there are already imports from tslibs.period, but at the moment only cimport s)

If that's OK, I am fine with changing it. I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not already import all of the scalar EA types?
why is this any different

+1 on using the existing machinery

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't necessarily find it less "hacky" than the current solution, but I just want some solution that is acceptable for all

100% agree on both points

Do we want to import those in lib.pyx

not ideal, but i dont think it will harm anything ATM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and please don't say that I should convert the string to a dtype, as you can see in the code a few lines above, we actually start from a dtype object)

exactly you are missing the entire point of the dtype abstraction

you avoid parsing strings in the first place

i will be blocking this util / unless a good soln is done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we not already import all of the scalar EA types?

lib.pyx doesn't know anything about EAs. It only imports helper functions like is_period_object

why is this any different

different than what?

if val in _TYPE_MAP:
jreback marked this conversation as resolved.
Show resolved Hide resolved
return _TYPE_MAP[val]
return None


Expand Down Expand Up @@ -1267,6 +1273,7 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
- time
- period
- mixed
- unknown-array

Raises
------
Expand All @@ -1279,6 +1286,9 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
specialized
- 'mixed-integer-float' are floats and integers
- 'mixed-integer' are integers mixed with non-integers
- 'unknown-array' is the catchall for something that *is* an array (has
a dtype attribute), but has a dtype unknown to pandas (e.g. external
extension array)

Examples
--------
Expand Down Expand Up @@ -1347,12 +1357,10 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
# e.g. categoricals
dtype = value.dtype
if not isinstance(dtype, np.dtype):
value = _try_infer_map(value.dtype)
if value is not None:
return value

# its ndarray-like but we can't handle
raise ValueError(f"cannot infer type for {type(value)}")
inferred = _try_infer_map(value.dtype)
if inferred is not None:
return inferred
return "unknown-array"

# Unwrap Series/Index
values = np.asarray(value)
Expand Down
6 changes: 1 addition & 5 deletions pandas/core/strings/accessor.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,11 +208,7 @@ def _validate(data):
if isinstance(values.dtype, StringDtype):
return "string"

try:
inferred_dtype = lib.infer_dtype(values, skipna=True)
except ValueError:
# GH#27571 mostly occurs with ExtensionArray
inferred_dtype = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's another case almost identical to this in dtypes.cast.convert_dtypes

inferred_dtype = lib.infer_dtype(values, skipna=True)

if inferred_dtype not in allowed_types:
raise AttributeError("Can only use .str accessor with string values!")
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/dtypes/test_inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -891,6 +891,19 @@ def test_infer_dtype_period(self):
arr = np.array([Period("2011-01", freq="D"), Period("2011-02", freq="M")])
assert lib.infer_dtype(arr, skipna=True) == "period"

jreback marked this conversation as resolved.
Show resolved Hide resolved
@pytest.mark.parametrize("klass", [pd.array, pd.Series, pd.Index])
@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype_period_array(self, klass, skipna):
# https://github.com/pandas-dev/pandas/issues/23553
values = klass(
[
Period("2011-01-01", freq="D"),
Period("2011-01-02", freq="D"),
pd.NaT,
]
)
assert lib.infer_dtype(values, skipna=skipna) == "period"

def test_infer_dtype_period_mixed(self):
arr = np.array(
[Period("2011-01", freq="M"), np.datetime64("nat")], dtype=object
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/extension/base/dtype.py
Original file line number Diff line number Diff line change
Expand Up @@ -123,3 +123,11 @@ def test_get_common_dtype(self, dtype):
# still testing as good practice to have this working (and it is the
# only case we can test in general)
assert dtype._get_common_dtype([dtype]) == dtype

@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype(self, data, data_missing, skipna):
# only testing that this works without raising an error
res = pd.api.types.infer_dtype(data, skipna=skipna)
assert isinstance(res, str)
res = pd.api.types.infer_dtype(data_missing, skipna=skipna)
assert isinstance(res, str)
7 changes: 7 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
Original file line number Diff line number Diff line change
Expand Up @@ -116,6 +116,13 @@ class TestDtype(BaseDecimal, base.BaseDtypeTests):
def test_hashable(self, dtype):
pass

@pytest.mark.parametrize("skipna", [True, False])
def test_infer_dtype(self, data, data_missing, skipna):
# here overriding base test to ensure we fall back to return
# "unknown-array" for an EA pandas doesn't know
assert pd.api.types.infer_dtype(data, skipna=skipna) == "unknown-array"
assert pd.api.types.infer_dtype(data_missing, skipna=skipna) == "unknown-array"


class TestInterface(BaseDecimal, base.BaseInterfaceTests):
pass
Expand Down