
TYP: _ensure_data and infer_dtype_from_array #44292

Closed
21 changes: 9 additions & 12 deletions pandas/core/algorithms.py
@@ -112,16 +112,19 @@
# --------------- #
def _ensure_data(values: ArrayLike) -> np.ndarray:
"""
routine to ensure that our data is of the correct
input dtype for lower-level routines
Ensure values is of the correct input dtype for lower-level routines.

This will coerce:
- ints -> int64
- uint -> uint64
Member

I think the ints and uints are unchanged.

Member Author

I didn't check those yet; will look tomorrow. (A quick check sketch follows below.)
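
A quick way to check this, as a rough sketch: push int and uint arrays through the private helper and compare dtypes. _ensure_data is internal API and, depending on the pandas version, may return a (values, pandas_dtype) tuple rather than the bare ndarray assumed here, so treat this as illustrative only.

import numpy as np
from pandas.core.algorithms import _ensure_data

for dtype in ("int8", "int32", "int64", "uint8", "uint64"):
    # assumes the signature shown in this diff, i.e. a bare ndarray is returned
    out = _ensure_data(np.array([1, 2, 3], dtype=dtype))
    # if ints/uints really pass through unchanged, each dtype maps to itself
    print(dtype, "->", out.dtype)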

- bool -> uint64 (TODO this should be uint8)
- bool -> uint8
- datetimelike -> i8
- datetime64tz -> i8 (in local tz)
- categorical -> codes
- categorical[bool] without nulls -> uint8
- categorical[bool] with nulls -> ValueError: cannot convert float NaN to integer
Member

is this tested/intentional?

Member Author

Looks like this was changed in #41256, although further investigation is required on whether this is a latent bug/regression. I've just updated the docstring for now to document the actual behavior.

Member Author

Categorical is fast-pathed in mode, so it does not pass through _ensure_data; the regression fix in #42131 therefore only needed the except TypeError.

In duplicated and drop_duplicates the Categorical EA is passed through _ensure_data, so it raises a ValueError that is not caught by the fix in #42131.

That will need to change, but since this is a regression from 1.2.5 it needs to be done separately so it can be backported.

Code sample based on test_drop_duplicates_categorical_bool:

import pandas as pd

print(pd.__version__)
tc = pd.Series(
    pd.Categorical(
        [True, False, True, False, pd.NA], categories=[True, False], ordered=True
    )
)
print(tc.duplicated())
1.2.5
0    False
1    False
2     True
3     True
4    False
dtype: bool
1.3.4
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_47357/1277064552.py in <module>
      7     )
      8 )
----> 9 print(tc.duplicated())

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/series.py in duplicated(self, keep)
   2215         dtype: bool
   2216         """
-> 2217         res = self._duplicated(keep=keep)
   2218         result = self._constructor(res, index=self.index)
   2219         return result.__finalize__(self, method="duplicated")

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/base.py in _duplicated(self, keep)
   1230         self, keep: Literal["first", "last", False] = "first"
   1231     ) -> np.ndarray:
-> 1232         return duplicated(self._values, keep=keep)

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in duplicated(values, keep)
    925     duplicated : ndarray[bool]
    926     """
--> 927     values, _ = _ensure_data(values)
    928     return htable.duplicated(values, keep=keep)
    929 

~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in _ensure_data(values)
    139             # i.e. all-bool Categorical, BooleanArray
    140             try:
--> 141                 return np.asarray(values).astype("uint8", copy=False), values.dtype
    142             except TypeError:
    143                 # GH#42107 we have pd.NAs present

ValueError: cannot convert float NaN to integer

Member Author

Opened #44351 and will convert this to draft until fixed.

- boolean without nulls -> uint8
- boolean with nulls -> object

Parameters
----------
@@ -165,10 +168,8 @@ def _ensure_data(values: ArrayLike) -> np.ndarray:
        return np.asarray(values)

    elif is_complex_dtype(values.dtype):
        # Incompatible return value type (got "Tuple[Union[Any, ExtensionArray,
        # ndarray[Any, Any]], Union[Any, ExtensionDtype]]", expected
        # "Tuple[ndarray[Any, Any], Union[dtype[Any], ExtensionDtype]]")
        return values  # type: ignore[return-value]
        assert isinstance(values, np.ndarray)  # for mypy
Member

could we potentially get here with PandasArray[complex]?

Member Author

Yes. This is coded to return the values as-is, whereas it would need to either extract the underlying numpy array or, if not ndarray-backed, coerce to a numpy array. This is how it's done above for is_float_dtype.

It used to be done this way before #42197. Those changes are in released pandas, so I guess no 3rd-party EA devs have hit issues.

The ignore was added in that PR and is not a false positive. We can either revert those changes or, as I have done here, use an assert to fail fast.
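
A minimal illustration of the PandasArray[complex] case, assuming pd.arrays.PandasArray (a thin ExtensionArray wrapper around an ndarray) is available: np.asarray() recovers the underlying numpy array, which is effectively what the is_float_dtype branch does.

import numpy as np
import pandas as pd

# an ExtensionArray of complex dtype wrapping a plain ndarray
arr = pd.arrays.PandasArray(np.array([1 + 2j, 3 + 4j]))
print(type(arr))              # PandasArray, not ndarray
print(type(np.asarray(arr)))  # <class 'numpy.ndarray'>
print(np.asarray(arr).dtype)  # complex128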

Member Author

Or we could leave the ignore for now and add a TODO noting that this is NOT a false positive.

Member

I was thinking return np.asarray(values).

Member Author

yep, can also fix here.

Member Author

PandasArray[complex] can't be used to test this, as the numpy array is extracted from a PandasArray before reaching here. So I guess we will need to set up a dummy EA of complex dtype to test.

But it also appears that we don't have tests where integer and floating EAs pass through _ensure_data. Need to investigate this further, as we either need tests or can remove code.
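
As a sketch of the kind of coverage being discussed, the example below routes a nullable Int64 ExtensionArray through a public entry point (duplicated) that sits on the algorithms path shown in the traceback above. Whether the EA actually reaches _ensure_data here depends on the pandas version and its internal fast paths, so this is illustrative rather than an existing test.

import pandas as pd

# nullable integer ExtensionArray with a missing value
s = pd.Series([1, 2, 2, pd.NA], dtype="Int64")

# Series.duplicated goes through pandas.core.algorithms.duplicated
print(s.duplicated().tolist())  # expected: [False, False, True, False]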

Member

So we always call extract_array(foo, extract_numpy=True) before getting here? If so, then a cast/ignore/assert seems benign.
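
For context on that question, a small sketch of what extract_array(..., extract_numpy=True) does to a PandasArray-backed input; extract_array lives in pandas.core.construction and is internal API, so this is for illustration only.

import numpy as np
import pandas as pd
from pandas.core.construction import extract_array

arr = pd.arrays.PandasArray(np.array([1 + 2j, 3 + 4j]))

# extract_numpy=True unwraps the PandasArray to its underlying ndarray,
# so code downstream of extract_array never sees the EA wrapper
print(type(extract_array(arr, extract_numpy=True)))  # <class 'numpy.ndarray'>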

        return values

    # datetimelike
    elif needs_i8_conversion(values.dtype):
@@ -1723,11 +1724,7 @@ def safe_sort(
    if not isinstance(values, (np.ndarray, ABCExtensionArray)):
        # don't convert to string types
        dtype, _ = infer_dtype_from_array(values)
        # error: Argument "dtype" to "asarray" has incompatible type "Union[dtype[Any],
        # ExtensionDtype]"; expected "Union[dtype[Any], None, type, _SupportsDType, str,
        # Union[Tuple[Any, int], Tuple[Any, Union[int, Sequence[int]]], List[Any],
        # _DTypeDict, Tuple[Any, Any]]]"
        values = np.asarray(values, dtype=dtype)  # type: ignore[arg-type]
        values = np.asarray(values, dtype=dtype)

    sorter = None

20 changes: 20 additions & 0 deletions pandas/core/dtypes/cast.py
@@ -14,6 +14,7 @@
from typing import (
    TYPE_CHECKING,
    Any,
    Literal,
    Sized,
    TypeVar,
    cast,
@@ -795,6 +796,25 @@ def dict_compat(d: dict[Scalar, Scalar]) -> dict[Scalar, Scalar]:
    return {maybe_box_datetimelike(key): value for key, value in d.items()}


@overload
def infer_dtype_from_array(
    arr,
) -> tuple[np.dtype, ArrayLike]:
    ...


@overload
def infer_dtype_from_array(
    arr, pandas_dtype: Literal[False] = ...
) -> tuple[np.dtype, ArrayLike]:
    ...


@overload
def infer_dtype_from_array(arr, pandas_dtype: bool = ...) -> tuple[DtypeObj, ArrayLike]:
    ...


def infer_dtype_from_array(
    arr, pandas_dtype: bool = False
) -> tuple[DtypeObj, ArrayLike]:
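
For readers unfamiliar with the helper being typed above, a short usage sketch of what the overloads encode: with the default pandas_dtype=False the inferred dtype is a numpy dtype, while pandas_dtype=True may yield an ExtensionDtype. infer_dtype_from_array is a private pandas helper, so this is illustrative rather than supported API.

import pandas as pd
from pandas.core.dtypes.cast import infer_dtype_from_array

# default (pandas_dtype=False): the inferred dtype is always a numpy dtype
dtype, values = infer_dtype_from_array([1, 2, 3])
print(dtype)  # int64

# pandas_dtype=True: extension dtypes are preserved
dtype, values = infer_dtype_from_array(pd.Categorical(["a", "b"]), pandas_dtype=True)
print(dtype)  # category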