TYP: _ensure_data and infer_dtype_from_array #44292
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -112,16 +112,19 @@ | |
# --------------- # | ||
def _ensure_data(values: ArrayLike) -> np.ndarray: | ||
""" | ||
routine to ensure that our data is of the correct | ||
input dtype for lower-level routines | ||
Ensure values is of the correct input dtype for lower-level routines. | ||
|
||
This will coerce: | ||
- ints -> int64 | ||
- uint -> uint64 | ||
- bool -> uint64 (TODO this should be uint8) | ||
- bool -> uint8 | ||
- datetimelike -> i8 | ||
- datetime64tz -> i8 (in local tz) | ||
- categorical -> codes | ||
- categorical[bool] without nulls -> uint8 | ||
- categorical[bool] with nulls -> ValueError: cannot convert float NaN to integer | ||
is this tested/intentional?
looks like this was changed in #41256 although further investigation required on whether this is a latent bug/regression. Just updated the docstring for now to document the actual behavior.
categorical is fast pathed in In So will need to change that but this is a regression from 1.2.5 so will need to be done separate so can be backported.
code sample based on test_drop_duplicates_categorical_bool
import pandas as pd
print(pd.__version__)
tc = pd.Series(
pd.Categorical(
[True, False, True, False, pd.NA], categories=[True, False], ordered=True
)
)
print(tc.duplicated())
1.3.4
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_47357/1277064552.py in <module>
7 )
8 )
----> 9 print(tc.duplicated())
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/series.py in duplicated(self, keep)
2215 dtype: bool
2216 """
-> 2217 res = self._duplicated(keep=keep)
2218 result = self._constructor(res, index=self.index)
2219 return result.__finalize__(self, method="duplicated")
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/base.py in _duplicated(self, keep)
1230 self, keep: Literal["first", "last", False] = "first"
1231 ) -> np.ndarray:
-> 1232 return duplicated(self._values, keep=keep)
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in duplicated(values, keep)
925 duplicated : ndarray[bool]
926 """
--> 927 values, _ = _ensure_data(values)
928 return htable.duplicated(values, keep=keep)
929
~/miniconda3/envs/pandas-1.3.4/lib/python3.9/site-packages/pandas/core/algorithms.py in _ensure_data(values)
139 # i.e. all-bool Categorical, BooleanArray
140 try:
--> 141 return np.asarray(values).astype("uint8", copy=False), values.dtype
142 except TypeError:
143 # GH#42107 we have pd.NAs present
ValueError: cannot convert float NaN to integer
opened #44351 and will convert this to draft till fixed. |
||
- boolean without nulls -> uint8 | ||
- boolean with nulls -> object | ||
|
||
Parameters | ||
---------- | ||
|
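The failure discussed in the thread above can be reproduced with NumPy alone. The following is a minimal sketch, assuming an object-dtype array like the one in the traceback. `to_uint8_or_object` is a hypothetical helper written for illustration; it is not pandas' `_ensure_data` and not the fix tracked in #44351.

```python
import numpy as np


def to_uint8_or_object(values: np.ndarray) -> np.ndarray:
    """Hypothetical helper: cast bool-like values to uint8, else keep object."""
    try:
        return values.astype("uint8", copy=False)
    except (TypeError, ValueError):
        # A float NaN raises ValueError here, which an `except TypeError`
        # clause alone (as in the traceback above) does not catch.
        return values.astype(object, copy=False)


no_nulls = np.array([True, False, True], dtype=object)
with_nan = np.array([True, False, np.nan], dtype=object)

print(to_uint8_or_object(no_nulls).dtype)
print(to_uint8_or_object(with_nan).dtype)
```

Catching both exception types is only one possible design; whether pandas should fall back to object here or handle the categorical codes differently is exactly what the thread leaves open.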
@@ -165,10 +168,8 @@ def _ensure_data(values: ArrayLike) -> np.ndarray: | |
return np.asarray(values) | ||
|
||
elif is_complex_dtype(values.dtype): | ||
# Incompatible return value type (got "Tuple[Union[Any, ExtensionArray, | ||
# ndarray[Any, Any]], Union[Any, ExtensionDtype]]", expected | ||
# "Tuple[ndarray[Any, Any], Union[dtype[Any], ExtensionDtype]]") | ||
return values # type: ignore[return-value] | ||
assert isinstance(values, np.ndarray) # for mypy | ||
could we potentially get here with PandasArray[complex]?
yes. this is coded to return the values whereas it would need to either extract the underlying numpy array or if not ndarray backed would need to coerce to numpy array. This is how it's done in above for It used to be done this way before #42197. Those changes are in released pandas so I guess there are no 3rd party EA devs with issues. The ignore was added in that PR and is not a false positive. We can either revert those changes or as I have done here, use an assert to fail fast.
or we could leave the ignore for now and add a
i was thinking
yep, can also fix here.
PandasArray[complex] can't be used to test as the numpy array is extracted from a PandasArray. So I guess will need to setup a dummy EA of complex dtype to test. But, it also appears that we don't have tests where integer and floating EAs pass through
so we always call |
||
return values | ||
|
||
# datetimelike | ||
elif needs_i8_conversion(values.dtype): | ||
|
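The two options raised in the thread above for the complex branch (assert and fail fast, or coerce to a NumPy array) can be contrasted in a small sketch. Everything here is an assumption made for illustration: `DummyComplexArray`, `ensure_complex_ndarray`, and `coerce_complex_ndarray` are hypothetical names, and the dummy class is a plain array-like, not a real pandas ExtensionArray.

```python
from typing import Union

import numpy as np


class DummyComplexArray:
    """Hypothetical stand-in for a non-ndarray container of complex values."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype="complex128")

    def __array__(self, dtype=None, copy=None):
        # Let np.asarray() extract the underlying ndarray.
        return self._data if dtype is None else self._data.astype(dtype)


def ensure_complex_ndarray(
    values: Union[np.ndarray, DummyComplexArray]
) -> np.ndarray:
    # Fail fast: the assert narrows the type for mypy and raises
    # AssertionError at runtime for anything that is not an ndarray.
    assert isinstance(values, np.ndarray)
    return values


def coerce_complex_ndarray(
    values: Union[np.ndarray, DummyComplexArray]
) -> np.ndarray:
    # Alternative: coerce instead of asserting, so array-likes are accepted.
    return np.asarray(values)


arr = np.array([1 + 2j])
print(ensure_complex_ndarray(arr) is arr)
print(type(coerce_complex_ndarray(DummyComplexArray([1 + 2j]))).__name__)
```

The assert keeps the return annotation honest without a `# type: ignore`, at the cost of rejecting non-ndarray inputs; the coercing variant accepts them but may copy.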
@@ -1723,11 +1724,7 @@ def safe_sort( | |
if not isinstance(values, (np.ndarray, ABCExtensionArray)): | ||
# don't convert to string types | ||
dtype, _ = infer_dtype_from_array(values) | ||
# error: Argument "dtype" to "asarray" has incompatible type "Union[dtype[Any], | ||
# ExtensionDtype]"; expected "Union[dtype[Any], None, type, _SupportsDType, str, | ||
# Union[Tuple[Any, int], Tuple[Any, Union[int, Sequence[int]]], List[Any], | ||
# _DTypeDict, Tuple[Any, Any]]]" | ||
values = np.asarray(values, dtype=dtype) # type: ignore[arg-type] | ||
values = np.asarray(values, dtype=dtype) | ||
|
||
sorter = None | ||
|
||
|
i think the ints and uints are unchanged
i didn't yet check those. will look tomorrow.