-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH/PERF: use mask in factorize for nullable dtypes #33064
Changes from 1 commit
246b787
af1cdea
d030ff1
be5a21d
b0a88b9
2e94842
bcde548
c97d357
175a8f6
a6bc6fc
889b1b9
0370045
24b528f
6ed5239
1f7e994
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -27,6 +27,7 @@ | |
from pandas.core.dtypes.missing import isna | ||
|
||
from pandas.core import nanops, ops | ||
from pandas.core.algorithms import _factorize_array | ||
import pandas.core.common as com | ||
from pandas.core.indexers import check_array_indexer | ||
from pandas.core.ops import invalid_comparison | ||
|
@@ -481,7 +482,16 @@ def astype(self, dtype, copy: bool = True) -> ArrayLike: | |
def _values_for_factorize(self) -> Tuple[np.ndarray, float]: | ||
# TODO: https://github.com/pandas-dev/pandas/issues/30037 | ||
# use masked algorithms, rather than object-dtype / np.nan. | ||
return self.to_numpy(na_value=np.nan), np.nan | ||
return self.to_numpy(dtype=float, na_value=np.nan), np.nan | ||
|
||
def factorize2(self, na_sentinel: int = -1) -> Tuple[np.ndarray, "ExtensionArray"]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is of course not meant to stay. I think we have two options:
Short term, the first is the easiest. But long term, I think the second would be nice to allow external EAs to more easily use this as well (avoiding the need they have to override the base class There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also think that expanding Could we instead (or also) have a class attribute like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Even in such a case, I would expect the na_value not to be an array. But OK, it's certainly not the most robust way. An alternative could also be to return 3 values (values, None, mask) in case of a mask, and then we can check the number of items returned. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. option 2 (return a mask) would be my preference, even if its a breaking change, much cleaner There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we should simply break this (we would break other projects like GeoPandas), since it is relatively easily avoidable. If we want to replace |
||
arr = self._data | ||
mask = self._mask | ||
|
||
codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask) | ||
|
||
uniques = IntegerArray(uniques, np.zeros(len(uniques), dtype=bool)) | ||
return codes, uniques | ||
|
||
def _values_for_argsort(self) -> np.ndarray: | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we're worried about perf for existing cases, could take this check outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to check the mask for each value inside the loop, so not sure what can be moved outside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the use_mask check; it would basically become a separate loop or even method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.