ENH/PERF: use mask in factorize for nullable dtypes #33064

35 changes: 28 additions & 7 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -368,7 +368,7 @@ cdef class {{name}}HashTable(HashTable):
     def _unique(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
                 Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1,
                 object na_value=None, bint ignore_na=False,
-                bint return_inverse=False):
+                object mask=None, bint return_inverse=False):
         """
         Calculate unique values and labels (no sorting!)

@@ -391,6 +391,10 @@ cdef class {{name}}HashTable(HashTable):
             Whether NA-values should be ignored for calculating the uniques. If
             True, the labels corresponding to missing values will be set to
             na_sentinel.
+        mask : ndarray[bool], optional
+            If not None, the mask is used as indicator for missing values
+            (True = missing, False = valid) instead of `na_value` or
+            condition "val != val".
         return_inverse : boolean, default False
             Whether the mapping of the original array values to their location
             in the vector of uniques should be returned.
@@ -409,12 +413,17 @@ cdef class {{name}}HashTable(HashTable):
             {{dtype}}_t val, na_value2
             khiter_t k
             {{name}}VectorData *ud
-            bint use_na_value
+            bint use_na_value, use_mask
+            uint8_t[:] mask_values

         if return_inverse:
             labels = np.empty(n, dtype=np.int64)
         ud = uniques.data
         use_na_value = na_value is not None
+        use_mask = mask is not None
+
+        if use_mask:
+            mask_values = mask.view("uint8")

         if use_na_value:
             # We need this na_value2 because we want to allow users
@@ -430,7 +439,11 @@ cdef class {{name}}HashTable(HashTable):
             for i in range(n):
                 val = values[i]

-                if ignore_na and (
+                if ignore_na and use_mask:
Member

If we're worried about perf for existing cases, could we take this check outside of the loop?

Member Author

We need to check the mask for each value inside the loop, so I'm not sure what could be moved outside.

Member

I was referring to the use_mask check; the masked case would basically become a separate loop, or even a separate method.

Member Author

I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.
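
For readers following the thread, here is a rough pure-Python sketch of the loop shape under discussion. It is not the Cython implementation from the diff: `factorize_with_mask` is a hypothetical name, a dict stands in for the khash table, and the `na_value` comparison is left out. It only shows that `use_mask` is computed once but still tested on every iteration, which is the check the reviewer suggested hoisting.

```python
def factorize_with_mask(values, mask=None, na_sentinel=-1):
    """Sketch only: dict-based factorize that prefers a mask over NaN checks."""
    table = {}      # value -> position in uniques (stand-in for the hash table)
    uniques = []
    codes = []
    use_mask = mask is not None  # computed once, outside the loop

    for i, val in enumerate(values):
        if use_mask:
            # Masked path: the mask alone decides what is missing.
            if mask[i]:
                codes.append(na_sentinel)
                continue
        elif val != val:
            # Unmasked path: NaN is the only value that differs from itself.
            codes.append(na_sentinel)
            continue

        idx = table.get(val)
        if idx is None:
            idx = len(uniques)
            table[val] = idx
            uniques.append(val)
        codes.append(idx)

    return codes, uniques
```

Hoisting the branch would mean writing the body of the `for` loop twice, once per path, which is the duplication the author prefers to avoid.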

+                    if mask_values[i]:
+                        labels[i] = na_sentinel
+                        continue
+                elif ignore_na and (
                    {{if not name.lower().startswith(("uint", "int"))}}
                    val != val or
                    {{endif}}
@@ -494,7 +507,7 @@ cdef class {{name}}HashTable(HashTable):
                             return_inverse=return_inverse)

     def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,
-                  object na_value=None):
+                  object na_value=None, object mask=None):
         """
         Calculate unique values and labels (no sorting!)

@@ -512,6 +525,10 @@ cdef class {{name}}HashTable(HashTable):
             any value "val" satisfying val != val is considered missing.
             If na_value is not None, then _additionally_, any value "val"
             satisfying val == na_value is considered missing.
+        mask : ndarray[bool], optional
+            If not None, the mask is used as indicator for missing values
+            (True = missing, False = valid) instead of `na_value` or
+            condition "val != val".

         Returns
         -------
@@ -522,7 +539,7 @@ cdef class {{name}}HashTable(HashTable):
         """
         uniques_vector = {{name}}Vector()
         return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
-                            na_value=na_value, ignore_na=True,
+                            na_value=na_value, ignore_na=True, mask=mask,
                             return_inverse=True)

     def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
@@ -855,7 +872,7 @@ cdef class StringHashTable(HashTable):
                             return_inverse=return_inverse)

     def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
-                  object na_value=None):
+                  object na_value=None, object mask=None):
         """
         Calculate unique values and labels (no sorting!)

@@ -873,6 +890,8 @@ cdef class StringHashTable(HashTable):
             that is not a string is considered missing. If na_value is
             not None, then _additionally_ any value "val" satisfying
             val == na_value is considered missing.
+        mask : ndarray[bool], optional
+            Not yet implemented for StringHashTable.

         Returns
         -------
@@ -1094,7 +1113,7 @@ cdef class PyObjectHashTable(HashTable):
                             return_inverse=return_inverse)

     def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
-                  object na_value=None):
+                  object na_value=None, object mask=None):
         """
         Calculate unique values and labels (no sorting!)

@@ -1112,6 +1131,8 @@ cdef class PyObjectHashTable(HashTable):
             any value "val" satisfying val != val is considered missing.
             If na_value is not None, then _additionally_, any value "val"
             satisfying val == na_value is considered missing.
+        mask : ndarray[bool], optional
+            Not yet implemented for PyObjectHashTable.

         Returns
         -------
10 changes: 8 additions & 2 deletions pandas/core/algorithms.py
@@ -455,7 +455,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:


 def _factorize_array(
-    values, na_sentinel: int = -1, size_hint=None, na_value=None
+    values, na_sentinel: int = -1, size_hint=None, na_value=None, mask=None,
 ) -> Tuple[np.ndarray, np.ndarray]:
     """
     Factorize an array-like to codes and uniques.
@@ -473,6 +473,10 @@ def _factorize_array(
         parameter when you know that you don't have any values pandas would
         consider missing in the array (NaN for float data, iNaT for
         datetimes, etc.).
+    mask : ndarray[bool], optional
+        If not None, the mask is used as indicator for missing values
+        (True = missing, False = valid) instead of `na_value` or
+        condition "val != val".

     Returns
     -------
@@ -482,7 +486,9 @@ def _factorize_array(
     hash_klass, values = _get_data_algo(values)

     table = hash_klass(size_hint or len(values))
-    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
+    uniques, codes = table.factorize(
+        values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
+    )

     codes = ensure_platform_int(codes)
     return codes, uniques
12 changes: 11 additions & 1 deletion pandas/core/arrays/integer.py
@@ -27,6 +27,7 @@
 from pandas.core.dtypes.missing import isna

 from pandas.core import nanops, ops
+from pandas.core.algorithms import _factorize_array
 import pandas.core.common as com
 from pandas.core.indexers import check_array_indexer
 from pandas.core.ops import invalid_comparison
@@ -481,7 +482,16 @@ def astype(self, dtype, copy: bool = True) -> ArrayLike:
     def _values_for_factorize(self) -> Tuple[np.ndarray, float]:
         # TODO: https://github.com/pandas-dev/pandas/issues/30037
         # use masked algorithms, rather than object-dtype / np.nan.
-        return self.to_numpy(na_value=np.nan), np.nan
+        return self.to_numpy(dtype=float, na_value=np.nan), np.nan
+
+    def factorize2(self, na_sentinel: int = -1) -> Tuple[np.ndarray, "ExtensionArray"]:
Member Author

This is of course not meant to stay. I think we have two options:

  • Override the base class factorize here on IntegerArray to use the mask (so basically rename factorize2 to factorize)
  • Expand the _values_for_factorize spec to allow it to return a mask instead of na_value. The base class factorize then would need to choose the correct path depending on whether _values_for_factorize returned a boolean ndarray, or something else (scalar na_value)

Short term, the first is the easiest. Long term, though, I think the second would be nicer, because external EAs could then use this more easily (they would not need to override the base class factorize themselves). The second option has more implications, however: we would need to update everywhere it is used internally.

Contributor

I also think that expanding _values_for_factorize is the way to go, but I worry about inferring that the EA wants a masked factorize from na_value being a boolean ndarray. It's an edge case, but consider a nested / ragged array where each "scalar" element is an ndarray.

Could we instead (or also) have a class attribute like _masked_factorize = True, indicating that they want to use the masked variant?
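
To make the proposal concrete, here is a small pure-Python sketch of how a base `factorize` might dispatch on such a flag. Everything in it is hypothetical: `factorize_sketch`, `LegacyArray` and `MaskedArray` are illustration-only names, and `_masked_factorize` is the attribute floated in this comment, not an existing pandas attribute.

```python
def factorize_sketch(arr, na_sentinel=-1):
    """Sketch: read the second return value as a mask only if the class opts in."""
    values, second = arr._values_for_factorize()
    if getattr(arr, "_masked_factorize", False):
        mask = list(second)  # opt-in: second item is a boolean mask
    else:
        # legacy contract: second item is a scalar na_value
        mask = [v is second or v == second for v in values]

    table, codes = {}, []
    for v, missing in zip(values, mask):
        if missing:
            codes.append(na_sentinel)
        else:
            # setdefault assigns the next free position to unseen values
            codes.append(table.setdefault(v, len(table)))
    return codes, list(table)


class LegacyArray:
    """Hypothetical EA returning (values, na_value)."""

    def __init__(self, values):
        self._values = values

    def _values_for_factorize(self):
        return self._values, None


class MaskedArray:
    """Hypothetical EA returning (values, mask) and opting in via the flag."""

    _masked_factorize = True

    def __init__(self, data, mask):
        self._data, self._mask = data, mask

    def _values_for_factorize(self):
        return self._data, self._mask
```

With the explicit flag, the base class never has to guess from the type of the second return value, which is the ambiguity raised for nested or ragged arrays.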

Member Author

> It's an edge case, but consider a nested / ragged array where each "scalar" element is an ndarray.

Even in such a case, I would expect the na_value not to be an array. But OK, it's certainly not the most robust way.

An alternative could be to return 3 values (values, None, mask) in case of a mask, so that we can check the number of items returned.
That is not the cleanest solution either, though. I also don't really like the _masked_factorize attribute, especially if we start doing this in other places as well and then need other similar attributes. Unless we go for a single _is_masked attribute, in which case it is all or nothing.

Contributor

Option 2 (return a mask) would be my preference, even if it's a breaking change; it is much cleaner.

Member Author

I don't think we should simply break this (we would break other projects like GeoPandas), since it is relatively easily avoidable.

If we want to replace na_value in factorize with passing a mask (with a given na_value you can compute the mask in advance with values == na_value), we could deprecate the option of returning an na_value from _values_for_factorize, though.
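
As an illustration of computing the mask in advance, here is a minimal pure-Python sketch. `mask_from_na_value` is a hypothetical helper, not part of pandas. It assumes scalar numeric values, and it treats a NaN `na_value` separately because `values == nan` is never true.

```python
import math


def mask_from_na_value(values, na_value):
    """Sketch: derive a boolean mask (True = missing) from a scalar na_value."""
    if isinstance(na_value, float) and math.isnan(na_value):
        # NaN never compares equal to itself, so test for it explicitly.
        return [isinstance(v, float) and math.isnan(v) for v in values]
    return [v == na_value for v in values]
```

An EA that currently returns `(values, na_value)` could pass the resulting mask to a masked factorize instead, which is what would make deprecating the `na_value` return feasible.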

+        arr = self._data
+        mask = self._mask
+
+        codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)
+
+        uniques = IntegerArray(uniques, np.zeros(len(uniques), dtype=bool))
+        return codes, uniques

     def _values_for_argsort(self) -> np.ndarray:
         """