ENH/PERF: use mask in factorize for nullable dtypes #33064

Merged
Changes from 6 commits
21 changes: 19 additions & 2 deletions asv_bench/benchmarks/algorithms.py
@@ -34,7 +34,16 @@ class Factorize:
params = [
[True, False],
[True, False],
["int", "uint", "float", "string", "datetime64[ns]", "datetime64[ns, tz]"],
[
"int",
"uint",
"float",
"string",
"datetime64[ns]",
"datetime64[ns, tz]",
"Int64",
"boolean",
],
]
param_names = ["unique", "sort", "dtype"]

@@ -49,13 +58,21 @@ def setup(self, unique, sort, dtype):
"datetime64[ns, tz]": pd.date_range(
"2011-01-01", freq="H", periods=N, tz="Asia/Tokyo"
),
"Int64": pd.array(np.arange(N), dtype="Int64"),
"boolean": pd.array(np.random.randint(0, 2, N), dtype="boolean"),
}[dtype]
if not unique:
data = data.repeat(5)
self.idx = data
if dtype in ("Int64", "boolean") and sort:
# sort is not a keyword on EAs
raise NotImplementedError

def time_factorize(self, unique, sort, dtype):
self.idx.factorize(sort=sort)
if sort:
Contributor

Isn't this redundant, since sort is a parameter?

self.idx.factorize(sort=sort)

Member Author

ExtensionArrays don't support the sort keyword; the other values are Index objects, which do have that keyword.
So the tests for sort=True are skipped above when idx is an EA.

Contributor

this is very confusing then. I would separate the EAs out to a separate asv.

Member Author

> this is very confusing then. I would separate the EAs out to a separate asv.

Agree this is confusing. But I switched to using the pd.factorize function in the hope of making this clearer while keeping a single benchmark (the Index method simply calls pd.factorize on itself, so this should benchmark the exact same thing).
That way, we can also remove the skip for sort for EAs.
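
As a minimal sketch of the point made in this thread: the top-level pd.factorize accepts sort for both Index objects and nullable extension arrays, which is what lets a single benchmark cover both. The sample values are made up for illustration, and the behaviour assumes a pandas version that includes this change.

```python
import pandas as pd

# Illustrative data (not taken from the benchmark itself).
idx = pd.Index([3, 1, 3, 2])
arr = pd.array([3, 1, None, 3, 2], dtype="Int64")

# The top-level function accepts sort= for both kinds of input.
codes_idx, uniques_idx = pd.factorize(idx, sort=True)
codes_arr, uniques_arr = pd.factorize(arr, sort=True)

# Missing values in the nullable array get the sentinel code -1.
print(codes_idx, list(uniques_idx))
print(codes_arr, list(uniques_arr))
```

The uniques for the nullable input stay in the nullable dtype rather than being converted to a NumPy array.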

Member Author

@jreback updated. Much better now I think (and fine for a single benchmark class/function)
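
For readers unfamiliar with asv, the single-benchmark idea could look roughly like the sketch below. This is not the code that was merged: the class name, the reduced parameter list, and the small N are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd

N = 10_000  # illustrative size only; real benchmarks typically use larger data


class FactorizeSketch:
    # Hypothetical, trimmed-down variant of the benchmark class in this diff.
    params = [[True, False], [True, False], ["int", "Int64", "boolean"]]
    param_names = ["unique", "sort", "dtype"]

    def setup(self, unique, sort, dtype):
        data = {
            "int": pd.Index(np.arange(N), dtype="int64"),
            "Int64": pd.array(np.arange(N), dtype="Int64"),
            "boolean": pd.array(np.random.randint(0, 2, N), dtype="boolean"),
        }[dtype]
        if not unique:
            data = data.repeat(5)
        self.data = data

    def time_factorize(self, unique, sort, dtype):
        # One code path for Index and extension arrays: no skip needed for sort.
        pd.factorize(self.data, sort=sort)
```

Because the function form takes sort for every input type, the sketch needs neither the NotImplementedError skip in setup nor the if/else on sort.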

self.idx.factorize(sort=sort)
else:
self.idx.factorize()


class Duplicated:
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -274,6 +274,7 @@ Performance improvements
:meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
:issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in :meth:`Series.sum` for nullable (integer and boolean) dtypes (:issue:`30982`).
- Performance improvement in :func:`factorize` for nullable (integer and boolean) dtypes (:issue:`33064`).


.. ---------------------------------------------------------------------------
35 changes: 28 additions & 7 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -368,7 +368,7 @@ cdef class {{name}}HashTable(HashTable):
def _unique(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1,
object na_value=None, bint ignore_na=False,
bint return_inverse=False):
object mask=None, bint return_inverse=False):
"""
Calculate unique values and labels (no sorting!)

@@ -391,6 +391,10 @@ cdef class {{name}}HashTable(HashTable):
Whether NA-values should be ignored for calculating the uniques. If
True, the labels corresponding to missing values will be set to
na_sentinel.
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".
return_inverse : boolean, default False
Whether the mapping of the original array values to their location
in the vector of uniques should be returned.
@@ -409,12 +413,17 @@ cdef class {{name}}HashTable(HashTable):
{{dtype}}_t val, na_value2
khiter_t k
{{name}}VectorData *ud
bint use_na_value
bint use_na_value, use_mask
uint8_t[:] mask_values

if return_inverse:
labels = np.empty(n, dtype=np.int64)
ud = uniques.data
use_na_value = na_value is not None
use_mask = mask is not None

if use_mask:
mask_values = mask.view("uint8")

if use_na_value:
# We need this na_value2 because we want to allow users
@@ -430,7 +439,11 @@ cdef class {{name}}HashTable(HashTable):
for i in range(n):
val = values[i]

if ignore_na and (
if ignore_na and use_mask:
Member

If we're worried about perf for existing cases, could we take this check outside of the loop?

Member Author

We need to check the mask for each value inside the loop, so not sure what can be moved outside?

Member

I was referring to the use_mask check; it would basically become a separate loop or even method

Member Author

I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.
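
To make the trade-off in this thread concrete, here is a pure-Python sketch of the loop shape under discussion: the use_mask flag is computed once, but still tested per element so that the rest of the loop body is shared. The function name and signature are hypothetical, a dict stands in for the hash table, and the na_value and ignore_na handling of the real method is left out.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def factorize_sketch(
    values: Sequence[Hashable],
    mask: Optional[Sequence[bool]] = None,
    na_sentinel: int = -1,
) -> Tuple[List[Hashable], List[int]]:
    """Pure-Python illustration only; not the Cython implementation."""
    table: Dict[Hashable, int] = {}  # value -> position in `uniques`
    uniques: List[Hashable] = []
    labels: List[int] = []
    use_mask = mask is not None  # computed once, tested per element

    for i, val in enumerate(values):
        if use_mask:
            # Missingness comes only from the mask (True = missing).
            if mask[i]:
                labels.append(na_sentinel)
                continue
        elif val != val:
            # Without a mask, fall back to the NaN check "val != val".
            labels.append(na_sentinel)
            continue

        code = table.get(val)
        if code is None:
            code = len(uniques)
            table[val] = code
            uniques.append(val)
        labels.append(code)

    return uniques, labels


uniques, labels = factorize_sketch(
    [1, 2, 1, 0, 2], mask=[False, False, False, True, False]
)
print(uniques, labels)
```

Splitting this into two loops (one per branch) would remove the per-element flag test at the cost of duplicating everything after the missing-value check, which is the duplication the author argues against.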

if mask_values[i]:
labels[i] = na_sentinel
continue
elif ignore_na and (
{{if not name.lower().startswith(("uint", "int"))}}
val != val or
{{endif}}
@@ -494,7 +507,7 @@ cdef class {{name}}HashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -512,6 +525,10 @@ cdef class {{name}}HashTable(HashTable):
any value "val" satisfying val != val is considered missing.
If na_value is not None, then _additionally_, any value "val"
satisfying val == na_value is considered missing.
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".

Returns
-------
@@ -522,7 +539,7 @@ cdef class {{name}}HashTable(HashTable):
"""
uniques_vector = {{name}}Vector()
return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
na_value=na_value, ignore_na=True,
na_value=na_value, ignore_na=True, mask=mask,
return_inverse=True)

def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
@@ -855,7 +872,7 @@ cdef class StringHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -873,6 +890,8 @@ cdef class StringHashTable(HashTable):
that is not a string is considered missing. If na_value is
not None, then _additionally_ any value "val" satisfying
val == na_value is considered missing.
mask : ndarray[bool], optional
Not yet implemented for StringHashTable.

Returns
-------
@@ -1094,7 +1113,7 @@ cdef class PyObjectHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -1112,6 +1131,8 @@ cdef class PyObjectHashTable(HashTable):
any value "val" satisfying val != val is considered missing.
If na_value is not None, then _additionally_, any value "val"
satisfying val == na_value is considered missing.
mask : ndarray[bool], optional
Not yet implemented for PyObjectHashTable.

Returns
-------
10 changes: 8 additions & 2 deletions pandas/core/algorithms.py
@@ -455,7 +455,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:


def _factorize_array(
values, na_sentinel: int = -1, size_hint=None, na_value=None
values, na_sentinel: int = -1, size_hint=None, na_value=None, mask=None,
) -> Tuple[np.ndarray, np.ndarray]:
"""
Factorize an array-like to codes and uniques.
@@ -473,6 +473,10 @@ def _factorize_array(
parameter when you know that you don't have any values pandas would
consider missing in the array (NaN for float data, iNaT for
datetimes, etc.).
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".

Returns
-------
@@ -482,7 +486,9 @@ def _factorize_array(
hash_klass, values = _get_data_algo(values)

table = hash_klass(size_hint or len(values))
uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
uniques, codes = table.factorize(
values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
)

codes = ensure_platform_int(codes)
return codes, uniques
13 changes: 4 additions & 9 deletions pandas/core/arrays/boolean.py
@@ -83,6 +83,10 @@ def type(self) -> Type[np.bool_]:
def kind(self) -> str:
return "b"

@property
def numpy_dtype(self) -> np.dtype:
return np.dtype("bool")

@classmethod
def construct_array_type(cls) -> Type["BooleanArray"]:
"""
@@ -314,15 +318,6 @@ def map_string(s):
scalars = [map_string(x) for x in strings]
return cls._from_sequence(scalars, dtype, copy)

def _values_for_factorize(self) -> Tuple[np.ndarray, int]:
data = self._data.astype("int8")
data[self._mask] = -1
return data, -1

@classmethod
def _from_factorized(cls, values, original: "BooleanArray") -> "BooleanArray":
return cls._from_sequence(values, dtype=original.dtype)

_HANDLED_TYPES = (np.ndarray, numbers.Number, bool, np.bool_)

def __array_ufunc__(self, ufunc, method: str, *inputs, **kwargs):
9 changes: 0 additions & 9 deletions pandas/core/arrays/integer.py
@@ -366,10 +366,6 @@ def _from_sequence_of_strings(
scalars = to_numeric(strings, errors="raise")
return cls._from_sequence(scalars, dtype, copy)

@classmethod
def _from_factorized(cls, values, original) -> "IntegerArray":
return integer_array(values, dtype=original.dtype)

_HANDLED_TYPES = (np.ndarray, numbers.Number)

def __array_ufunc__(self, ufunc, method: str, *inputs, **kwargs):
@@ -479,11 +475,6 @@ def astype(self, dtype, copy: bool = True) -> ArrayLike:
data = self.to_numpy(dtype=dtype, **kwargs)
return astype_nansafe(data, dtype, copy=False)

def _values_for_factorize(self) -> Tuple[np.ndarray, float]:
# TODO: https://github.com/pandas-dev/pandas/issues/30037
# use masked algorithms, rather than object-dtype / np.nan.
return self.to_numpy(na_value=np.nan), np.nan

def _values_for_argsort(self) -> np.ndarray:
"""
Return values for sorting.
17 changes: 15 additions & 2 deletions pandas/core/arrays/masked.py
@@ -1,14 +1,15 @@
from typing import TYPE_CHECKING, Optional, Type, TypeVar
from typing import TYPE_CHECKING, Optional, Tuple, Type, TypeVar

import numpy as np

from pandas._libs import lib, missing as libmissing
from pandas._typing import Scalar
from pandas.util._decorators import doc

from pandas.core.dtypes.common import is_integer, is_object_dtype, is_string_dtype
from pandas.core.dtypes.missing import isna, notna

from pandas.core.algorithms import take
from pandas.core.algorithms import _factorize_array, take
from pandas.core.arrays import ExtensionArray, ExtensionOpsMixin
from pandas.core.indexers import check_array_indexer

@@ -217,6 +218,18 @@ def copy(self: BaseMaskedArrayT) -> BaseMaskedArrayT:
mask = mask.copy()
return type(self)(data, mask, copy=False)

@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
arr = self._data
mask = self._mask

codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)
Contributor

is there a reason you want to call a private routine like this directly? shouldn't factorize just handle this directly? (isn't that the point of _values_for_factorize).

Member Author

This is the way we also do it in the base EA factorize method.
The reason we are using this, and not pd.factorize directly, is because the public factorize does not support the additional na_value and mask keywords.

The point of _values_for_factorize is indeed to spare EA authors from calling this private _factorize_array method themselves (and to make it easier to implement EA.factorize). But here I explicitly do not use the general _values_for_factorize path, so that I can customize/optimize the IntegerArray/BooleanArray.factorize() method specifically for those dtypes.
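
A small NumPy sketch of the difference between the two paths, using made-up data: the sentinel-style encoding mirrors the BooleanArray._values_for_factorize that this PR removes, while the mask-style path hands the existing data and mask arrays over unchanged.

```python
import numpy as np

# Made-up example data; True in `mask` means "missing".
data = np.array([True, False, True, False])
mask = np.array([False, False, True, False])

# Sentinel-style encoding, as in the removed _values_for_factorize:
# copy to int8, then overwrite the missing slots with -1.
encoded = data.astype("int8")
encoded[mask] = -1
print(encoded)

# Mask-style input: `data` and `mask` are passed as they are, so no
# intermediate copy or dtype change is needed before hashing.
print(data, mask)
```

The sentinel array carries the same information as the (data, mask) pair, but building it costs an extra allocation and pass over the values.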

Contributor

this is really polluting the interface, I would much rather just add the keywords. It seems we are special casing EA to no end. This needs to stop.

> The reason we are using this, and not pd.factorize directly, is because the public factorize does not support the additional na_value and mask keywords.

Member Author

Expanding the public interface of factorize is out of scope for this PR, IMO. The implementation above is exactly how we have been doing it for 2 years (we already use _factorize_array in our other EAs). If you want to propose changing this, please open an issue to discuss.


# the hashtables don't handle all different types of bits
uniques = uniques.astype(self.dtype.numpy_dtype, copy=False)
Member Author

I get a "pandas/core/arrays/masked.py:229: error: "ExtensionDtype" has no attribute "numpy_dtype"" mypy failure

cc @simonjayhawkins @WillAyd how can I solve / silence this? The numpy_dtype attribute is common to Int/BoolDtype (so I can safely use it), but not to general ExtensionDtype.

Member

Hmm yeah, so I guess it is complaining because, as far as this class is defined, the return type of self.dtype is an ExtensionDtype (as defined in ExtensionArray)

I guess it comes back to the class design; if we have something else that inherits from BaseMaskedArray, it could fail at runtime if it isn't constructed to return a dtype from self.dtype that has a numpy_dtype attribute, which is a little hefty on the implicitness, I guess

Member Author
@jorisvandenbossche Apr 3, 2020

So the subclasses IntegerArray and BooleanArray have a correctly typed dtype property. But the method above is defined in their parent class...

In principle I could add a dtype property

@property
def dtype(self) -> Union["IntegerDtype", "BooleanDtype"]:
    pass

in the BaseMaskedArray class to solve this, I suppose?
But that is also kind of ugly, as the parent class shouldn't really know about its subclasses...

Member

Right; I think it's going to be tough to make this work with mypy if we implicitly enforce that subclasses make dtype.numpy_dtype available

What does the comment directly preceding it refer to? Perhaps there is a way to do this without breaking the currently implied subclass requirements?

Contributor

Can you override the type signature of IntegerArray.dtype to be IntegerDtype and BooleanArray.dtype to be BooleanDtype?

Member Author

Is there a way to just disable mypy on this line?

> What does the comment directly preceding it refer to? Perhaps there is a way to do this without breaking the currently implied subclass requirements?

The hashtable is only implemented for int64. So if you have an int32 array, the unique values coming out of _factorize_array are int64 and need to be cast back to int32 (as the uniques returned from this method should use the original dtype). For this cast, I need access to the dtype's equivalent numpy dtype, which is available as the numpy_dtype attribute.

I could do this differently, e.g. by building up a mapping of EADtypes -> numpy dtypes and looking it up from there instead of using the attribute, but that would just introduce more complex workarounds only to satisfy mypy.
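
The cast described here can be sketched as follows. The int64 array is a made-up stand-in for what the hashtable would return, and the example assumes a pandas version where nullable integer dtypes expose numpy_dtype and pd.arrays.IntegerArray accepts a (values, mask) pair.

```python
import numpy as np
import pandas as pd

arr = pd.array([10, 20, None, 10], dtype="Int32")

# Stand-in for the uniques handed back by the int64 hashtable (assumed values).
uniques_int64 = np.array([10, 20], dtype="int64")

# Cast back to the NumPy dtype that backs the extension dtype ...
uniques_np = uniques_int64.astype(arr.dtype.numpy_dtype, copy=False)

# ... and wrap the result in a masked array with an all-False mask,
# since the uniques never contain missing values.
uniques = pd.arrays.IntegerArray(uniques_np, np.zeros(len(uniques_np), dtype=bool))
print(uniques.dtype)

# The public method should give uniques in the same dtype as the input.
codes, uniques_public = arr.factorize()
print(codes, uniques_public)
```

Without the cast, the uniques of an Int32 array would come back as Int64, which is what the dtype assertion added to the extension tests guards against.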

Member Author

> How about a MaskedArrayDtype that subclasses ExtensionDtype but has a numpy_dtype property?

Yes, that's probably the cleanest solution architecturally.
But the dtype attribute on BaseMaskedArray would still only be a dummy property just to provide typing, since it is overwritten in both subclasses.

Contributor

> But the dtype attribute on BaseMaskedArray would still only be a dummy property just to provide typing

I don't think you'll need to change anything on the array side. The dtypes will inherit from MaskedExtensionDtype, so mypy should know that integer_array.dtype.numpy_dtype is valid.

Or we add numpy_dtype to the ExtensionDtype API :)

Member Author

> I don't think you'll need to change anything on the array side. The dtypes will inherit from MaskedExtensionDtype, so mypy should know that integer_array.dtype.numpy_dtype is valid.

No, since mypy thinks self.dtype is an ExtensionDtype, so having IntegerDtype/BooleanDtype inherit from a MaskedDtype that defines this attribute will not help.

So either we would indeed need to add numpy_dtype to the ExtensionDtype API, or I need to add a dummy dtype property on BaseMaskedArray to be able to type it as MaskedDtype.
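
The second option mentioned here (a shared masked dtype base plus a typed dtype property on the base array) could look roughly like this. All class names are hypothetical stand-ins rather than pandas classes, and the sketch only illustrates the typing relationship, not real array behaviour.

```python
import numpy as np


class ExtensionDtypeSketch:
    """Stand-in for a generic extension dtype without a numpy_dtype attribute."""


class MaskedDtypeSketch(ExtensionDtypeSketch):
    """Hypothetical shared base for dtypes backed by a NumPy array plus a mask."""

    @property
    def numpy_dtype(self) -> np.dtype:
        raise NotImplementedError


class Int32DtypeSketch(MaskedDtypeSketch):
    @property
    def numpy_dtype(self) -> np.dtype:
        return np.dtype("int32")


class BaseMaskedArraySketch:
    @property
    def dtype(self) -> MaskedDtypeSketch:
        # "Dummy" property: present only so a type checker sees numpy_dtype.
        raise NotImplementedError

    def uniques_dtype(self) -> np.dtype:
        # Should type-check, because self.dtype is declared as MaskedDtypeSketch.
        return self.dtype.numpy_dtype


class Int32ArraySketch(BaseMaskedArraySketch):
    @property
    def dtype(self) -> Int32DtypeSketch:
        # Subclasses override the dummy property with their concrete dtype.
        return Int32DtypeSketch()


print(Int32ArraySketch().uniques_dtype())
```

The cost noted in the thread is visible here: the base class property exists purely for typing and is always overridden by the subclasses.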

uniques = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
return codes, uniques

def value_counts(self, dropna: bool = True) -> "Series":
"""
Returns a Series containing counts of each unique value.
2 changes: 2 additions & 0 deletions pandas/tests/extension/base/methods.py
@@ -133,6 +133,8 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel):

tm.assert_numpy_array_equal(codes_1, codes_2)
self.assert_extension_array_equal(uniques_1, uniques_2)
assert len(uniques_1) == len(pd.unique(uniques_1))
assert uniques_1.dtype == data_for_grouping.dtype

def test_factorize_empty(self, data):
codes, uniques = pd.factorize(data[:0])
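
The two assertions added to test_factorize_equivalence can be tried outside the test suite with a small nullable boolean array. The data is made up, and the snippet assumes a pandas version that includes this change.

```python
import pandas as pd

# Made-up input with repeated and missing values.
arr = pd.array([True, False, None, True, None], dtype="boolean")
codes, uniques = pd.factorize(arr)

# Same checks as the new test lines: no duplicates among the uniques,
# and the uniques keep the dtype of the input.
assert len(uniques) == len(pd.unique(uniques))
assert uniques.dtype == arr.dtype

print(codes, uniques)
```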