ENH/PERF: use mask in factorize for nullable dtypes #33064

Merged
Changes from 8 commits
21 changes: 19 additions & 2 deletions asv_bench/benchmarks/algorithms.py
@@ -34,7 +34,16 @@ class Factorize:
params = [
[True, False],
[True, False],
["int", "uint", "float", "string", "datetime64[ns]", "datetime64[ns, tz]"],
[
"int",
"uint",
"float",
"string",
"datetime64[ns]",
"datetime64[ns, tz]",
"Int64",
"boolean",
],
]
param_names = ["unique", "sort", "dtype"]

@@ -49,13 +58,21 @@ def setup(self, unique, sort, dtype):
"datetime64[ns, tz]": pd.date_range(
"2011-01-01", freq="H", periods=N, tz="Asia/Tokyo"
),
"Int64": pd.array(np.arange(N), dtype="Int64"),
"boolean": pd.array(np.random.randint(0, 2, N), dtype="boolean"),
}[dtype]
if not unique:
data = data.repeat(5)
self.idx = data
if dtype in ("Int64", "boolean") and sort:
# sort is not a keyword on EAs
raise NotImplementedError

def time_factorize(self, unique, sort, dtype):
self.idx.factorize(sort=sort)
if sort:
Contributor: isn't this redundant? since sort is a parameter?

Member (author): ExtensionArrays don't support the sort keyword; the other values are Index objects, which do have that keyword. So the tests for sort=True are skipped above in case of idx being an EA.

Contributor: this is very confusing then. I would separate the EAs out to a separate asv.

Member (author):
> this is very confusing then. I would separate the EAs out to a separate asv.

Agreed this is confusing. But I switched to using the factorize function in the hope of making this clearer, and to keep a single benchmark (the Index method simply calls pd.factorize on itself, so this should benchmark the exact same thing). And that way, we can actually remove the skip for sort for EAs.

Member (author): @jreback updated. Much better now, I think (and fine for a single benchmark class/function).

self.idx.factorize(sort=sort)
else:
self.idx.factorize()


class Duplicated:
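The review discussion above notes that the `pd.factorize` function accepts `sort=True` for nullable extension arrays, while the EA `.factorize` method has no such keyword. A minimal sketch of that behavior (not part of the PR; values are illustrative):

```python
import pandas as pd

# pd.factorize works on nullable extension arrays, including with sort=True;
# NA is encoded with the default sentinel -1 and excluded from uniques.
arr = pd.array([2, 1, 2, pd.NA, 1], dtype="Int64")
codes, uniques = pd.factorize(arr, sort=True)
```

With `sort=True`, `uniques` is the sorted `Int64` array `[1, 2]` and `codes` maps each element accordingly, with `-1` for the missing slot.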
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -276,6 +276,7 @@ Performance improvements
:meth:`DataFrame.sparse.from_spmatrix` constructor (:issue:`32821`,
:issue:`32825`, :issue:`32826`, :issue:`32856`, :issue:`32858`).
- Performance improvement in :meth:`Series.sum` for nullable (integer and boolean) dtypes (:issue:`30982`).
- Performance improvement in :func:`factorize` for nullable (integer and boolean) dtypes (:issue:`33064`).


.. ---------------------------------------------------------------------------
35 changes: 28 additions & 7 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -368,7 +368,7 @@ cdef class {{name}}HashTable(HashTable):
def _unique(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1,
object na_value=None, bint ignore_na=False,
bint return_inverse=False):
object mask=None, bint return_inverse=False):
"""
Calculate unique values and labels (no sorting!)

@@ -391,6 +391,10 @@ cdef class {{name}}HashTable(HashTable):
Whether NA-values should be ignored for calculating the uniques. If
True, the labels corresponding to missing values will be set to
na_sentinel.
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".
return_inverse : boolean, default False
Whether the mapping of the original array values to their location
in the vector of uniques should be returned.
@@ -409,12 +413,17 @@
{{dtype}}_t val, na_value2
khiter_t k
{{name}}VectorData *ud
bint use_na_value
bint use_na_value, use_mask
uint8_t[:] mask_values

if return_inverse:
labels = np.empty(n, dtype=np.int64)
ud = uniques.data
use_na_value = na_value is not None
use_mask = mask is not None

if use_mask:
mask_values = mask.view("uint8")

if use_na_value:
# We need this na_value2 because we want to allow users
@@ -430,7 +439,11 @@
for i in range(n):
val = values[i]

if ignore_na and (
if ignore_na and use_mask:
Member: if we're worried about perf for existing cases, could we take this check outside of the loop?

Member (author): We need to check the mask for each value inside the loop, so not sure what can be moved outside?

Member: I was referring to the use_mask check; it would basically become a separate loop or even method.

Member (author): I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.

if mask_values[i]:
labels[i] = na_sentinel
continue
elif ignore_na and (
{{if not name.lower().startswith(("uint", "int"))}}
val != val or
{{endif}}
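The masked branch above can be sketched in pure Python (an illustration of the logic only, not the actual Cython code; names are simplified):

```python
import numpy as np

# Sketch of the mask-aware labeling loop: mask[i] == True marks a missing
# value and replaces the `val != val` / na_value checks.
values = np.array([1, 2, 1, 0], dtype="int64")
mask = np.array([False, False, False, True])  # last slot is missing
na_sentinel = -1

labels = []
uniques = []
seen = {}
for val, is_na in zip(values, mask):
    if is_na:
        # missing value -> sentinel label; never added to uniques
        labels.append(na_sentinel)
        continue
    if val not in seen:
        seen[val] = len(uniques)
        uniques.append(int(val))
    labels.append(seen[val])
```

This also shows why the use_mask check cannot simply be hoisted out of the loop: the mask must be consulted per element, although the branch itself is cheap.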
@@ -494,7 +507,7 @@ cdef class {{name}}HashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -512,6 +525,10 @@
any value "val" satisfying val != val is considered missing.
If na_value is not None, then _additionally_, any value "val"
satisfying val == na_value is considered missing.
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".

Returns
-------
@@ -522,7 +539,7 @@
"""
uniques_vector = {{name}}Vector()
return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
na_value=na_value, ignore_na=True,
na_value=na_value, ignore_na=True, mask=mask,
return_inverse=True)

def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
@@ -855,7 +872,7 @@ cdef class StringHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -873,6 +890,8 @@
that is not a string is considered missing. If na_value is
not None, then _additionally_ any value "val" satisfying
val == na_value is considered missing.
mask : ndarray[bool], optional
Not yet implemented for StringHashTable.

Returns
-------
@@ -1094,7 +1113,7 @@ cdef class PyObjectHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None):
object na_value=None, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -1112,6 +1131,8 @@
any value "val" satisfying val != val is considered missing.
If na_value is not None, then _additionally_, any value "val"
satisfying val == na_value is considered missing.
mask : ndarray[bool], optional
Not yet implemented for PyObjectHashTable.

Returns
-------
10 changes: 8 additions & 2 deletions pandas/core/algorithms.py
@@ -455,7 +455,7 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:


def _factorize_array(
values, na_sentinel: int = -1, size_hint=None, na_value=None
values, na_sentinel: int = -1, size_hint=None, na_value=None, mask=None,
) -> Tuple[np.ndarray, np.ndarray]:
"""
Factorize an array-like to codes and uniques.
@@ -473,6 +473,10 @@ def _factorize_array(
parameter when you know that you don't have any values pandas would
consider missing in the array (NaN for float data, iNaT for
datetimes, etc.).
mask : ndarray[bool], optional
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".

Returns
-------
@@ -482,7 +486,9 @@
hash_klass, values = _get_data_algo(values)

table = hash_klass(size_hint or len(values))
uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)
uniques, codes = table.factorize(
values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
)

codes = ensure_platform_int(codes)
return codes, uniques
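One visible effect of threading `mask` through `_factorize_array` is that nullable dtypes no longer round-trip through float data with np.nan for factorization. A hedged sketch of the public behavior (values are illustrative):

```python
import pandas as pd

# Factorizing a nullable integer array: with the mask-based path, the
# integer data buffer is hashed directly and uniques keep the Int64 dtype,
# instead of the earlier conversion via to_numpy(na_value=np.nan).
arr = pd.array([1, pd.NA, 1], dtype="Int64")
codes, uniques = pd.factorize(arr)
```

Here `codes` is `[0, -1, 0]` (NA gets the na_sentinel) and `uniques` is the `Int64` array `[1]`.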
13 changes: 4 additions & 9 deletions pandas/core/arrays/boolean.py
@@ -83,6 +83,10 @@ def type(self) -> Type[np.bool_]:
def kind(self) -> str:
return "b"

@property
def numpy_dtype(self) -> np.dtype:
return np.dtype("bool")

@classmethod
def construct_array_type(cls) -> Type["BooleanArray"]:
"""
@@ -314,15 +318,6 @@ def map_string(s):
scalars = [map_string(x) for x in strings]
return cls._from_sequence(scalars, dtype, copy)

def _values_for_factorize(self) -> Tuple[np.ndarray, int]:
data = self._data.astype("int8")
data[self._mask] = -1
return data, -1

@classmethod
def _from_factorized(cls, values, original: "BooleanArray") -> "BooleanArray":
return cls._from_sequence(values, dtype=original.dtype)

_HANDLED_TYPES = (np.ndarray, numbers.Number, bool, np.bool_)

def __array_ufunc__(self, ufunc, method: str, *inputs, **kwargs):
9 changes: 0 additions & 9 deletions pandas/core/arrays/integer.py
@@ -366,10 +366,6 @@ def _from_sequence_of_strings(
scalars = to_numeric(strings, errors="raise")
return cls._from_sequence(scalars, dtype, copy)

@classmethod
def _from_factorized(cls, values, original) -> "IntegerArray":
return integer_array(values, dtype=original.dtype)

_HANDLED_TYPES = (np.ndarray, numbers.Number)

def __array_ufunc__(self, ufunc, method: str, *inputs, **kwargs):
@@ -479,11 +475,6 @@ def astype(self, dtype, copy: bool = True) -> ArrayLike:
data = self.to_numpy(dtype=dtype, **kwargs)
return astype_nansafe(data, dtype, copy=False)

def _values_for_factorize(self) -> Tuple[np.ndarray, float]:
# TODO: https://github.com/pandas-dev/pandas/issues/30037
# use masked algorithms, rather than object-dtype / np.nan.
return self.to_numpy(na_value=np.nan), np.nan

def _values_for_argsort(self) -> np.ndarray:
"""
Return values for sorting.
17 changes: 15 additions & 2 deletions pandas/core/arrays/masked.py
@@ -1,14 +1,15 @@
from typing import TYPE_CHECKING, Optional, Type, TypeVar
from typing import TYPE_CHECKING, Optional, Tuple, Type, TypeVar

import numpy as np

from pandas._libs import lib, missing as libmissing
from pandas._typing import Scalar
from pandas.util._decorators import doc

from pandas.core.dtypes.common import is_integer, is_object_dtype, is_string_dtype
from pandas.core.dtypes.missing import isna, notna

from pandas.core.algorithms import take
from pandas.core.algorithms import _factorize_array, take
from pandas.core.arrays import ExtensionArray, ExtensionOpsMixin
from pandas.core.indexers import check_array_indexer

@@ -217,6 +218,18 @@ def copy(self: BaseMaskedArrayT) -> BaseMaskedArrayT:
mask = mask.copy()
return type(self)(data, mask, copy=False)

@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> Tuple[np.ndarray, ExtensionArray]:
arr = self._data
mask = self._mask

codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)
Contributor: is there a reason you want to call a private routine like this directly? shouldn't factorize just handle this directly? (isn't that the point of _values_for_factorize?)

Member (author): This is the way we also do it in the base EA factorize method. The reason we are using this, and not pd.factorize directly, is that the public factorize does not support the additional na_value and mask keywords.

The point of _values_for_factorize is indeed to avoid EA authors having to call this private _factorize_array method themselves (and to make it easier to implement EA.factorize), but here I explicitly do not use the general _values_for_factorize path, to be able to customize/optimize the IntegerArray/BooleanArray.factorize() method specifically for those dtypes.

Contributor:
> The reason we are using this, and not pd.factorize directly, is that the public factorize does not support the additional na_value and mask keywords.

this is really polluting the interface, I would much rather just add the keywords. It seems we are special-casing EA to no end. This needs to stop.

Member (author): Expanding the public interface of factorize is out of scope for this PR, IMO. The implementation here is exactly how we have already been doing it for two years (we already use _factorize_array in our other EAs). If you want to propose changing this, please open an issue to discuss.


# the hashtables don't handle all different types of bits
uniques = uniques.astype(self.dtype.numpy_dtype, copy=False) # type: ignore
Member: I would prefer not to type: ignore this; I find the current hierarchy of things and the implicit requirements a little wonky.

Member (author): I am happy to do a PR that tries to rework the class hierarchy, but can that be done separately? It is really unrelated to the rest of this PR.

Member: I think it is related - this PR introduces an implicit requirement in factorize that self.dtype has a numpy_dtype attribute. I think mypy is being smart to flag that; again, I would prefer not to ignore the advice.

Member (author): OK, I added a minimal BaseMaskedDtype class to ensure the typing is correct. Can you take a look?

Is there no other way to indicate the types to mypy than adding yet another abstract property, as I did? (The base ExtensionArray class also has an abstract dtype property, typed as ExtensionDtype, so I am overriding an abstract property with a new abstract property just to have a different return type annotation.)

Contributor: I think mypy is right about this one. It only knows (or knew, prior to your commit) that BaseMaskedArray.dtype is ExtensionDtype, which doesn't necessarily have a .numpy_dtype attribute.

Member: Also, wondering: what is the impact of adding such a Generic to the actual class inheritance?

The Generic base class uses a metaclass that defines __getitem__; types are erased at runtime.

> Using generic classes (parameterized or not) to access attributes will result in type check failure. Outside the class definition body, a class attribute cannot be assigned, and can only be looked up by accessing it through a class instance that does not have an instance attribute with the same name

I believe the above restriction applies to 3.6, as Generic was changed in 3.7 to not use a custom metaclass; see https://docs.python.org/3/library/typing.html#user-defined-generic-types

In the Python docs, as far as I can see, Generic is only used to create user-defined type variables that can then be used in annotations, but not for subclassing actual classes; see https://www.python.org/dev/peps/pep-0484/#instantiating-generic-classes-and-type-erasure

Member (author) (@jorisvandenbossche, May 1, 2020): @simonjayhawkins since I don't fully grasp the Generic yet, I updated it with a simpler abstract dtype property (see the last added commit), which also seems to fix it. Are you OK with this, for now?

Member:
> Are you OK with this, for now?

I'll look in a short while.

I know there is significant resistance to Generic, but it would allow, say, ExtensionArray[PeriodDtype] to be equivalent to PeriodArray, and AnyArrayLike could then be parameterised as AnyArrayLike[PeriodDtype]. I suspect at some point we will want to refine types, but that's a discussion for another day.

Member (author): I am certainly open to going that way if it gives better typing in general for EAs and solves a bunch of things. But maybe that can go into a separate PR (at least I am not doing a # type: ignore anymore ;-))

Member:
> Are you OK with this, for now?

LGTM

> But maybe that can go into a separate PR

No problem. Generic (and Protocol) are the two typing additions that could actually alter runtime behaviour, so it is right that we scrutinize this. The only problem I've encountered is the repr of type(<instance>) in an assert statement in py3.6, where the instance is an instance of a Generic class. xref #31574

> at least I am not doing a # type: ignore anymore ;-)

👍

uniques = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
return codes, uniques

def value_counts(self, dropna: bool = True) -> "Series":
"""
Returns a Series containing counts of each unique value.
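The new masked factorize path above can be exercised through the public method. A small sketch (behavior as described in the diff, shown here for the boolean dtype; values are illustrative):

```python
import pandas as pd

# BooleanArray.factorize hands self._data and self._mask to
# _factorize_array; uniques come back as a BooleanArray wrapping an
# all-False mask, i.e. with no missing values.
arr = pd.array([True, False, True, pd.NA], dtype="boolean")
codes, uniques = arr.factorize()
```

`codes` is `[0, 1, 0, -1]` and `uniques` is the `boolean` array `[True, False]`.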
2 changes: 2 additions & 0 deletions pandas/tests/extension/base/methods.py
@@ -133,6 +133,8 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel):

tm.assert_numpy_array_equal(codes_1, codes_2)
self.assert_extension_array_equal(uniques_1, uniques_2)
assert len(uniques_1) == len(pd.unique(uniques_1))
assert uniques_1.dtype == data_for_grouping.dtype

def test_factorize_empty(self, data):
codes, uniques = pd.factorize(data[:0])
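The two assertions added to test_factorize_equivalence check invariants any EA factorize should satisfy; a standalone sketch of the same checks outside the test suite:

```python
import pandas as pd

# uniques must contain no duplicates and must preserve the input dtype.
data = pd.array([1, 1, 2, pd.NA], dtype="Int64")
codes, uniques = pd.factorize(data)

assert len(uniques) == len(pd.unique(uniques))
assert uniques.dtype == data.dtype
```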