ENH/PERF: use mask in factorize for nullable dtypes #33064
Conversation
Additional notes:
pandas/core/arrays/integer.py (outdated)

    -    return self.to_numpy(na_value=np.nan), np.nan
    +    return self.to_numpy(dtype=float, na_value=np.nan), np.nan

         def factorize2(self, na_sentinel: int = -1) -> Tuple[np.ndarray, "ExtensionArray"]:
This is of course not meant to stay. I think we have two options:
- Override the base class factorize here on IntegerArray to use the mask (so basically rename factorize2 to factorize)
- Expand the _values_for_factorize spec to allow it to return a mask instead of na_value. The base class factorize would then need to choose the correct path depending on whether _values_for_factorize returned a boolean ndarray, or something else (a scalar na_value)

Short term, the first is the easiest. But long term, I think the second would be nice, to allow external EAs to more easily use this as well (avoiding the need for them to override the base class factorize as well). But this second option has some more implications (need to update everywhere it is used internally).
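To make the second option concrete, here is a rough pure-Python sketch of what a mask-aware factorize path could look like (the function name and the dict-based hash table are illustrative stand-ins, not the actual Cython implementation):

```python
import numpy as np

def factorize_with_mask(values, mask, na_sentinel=-1):
    # Stand-in for a mask-aware HashTable._unique: positions where mask
    # is True get the sentinel and never enter the hash table, so no
    # per-element na_value comparison is needed.
    table = {}
    codes = np.empty(len(values), dtype=np.intp)
    uniques = []
    for i in range(len(values)):
        if mask[i]:
            codes[i] = na_sentinel
            continue
        val = values[i]
        if val not in table:
            table[val] = len(uniques)
            uniques.append(val)
        codes[i] = table[val]
    return codes, np.asarray(uniques)

# A hypothetical _values_for_factorize returning (values, mask)
# instead of (values, na_value):
data = np.array([1, 2, 1, 3], dtype="int64")
mask = np.array([False, True, False, False])
codes, uniques = factorize_with_mask(data, mask)
```

Under this sketch, the base class factorize would dispatch on whether the second return value is a boolean ndarray (mask path) or a scalar (na_value path).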
I also think that expanding _values_for_factorize is the way to go, but I worry about inferring that the EA wants a mask based on na_value being a boolean ndarray. It's an edge case, but consider a nested / ragged array where each "scalar" element is an ndarray.
Could we instead (or also) have a class attribute like _masked_factorize = True indicating that they want to use the masked variant?
It's an edge case, but consider a nested / ragged array where each "scalar" element is an ndarray.
Even in such a case, I would expect the na_value not to be an array. But OK, it's certainly not the most robust way.
An alternative could also be to return 3 values (values, None, mask) in case of a mask, and then we can check the number of items returned. Also not the cleanest solution, though. I just also don't really like the _masked_factorize attribute .. certainly if we start doing this in other places as well, we might then need other similar attributes. Unless we go for a single _is_masked attribute, and then you need to go all or nothing.
option 2 (return a mask) would be my preference, even if it's a breaking change, much cleaner
I don't think we should simply break this (we would break other projects like GeoPandas), since it is relatively easily avoidable.
If we want to replace na_value in factorize with passing a mask (with a given na_value you can compute the mask in advance with values == na_value), we could deprecate the option of returning an na_value from _values_for_factorize, though.
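The na_value-to-mask bridge mentioned here is cheap to sketch; note that a plain equality check does not work when na_value is NaN, so that case needs np.isnan (a hedged illustration, not pandas code):

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])
na_value = np.nan

# values == na_value is always False for NaN, so handle it separately.
if isinstance(na_value, float) and np.isnan(na_value):
    mask = np.isnan(values)
else:
    mask = values == na_value
```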
Code change looks good. Ideally we can just use masks rather than na_value in internal factorize / unique, though that might be some work (maybe?), e.g. it is easy enough to construct a mask.
This PR? Or follow-on?
option 2 (return a mask) would be my preference, even if it's a breaking change, much cleaner
It would simplify the hashtable, but I am not sure it would always be beneficial performance-wise. I would need to do some testing to compare pre-computing the mask vs the current implementation that checks na_value, for dtypes that don't already have the mask available. And we still have the …
Yes, it's actually easy to do this, just pass …
    @@ -430,7 +439,11 @@ cdef class {{name}}HashTable(HashTable):
             for i in range(n):
                 val = values[i]

    -            if ignore_na and (
    +            if ignore_na and use_mask:
if we're worried about perf for existing cases, could take this check outside of the loop?
We need to check the mask for each value inside the loop, so not sure what can be moved outside?
I was referring to the use_mask check; it would basically become a separate loop or even method
I don't think duplicating the full loop is worth it (the loop itself is 40 lines below here), given the minor performance impact I showed in the timings.
How common do we think it will be for 3rd party EAs to want to use a mask-based implementation? e.g. I expect fletcher will? If it's just going to be two internal EAs + fletcher, better to just override (more generally, based on other threads, I now think we should get rid of …
That becomes a 2-pass algorithm, plus allocates a new array. The latter bothers me more than the former.
fletcher doesn't implement … How many external projects are otherwise using …?
Yes, I think that is a good reason to not do this universally.
needs a rebase
        codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)

        # the hashtables don't handle all different types of bits
        uniques = uniques.astype(self.dtype.numpy_dtype, copy=False)
I get a "pandas/core/arrays/masked.py:229: error: "ExtensionDtype" has no attribute "numpy_dtype"" mypy failure.
cc @simonjayhawkins @WillAyd how can I solve / silence this? The numpy_dtype attribute is common to Int/BoolDtype (so I can safely use it), but not to the general ExtensionDtype.
Hmm, yeah, so I guess it's complaining because, as far as this class is defined, the return type of self.dtype is an ExtensionDtype (as defined in ExtensionArray).
I guess it comes back to the class design; if we have something else that inherits from BaseMaskedArray, it could fail at runtime if it isn't constructed to return a dtype from self.dtype that has a numpy_dtype attribute, which is a little hefty on the implicitness, I guess.
So the subclasses IntegerArray and BooleanArray have a correctly typed dtype property. But this method above is defined in their parent class ..
In principle I could add a dtype property

    @property
    def dtype(self) -> Union["IntegerDtype", "BooleanDtype"]:
        pass

in the BaseMaskedArray class to solve this, I suppose?
But that is also kind of ugly, as the parent class shouldn't really know about its subclasses ..
Right; I think it's going to be tough to make this work with mypy if we implicitly enforce that subclasses make dtype.numpy_dtype available.
What does the comment directly preceding it refer to? Perhaps there is a way to do this without breaking the currently implied subclass requirements?
Can you override the type signature of IntegerArray.dtype to be IntegerDtype and BooleanArray.dtype to be BooleanDtype?
Is there a way to just disable mypy on this line?
What does the comment directly preceding it refer to? Perhaps there is a way to do this without breaking the currently implied subclass requirements?
The hashtable is only implemented for int64. So if you have an int32 array, the unique values coming out of _factorize_array are int64, and need to be cast back to int32 (as the uniques returned from this method should use the original dtype). So for this casting, I need access to the dtype's equivalent numpy dtype, which is available as the numpy_dtype attribute.
I could do this differently, e.g. by building up a mapping of EA dtypes -> numpy dtypes and looking it up there instead of using the attribute, but that would just introduce more complex workarounds, just to satisfy mypy.
How about a MaskedArrayDtype that subclasses ExtensionDtype but has a numpy_dtype property?
Yes, that's probably the cleanest solution architecturally
How about a MaskedArrayDtype that subclasses ExtensionDtype but has a numpy_dtype property?
Yes, that's probably the cleanest solution architecturally.
But the dtype attribute on BaseMaskedArray would still only be a dummy property just to provide typing, since it is overwritten in both subclasses.
But the dtype attribute on BaseMaskedArray would still only be a dummy property just to provide typing
I don't think you'll need to change anything on the array side. The dtypes will inherit from MaskedExtensionDtype, so mypy should know that integer_array.dtype.numpy_type is valid.
Or we add numpy_dtype to the ExtensionDtype API :)
I don't think you'll need to change anything on the array side. The dtypes will inherit from MaskedExtensionDtype, so mypy should know that integer_array.dtype.numpy_type is valid.
No, since mypy thinks self.dtype is an ExtensionDtype, having IntegerDtype/BooleanDtype inherit from a MaskedDtype that defines this attribute will not help.
So either we would indeed need to add numpy_dtype to the ExtensionDtype API, or I need to add a dummy dtype property on BaseMaskedArray to be able to type it as MaskedDtype.
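A minimal sketch of what such a shared masked dtype base could look like (ExtensionDtype is replaced by a stand-in class so the snippet is self-contained; the class names are illustrative, not the actual pandas code):

```python
from abc import ABC, abstractmethod

import numpy as np


class ExtensionDtype:
    """Stand-in for pandas.api.extensions.ExtensionDtype."""


class BaseMaskedDtype(ExtensionDtype, ABC):
    # Declares numpy_dtype abstractly: if BaseMaskedArray.dtype is
    # annotated as BaseMaskedDtype, mypy accepts .numpy_dtype access,
    # while subclasses still must provide a concrete value.
    @property
    @abstractmethod
    def numpy_dtype(self) -> np.dtype:
        ...


class Int64Dtype(BaseMaskedDtype):
    @property
    def numpy_dtype(self) -> np.dtype:
        return np.dtype("int64")
```

The abstract property makes the requirement explicit rather than implicit, which is what mypy was flagging.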
asv_bench/benchmarks/algorithms.py (outdated)

        def time_factorize(self, unique, sort, dtype):
            self.idx.factorize(sort=sort)
            if sort:
isn't this redundant, since sort is a parameter?
self.idx.factorize(sort=sort)
ExtensionArrays don't support the sort keyword; the other values are Index objects, which do have that keyword.
So the tests for sort=True are skipped above in case idx is an EA.
this is very confusing then. I would separate the EAs out to a separate asv.
this is very confusing then. I would separate the EAs out to a separate asv.
Agree this is confusing. But I switched to using the factorize function in the hope of making this clearer, and to keep a single benchmark (the index method simply calls pd.factorize on itself, so this should benchmark exactly the same thing).
And that way, we can actually remove the skip for sort for EAs.
@jreback updated. Much better now I think (and fine for a single benchmark class/function)
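The merged benchmark could look roughly like this (an illustrative asv-style class, not the actual asv_bench file; the class name, sizes, and parameters are assumptions):

```python
import numpy as np
import pandas as pd


class Factorize:
    # A single benchmark covering both a numpy-backed Index and a
    # nullable ExtensionArray by going through the top-level
    # pd.factorize(), which supports sort= for both, so no skip
    # logic is needed for the EA case.
    params = [[True, False], ["int64", "Int64"]]
    param_names = ["sort", "dtype"]

    def setup(self, sort, dtype):
        data = np.arange(10_000) % 100
        self.values = (
            pd.Index(data) if dtype == "int64" else pd.array(data, dtype="Int64")
        )

    def time_factorize(self, sort, dtype):
        pd.factorize(self.values, sort=sort)
```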
        arr = self._data
        mask = self._mask

        codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)
is there a reason you want to call a private routine like this directly? shouldn't factorize just handle this directly? (isn't that the point of _values_for_factorize).
This is the way we also do it in the base EA factorize method.
The reason we are using this, and not pd.factorize directly, is that the public factorize does not support the additional na_value and mask keywords.
The point of _values_for_factorize is indeed to avoid that EA authors have to call this private _factorize_array method themselves (and to make it easier to implement EA.factorize), but here I explicitly do not use the general _values_for_factorize path, to be able to customize/optimize the IntegerArray/BooleanArray.factorize() method specifically for those dtypes.
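For reference, the user-visible behaviour this specialised path has to reproduce can be checked through the public API (assuming pandas >= 1.0 with nullable dtypes available):

```python
import pandas as pd

arr = pd.array([1, 2, None, 1], dtype="Int64")
codes, uniques = pd.factorize(arr)

# Missing values map to the na_sentinel (-1 by default) and the
# uniques come back as a nullable array with the original dtype.
print(codes)          # [ 0  1 -1  0]
print(uniques.dtype)  # Int64
```

Whether the mask is used internally or not, this contract stays the same; only the speed and the avoidance of a float/object conversion change.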
this is really polluting the interface, I would much rather just add the keywords. It seems we are special casing EA to no end. This needs to stop.
The reason we are using this, and not pd.factorize directly, is because the public factorize does not support the additional na_value and mask keywords.
Expanding the public interface of factorize is out of scope for this PR, IMO. The implementation I put here above is exactly how we have already been doing it for 2 years (we are already using _factorize_array in our other EAs). If you want to propose changing this, please open an issue to discuss.
pandas/core/arrays/masked.py (outdated)

        codes, uniques = _factorize_array(arr, na_sentinel=na_sentinel, mask=mask)

        # the hashtables don't handle all different types of bits
        uniques = uniques.astype(self.dtype.numpy_dtype, copy=False)  # type: ignore
I would prefer not to type: ignore this; I find the current hierarchy of things and the implicit requirements a little wonky
I am happy to do a PR that tries to rework the class hierarchy, but can that be done separately? Because it is really unrelated to the rest of this PR.
I think it is related: this PR introduces an implicit requirement in factorize that self.dtype has a numpy_dtype attribute. I think mypy is being smart to flag that; again, I would prefer not to ignore the advice.
OK, I added a minimal BaseMaskedDtype class to ensure the typing is correct. Can you take a look?
Is there no other way to indicate the types to mypy than adding yet another abstract property, as I did? (The base ExtensionArray class also has an abstract dtype property, typed as ExtensionDtype, so I am overriding an abstract property with a new abstract property, just to have a different return type annotation.)
I think mypy is right about this one. It only knows (or knew prior to your commit) that BaseMaskedArray.dtype is ExtensionDtype, which doesn't necessarily have a .numpy_dtype attribute.
Also, wondering: what is the impact of adding such a Generic to the actual class inheritance?

The Generic base class uses a metaclass that defines __getitem__. Types are erased at runtime: "Using generic classes (parameterized or not) to access attributes will result in type check failure. Outside the class definition body, a class attribute cannot be assigned, and can only be looked up by accessing it through a class instance that does not have an instance attribute with the same name."
I believe the above restrictions apply to 3.6, as Generic was changed in 3.7 to not use a custom metaclass. See https://docs.python.org/3/library/typing.html#user-defined-generic-types

In the Python docs, as far as I can see, Generic is only used to create user-defined type variables that can then be used in annotations, but not for subclassing actual classes. See https://www.python.org/dev/peps/pep-0484/#instantiating-generic-classes-and-type-erasure
@simonjayhawkins since I don't fully grasp the Generic yet, I updated it with a simpler abstract dtype property (see the last added commit), which also seems to fix it. Are you OK with this, for now?
@simonjayhawkins since I don't fully grasp the Generic yet, I updated it with a simpler abstract dtype property (see the last added commit), which also seems to fix it. Are you OK with this, for now?
I'll look in a short while.
I know there is significant resistance to Generic, but it would allow, say, ExtensionArray[PeriodDtype] to be equivalent to PeriodArray, and then AnyArrayLike could be parameterised as AnyArrayLike[PeriodDtype]. I suspect at some point we will want to refine types, but that's a discussion for another day.
I am certainly open to going that way if it gives better typing in general for EAs and solves a bunch of things. But maybe that can then go into a separate PR (at least I am not doing a # type: ignore anymore ;-))
Are you OK with this, for now?
LGTM
But maybe that can then go into a separate PR
No problem. Generic (and Protocol) are the two typing additions that could actually alter runtime behaviour, so it is right that we scrutinize this. The only problem I've encountered is the repr of type(<instance>) in an assert statement in py3.6, where the instance is an instance of a Generic class. xref #31574
at least I am not doing a # type: ignore anymore ;-)
👍
Were there other remaining comments (apart from the mypy issues)?
Given no further comments, going to merge this.
xref #30037

This adds the option to use a mask in HashTable.factorize (the implementation itself is in HashTable._unique). That allows e.g. IntegerArray to use its mask and avoid the need to convert to object dtype or float dtype (which also avoids a copy of the data), and gives a nice speed-up (instead of checking for na_value or nan, only the mask is checked):

Small example (for easier testing purposes I just added factorize2 that uses the mask, vs the existing factorize, but that is not meant to stay of course):

Since this is adding an extra if branch to factorize, this can impact existing cases that don't use the mask, but I didn't see any noticeable effect (with a few times switching between branches and rebuilding):

So for the int array there seems to be a small slowdown (as expected), although it is still within the +/- bounds. And I would say it is also small enough (around 2-3%) to be OK with this slowdown, given the benefits of being able to use the mask for other cases.