BUG: algorithms.factorize moves null values when sort=False #46601
- np.array([0, 2, 1, 0], dtype=np.dtype("intp")),
- np.array(["a", "b", np.nan], dtype=object),
+ np.array([0, 1, 2, 0], dtype=np.dtype("intp")),
+ np.array(["a", None, "b"], dtype=object),
The change from np.nan to None here was an unintended side-effect; is this undesirable?
IMO this could be called a bug fix, since a user can now preserve None in their input data when using pd.factorize. Probably good to add a whatsnew note for this change.
Thanks! I completely forgot about this open question. Will do.
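For readers following the thread, here is a minimal pure-Python sketch of the behavior being discussed: codes are assigned in order of first appearance (sort=False), and when the sentinel is not used the null keeps its position and its original object (None stays None). The names `is_null` and `factorize_sketch` are hypothetical and this is not how pandas implements factorize; it only models the expected values shown in the diff above.

```python
import math


def is_null(value):
    """Treat None and float NaN as null (a simplification of pandas' isna)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def factorize_sketch(values, use_na_sentinel=True):
    """Return (codes, uniques) with uniques in order of first appearance."""
    codes, uniques, positions = [], [], {}
    null_code = None
    for value in values:
        if is_null(value):
            if use_na_sentinel:
                # Nulls are not added to uniques; they are coded as -1.
                codes.append(-1)
                continue
            if null_code is None:
                # The first null claims the next code and is stored as given.
                null_code = len(uniques)
                uniques.append(value)
            codes.append(null_code)
        else:
            if value not in positions:
                positions[value] = len(uniques)
                uniques.append(value)
            codes.append(positions[value])
    return codes, uniques


print(factorize_sketch(["a", None, "b", "a"], use_na_sentinel=False))
print(factorize_sketch(["a", None, "b", "a"], use_na_sentinel=True))
```

The first call yields codes `[0, 1, 2, 0]` with uniques `['a', None, 'b']`, matching the updated test expectation; the second yields `[0, -1, 1, 0]` with uniques `['a', 'b']`.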
8e66efb to 670c2e8
@jbrockmendel - I've implemented the EA paths. There is a bit of a discrepancy between pd.factorize and EA's factorize (as well as factorize_array). Namely, the former can take I wonder if we should add
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
@jbrockmendel - gentle ping.
over to you @jbrockmendel
pandas/core/arrays/masked.py
Outdated
# check that factorize_array correctly preserves dtype.
assert uniques.dtype == self.dtype.numpy_dtype, (uniques.dtype, self.dtype)

uniques_ea = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
# Make room for a null value if we're not ignoring it and it exists
would it make sense to share any of this with the ArrowArray version? not for this PR, but could have a TODO
Yes, will add a TODO. Once we drop support for pyarrow < 4.0 we won't need this logic in ArrowArray, but 4.0 is only a year old at this point so that will be a while.
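As a rough illustration of the "make room for a null value" step referenced in the diff comment above, the sketch below starts from sentinel-style output (nulls coded as -1, uniques without a null) and inserts the null at the position where it first appeared. It is pure Python, the name `make_room_for_null` and the `null_value` parameter are hypothetical, and it assumes the codes were assigned in first-appearance order; the actual code in masked.py operates on NumPy arrays and masks.

```python
def make_room_for_null(codes, uniques, null_value=None):
    """Insert a null into uniques at its first-appearance position.

    Assumes codes use -1 for nulls and were assigned in order of first
    appearance, so the largest code before the first null tells us how
    many distinct non-null values precede it.
    """
    if -1 not in codes:
        # No nulls present: nothing to make room for.
        return list(codes), list(uniques)

    first_null = codes.index(-1)
    null_code = max(codes[:first_null], default=-1) + 1

    new_codes = []
    for code in codes:
        if code == -1:
            new_codes.append(null_code)
        elif code >= null_code:
            # Shift codes of values that first appear after the null.
            new_codes.append(code + 1)
        else:
            new_codes.append(code)

    new_uniques = list(uniques[:null_code]) + [null_value] + list(uniques[null_code:])
    return new_codes, new_uniques


print(make_room_for_null([0, -1, 1, 0], ["a", "b"]))
```

For the input shown, the null takes code 1, the code for "b" shifts from 1 to 2, and the uniques become `['a', None, 'b']`.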
pandas/core/arrays/string_.py
Outdated
@classmethod
def _from_factorized(cls, values, original):
    assert values.dtype == original._ndarray.dtype
    # When dropna (i.e. ignore_na) is False, can get -1 from nulls
is there any way we could avoid this?
Thanks - yes. Changed _values_for_factorize from -1 to None.
pandas/core/groupby/grouper.py
Outdated
  # we make a CategoricalIndex out of the cat grouper
- # preserving the categories / ordered attributes
+ # preserving the categories / ordered attributes;
+ # doesn't (yet) handle dropna=False
GH ref for the "yet"?
Opened #46909, will add a reference in this comment.
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
@jbrockmendel - friendly ping
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -278,6 +278,8 @@ Other enhancements
  - :meth:`DatetimeIndex.astype` now supports casting timezone-naive indexes to ``datetime64[s]``, ``datetime64[ms]``, and ``datetime64[us]``, and timezone-aware indexes to the corresponding ``datetime64[unit, tzname]`` dtypes (:issue:`47579`)
  - :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`)
  - :meth:`RangeIndex.union` now can return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`)
+ - The method :meth:`.ExtensionArray.factorize` can now be passed ``use_na_sentinel=False`` for determining how null values are to be treated. (:issue:`46601`)
"be passed" -> "takes" or "accepts", i.e. avoid passive voice
Thanks - fixed.
Thanks for your patience. This is near the top of my queue.
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst, pandas/tests/groupby/test_groupby_dropna.py)
  # Avoid using catch_warnings when possible
  # GH#46910 - TimelikeOps has deprecated signature
  codes, uniques = values.factorize( # type: ignore[call-arg]
- use_na_sentinel=True
+ use_na_sentinel=na_sentinel is not None
im trying to follow this and keep getting lost here. is any of this going to get simpler once deprecations are enforced?
Yes - but how much depends on what you mean by "this". There is a lot of playing around with arguments (what I think you're highlighting here) that will all go away. But what I regard as the main complication this PR introduces, noted in the TODO immediately below, will not go away just with the deprecation.
For that TODO, some changes to safe_sort are necessary. For the bulk of cases the changes are straightforward and very small. However, if you have multiple Python types in an object array (e.g. float and string) together with np.nan and None, I have yet to find a good way to sort. I plan to revisit this in the next few days.
I looked at safe_sort again, and @phofl already solved much of the issue in #47331. I opened a PR into this branch here to see what the execution of the TODO mentioned above would look like: rhshadrach#2
However, changing safe_sort in this way will induce another behavior change:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['x', None, np.nan], 'b': [1, 2, 3]})
print(df.groupby('a', dropna=False).sum())
# main
     b
a
x    1
NaN  5
# feature
      b
a
x     1
None  2
NaN   3
No other tests besides the one changed failed locally. While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.
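To make the difference between the two outputs above concrete, here is a toy pure-Python model of the grouping outcome only. It says nothing about how safe_sort or groupby work internally; the function `group_sum` and its `distinguish_nulls` flag are hypothetical, and null keys are shown under string labels purely for display.

```python
import math


def is_nan(value):
    """True only for float NaN (None is handled separately)."""
    return isinstance(value, float) and math.isnan(value)


def group_sum(keys, values, distinguish_nulls):
    """Sum values per key, keeping groups in first-appearance order."""
    totals = {}
    for key, value in zip(keys, values):
        if key is None:
            # "main"-like behavior folds None into the NaN group;
            # "feature"-like behavior keeps it as its own group.
            label = "None" if distinguish_nulls else "NaN"
        elif is_nan(key):
            label = "NaN"
        else:
            label = key
        totals[label] = totals.get(label, 0) + value
    return totals


keys = ["x", None, float("nan")]
values = [1, 2, 3]
print(group_sum(keys, values, distinguish_nulls=False))  # main-like
print(group_sum(keys, values, distinguish_nulls=True))   # feature-like
```

With nulls collapsed, the null group sums to 5; with None and NaN kept distinct, the groups sum to 2 and 3, mirroring the two outputs quoted in the comment.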
@jbrockmendel - friendly ping.
> While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.
Agreed both that the "feature" behavior looks more correct and that it should be done separate from this PR.
Taking another look now.
(Coming in fairly cold), I think I'm getting lost here too. I am not fully comprehending why we check na_sentinel == -1 or na_sentinel is None and then use use_na_sentinel=na_sentinel is not None
Yea - there are some gymnastics here no doubt. The idea behind this block is to avoid using catch_warnings whenever possible.
Old API: na_sentinel is either an integer or None
New API: use_na_sentinel is False or True
The correspondence is:
use_na_sentinel False is equivalent to na_sentinel being None
use_na_sentinel True is equivalent to na_sentinel being -1
Note that the new API has no option corresponding to na_sentinel being anything other than -1 or None in the old API. So we can use the new argument precisely when (a) the function has said argument and (b) na_sentinel is either -1 or None. In such a case, the correspondence from na_sentinel to use_na_sentinel is given by use_na_sentinel = na_sentinel is not None.
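The correspondence described above can be sketched in a few lines. This is only an illustration of the rule, not the code in the PR: the helper names `to_use_na_sentinel` and `can_use_new_api`, the use of `inspect.signature` to test for the argument, and the two stand-in factorize functions are all assumptions made for the example.

```python
import inspect


def to_use_na_sentinel(na_sentinel):
    """Translate the old argument to the new one; only -1 and None map over."""
    if na_sentinel is not None and na_sentinel != -1:
        raise ValueError("no use_na_sentinel equivalent for this na_sentinel")
    return na_sentinel is not None


def can_use_new_api(factorize_func, na_sentinel):
    """True when (a) the function accepts use_na_sentinel and
    (b) na_sentinel is either -1 or None."""
    has_new_arg = "use_na_sentinel" in inspect.signature(factorize_func).parameters
    return has_new_arg and (na_sentinel is None or na_sentinel == -1)


# Hypothetical stand-ins for an updated and a not-yet-updated factorize method.
def new_style_factorize(use_na_sentinel=True):
    return use_na_sentinel


def old_style_factorize(na_sentinel=-1):
    return na_sentinel


for func, na_sentinel in [
    (new_style_factorize, None),
    (new_style_factorize, -1),
    (new_style_factorize, 0),
    (old_style_factorize, -1),
]:
    if can_use_new_api(func, na_sentinel):
        print(func.__name__, na_sentinel, "-> use_na_sentinel =", to_use_na_sentinel(na_sentinel))
    else:
        print(func.__name__, na_sentinel, "-> fall back to the old na_sentinel path")
```

Only the first two cases take the new path; a sentinel of 0 has no new-API equivalent, and the old-style function does not accept the new argument at all.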
Perfect, just the explanation I needed. Thanks!
I'll add this correspondence as a comment to the top of this function; we can remove it when the deprecation is enforced.
@mroeschke im flailing on reviewing this. can you tag in?
One whatsnew comment, one comprehension comment, merge conflict; otherwise, LGTM
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
LGTM just one merge conflict
…orize_na (Conflicts: doc/source/whatsnew/v1.5.0.rst)
Great! Thanks for the effort on this @rhshadrach
…ev#46601)
* BUG: algos.factorizes moves null values when sort=False
* Implementation for pyarrow < 4.0
* fixup
* fixups
* test fixup
* type-hints
* improvements
* remove override of _from_factorized in string_.py
* Rework na_sentinel/dropna/ignore_na
* fixups
* fixup for pyarrow < 4.0
* whatsnew note
* doc fixup
* fixups
* fixup whatsnew note
* whatsnew note; comment on old vs new API
Co-authored-by: asv-bot <pandas.benchmarks@gmail.com>
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
In the example below, the result_index has NaN moved even though sort=False. This is the order that will appear in any groupby reduction result, and it is the reason why transform currently returns wrong results.
cc @jbrockmendel @jreback