Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: algorithms.factorize moves null values when sort=False #46601

Merged
merged 36 commits into from
Aug 18, 2022

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Apr 1, 2022

In the example below, the result_index has nan moved even though sort=False. This is the order that will be in any groupby reduction result and the reason why transform currently returns wrong results.

df = pd.DataFrame({'a': [1, 3, np.nan, 1, 2], 'b': [3, 4, 5, 6, 7]})
print(df.groupby('a', sort=False, dropna=False).grouper.result_index)

# main
Float64Index([1.0, 3.0, 2.0, nan], dtype='float64', name='a')
# this PR
Float64Index([1.0, 3.0, nan, 2.0], dtype='float64', name='a')

cc @jbrockmendel @jreback

@rhshadrach rhshadrach added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 1, 2022
pandas/core/algorithms.py Outdated Show resolved Hide resolved
pandas/core/algorithms.py Outdated Show resolved Hide resolved
pandas/core/algorithms.py Outdated Show resolved Hide resolved
np.array([0, 2, 1, 0], dtype=np.dtype("intp")),
np.array(["a", "b", np.nan], dtype=object),
np.array([0, 1, 2, 0], dtype=np.dtype("intp")),
np.array(["a", None, "b"], dtype=object),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from np.nan to None here was an unintended side-effect; is this undesirable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I think this could be called a bug fix since now a user can preserve None in their input data when using pd.factorize. Probably good to add a whatsnew note for this chage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I completely forgot about this open question. Will do.

@rhshadrach rhshadrach changed the title WIP/BUG: algorithms.factorize moves null values when sort=False BUG: algorithms.factorize moves null values when sort=False Apr 19, 2022
@rhshadrach
Copy link
Member Author

rhshadrach commented Apr 19, 2022

@jbrockmendel - I've implemented the EA paths. There is a bit of a discrepancy between pd.factorize and EA's factorize (as well as factorize_array). Namely, the former can take na_sentinel=None whereas the latter two have na_sentinel being integers. Because of this, I added the dropna to the EA factorize as well as factorize_array. Could also call this argument ignore_na, but I think that isn't used by the public API so dropna seemed better.

I wonder if we should add dropna to pd.factorize and deprecate na_sentinel being None there. Can also go the other route here and allow na_sentinel being None in EA's factorize; no strong opinion but I find na_sentinel=None meaning "Don't drop null values from uniques; and code nulls as -1" not very clear.

@rhshadrach rhshadrach marked this pull request as ready for review April 19, 2022 22:02
@jreback jreback added this to the 1.5 milestone Apr 26, 2022
@rhshadrach
Copy link
Member Author

@jbrockmendel - gentle ping.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

over to you @jbrockmendel


# check that factorize_array correctly preserves dtype.
assert uniques.dtype == self.dtype.numpy_dtype, (uniques.dtype, self.dtype)

uniques_ea = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
# Make room for a null value if we're not ignoring it and it exists
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to share any of this with the ArrowArray version? not for this PR, but could have a TODO

Copy link
Member Author

@rhshadrach rhshadrach Apr 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will add a TODO. Once we drop support for pyarrow < 4.0 we won't need this logic in ArrowArray, but 4.0 is only a year old at this point so that will be a while.

@classmethod
def _from_factorized(cls, values, original):
assert values.dtype == original._ndarray.dtype
# When dropna (i.e. ignore_na) is False, can get -1 from nulls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any way we could avoid this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - yes. Changed _values_for_factorize from -1 to None.

# we make a CategoricalIndex out of the cat grouper
# preserving the categories / ordered attributes
# preserving the categories / ordered attributes;
# doesn't (yet) handle dropna=False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GH ref for the "yet"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #46909, will add a reference in this comment.

@rhshadrach
Copy link
Member Author

@jbrockmendel - friendly ping

@@ -278,6 +278,8 @@ Other enhancements
- :meth:`DatetimeIndex.astype` now supports casting timezone-naive indexes to ``datetime64[s]``, ``datetime64[ms]``, and ``datetime64[us]``, and timezone-aware indexes to the corresponding ``datetime64[unit, tzname]`` dtypes (:issue:`47579`)
- :class:`Series` reducers (e.g. ``min``, ``max``, ``sum``, ``mean``) will now successfully operate when the dtype is numeric and ``numeric_only=True`` is provided; previously this would raise a ``NotImplementedError`` (:issue:`47500`)
- :meth:`RangeIndex.union` now can return a :class:`RangeIndex` instead of a :class:`Int64Index` if the resulting values are equally spaced (:issue:`47557`, :issue:`43885`)
- The method :meth:`.ExtensionArray.factorize` can now be passed ``use_na_sentinel=False`` for determining how null values are to be treated. (:issue:`46601`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"be passed" -> "takes" or "accepts", i.e. avoid passive voice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - fixed.

@jbrockmendel
Copy link
Member

friendly ping

Thanks for your patience. This is near the top of my queue.

…orize_na

� Conflicts:
�	doc/source/whatsnew/v1.5.0.rst
…orize_na

� Conflicts:
�	doc/source/whatsnew/v1.5.0.rst
�	pandas/tests/groupby/test_groupby_dropna.py
# Avoid using catch_warnings when possible
# GH#46910 - TimelikeOps has deprecated signature
codes, uniques = values.factorize( # type: ignore[call-arg]
use_na_sentinel=True
use_na_sentinel=na_sentinel is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im trying to follow this and keep getting lost here. is any of this going to get simpler once deprecations are enforced?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - but how much depends on what you mean by "this". There is a lot of playing around with arguments (what I think you're highlighting here) that will all go away. But what I regard as the main complication this PR introduces, noted in the TODO immediately below, will not go away just with the deprecation.

For that TODO, some changes to safe_sort are necessary. For the bulk of cases the changes are straightforward and only require very small changes. However if you have multiple Python types in an object array (e.g. float and string) with np.nan and None, I haven't yet to find a good way to sort. I plan to revisit this in the next few days.

Copy link
Member Author

@rhshadrach rhshadrach Jul 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at safe_sort again, and @phofl already solved much of the issue in #47331. I opened a PR into this branch here to see what the execution of the TODO mentioned above would look like: rhshadrach#2

However, changing safe_sort in this way will induce another behavior change:

df = pd.DataFrame({'a': ['x', None, np.nan], 'b': [1, 2, 3]})
print(df.groupby('a', dropna=False).sum())

# main
     b
a     
x    1
NaN  5

# feature
      b
a      
x     1
None  2
NaN   3

No other tests besides the one changed failed locally. While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - friendly ping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR.

Agreed both that the "feature" behavior looks more correct and that it should be done separate from this PR.

Taking another look now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Coming in fairly cold), I think I'm getting lost too here. I am not fully comprehending why we check na_sentinel == -1 or na_sentinel is None and then use use_na_sentinel=na_sentinel is not None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea - there are some gymnastics here no doubt. The idea behind this block is to avoid using catch_warnings since it's possible.

Old API: na_sentinel is either an integer or None
New API: use_na_sentinel is False or True

The correspondence is:

use_na_sentinel False is equivalent to na_sentinel being None
use_na_sentinel True is equivalent to na_sentinel is -1

Note there is no option in the new API for na_sentinel being anything other than -1 or None in the old API. So we can use the new argument precisely when (a) the function has said argument and (b) na_sentinel is either -1 or None. In such a case, the correspondence from na_sentinel to use_na_sentinel is given by use_na_sentinel = na_sentinel is not None.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, just the explanation I needed. Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this correspondence as a comment to the top of this function; we can remove it when the deprecation is enforced.

@jbrockmendel
Copy link
Member

@mroeschke im flailing on reviewing this. can you tag in?

@mroeschke mroeschke mentioned this pull request Aug 15, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One whatsnew comment, one comprehension comment, merge conflict; otherwise, LGTM

@rhshadrach rhshadrach requested a review from mroeschke August 17, 2022 12:00
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM just one merge conflict

…orize_na

� Conflicts:
�	doc/source/whatsnew/v1.5.0.rst
@mroeschke mroeschke merged commit 56a7184 into pandas-dev:main Aug 18, 2022
@mroeschke
Copy link
Member

Great! Thanks for the effort on this @rhshadrach

@rhshadrach rhshadrach deleted the factorize_na branch September 10, 2022 16:49
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
…ev#46601)

* BUG: algos.factorizes moves null values when sort=False

* Implementation for pyarrow < 4.0

* fixup

* fixups

* test fixup

* type-hints

* improvements

* remove override of _from_factorized in string_.py

* Rework na_sentinel/dropna/ignore_na

* fixups

* fixup for pyarrow < 4.0

* whatsnew note

* doc fixup

* fixups

* fixup whatsnew note

* whatsnew note; comment on old vs new API

Co-authored-by: asv-bot <pandas.benchmarks@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby with nans always places nans last
4 participants