Value counts normalize #33652


Closed

Changes from all commits · 38 commits
bd9011a  made nan count when dropna=False (DataInformer, Apr 18, 2020)
d9d5ec1  updated changelog (DataInformer, Apr 18, 2020)
86fe7f9  trivial (DataInformer, Apr 18, 2020)
c34a863  added specific test for groupby valuecount interval fix (DataInformer, Apr 18, 2020)
9c1c269  merge master (DataInformer, Apr 18, 2020)
5f8eb1d  updated value_count docstrings (DataInformer, Apr 19, 2020)
1276166  fixed pep8 style (DataInformer, Apr 19, 2020)
a1b7197  fixed more minor style (DataInformer, Apr 19, 2020)
9c3ede3  added test for na in bins (DataInformer, Apr 20, 2020)
0cff92b  added release notes to 1.1 (DataInformer, Apr 20, 2020)
27aa460  trying to avoid docstring warning (DataInformer, Apr 25, 2020)
27c9856  trying to avoid docstring warning (DataInformer, Apr 25, 2020)
f5e9aeb  include nan count when dropna=False (DataInformer, Jun 27, 2020)
99b7112  listed bugfix (DataInformer, Jun 27, 2020)
75374b2  avoided tests that highlight groupby.value_count bug (DataInformer, Jun 27, 2020)
25b6c14  Revert "avoided tests that highlight groupby.value_count bug" (DataInformer, Jul 4, 2020)
277ce52  use series value_counts for groupby (DataInformer, Jul 4, 2020)
73ef54b  Merge branch 'master' of https://github.com/pandas-dev/pandas (DataInformer, Jul 4, 2020)
6b97e0b  Merge branch 'master' into value_counts_part1 (DataInformer, Jul 4, 2020)
797f668  added groupby bin test (DataInformer, Jul 4, 2020)
fce6998  passing groupy valcount tests (DataInformer, Jul 16, 2020)
637a609  nan doesnt work for times (DataInformer, Jul 26, 2020)
c9a4383  passing all value count tests (DataInformer, Jul 27, 2020)
7ae1280  Merge remote-tracking branch 'upstream/master' into value_counts_part1 (DataInformer, Jul 27, 2020)
d2399ea  speedups 1 (DataInformer, Jul 27, 2020)
3299a36  Merge branch 'value_counts_part1' into value_counts_normalize (DataInformer, Jul 27, 2020)
ec92f15  speedup? (DataInformer, Aug 2, 2020)
d6179b0  Revert "speedup?" (DataInformer, Aug 2, 2020)
5abfb16  fixed comments (DataInformer, Aug 10, 2020)
83ccfd2  Merge remote-tracking branch 'upstream/master' into value_counts_norm… (DataInformer, Aug 30, 2020)
5f33834  removed unneeded import (DataInformer, Sep 12, 2020)
8562f1b  Merge remote-tracking branch 'upstream/master' into value_counts_norm… (DataInformer, Sep 12, 2020)
f685cb2  updated to use na_sentinal param (DataInformer, Sep 12, 2020)
c21bdbb  fixed bad test (DataInformer, Sep 12, 2020)
e4c2552  moved doc, reverted permissions (DataInformer, Sep 13, 2020)
74b13d8  more doc and permission fix (DataInformer, Sep 13, 2020)
f0e630a  fixed docstrings (DataInformer, Sep 14, 2020)
9763e83  file perm (DataInformer, Sep 14, 2020)
5 changes: 3 additions & 2 deletions doc/source/whatsnew/v1.2.0.rst
@@ -247,7 +247,8 @@ Numeric
^^^^^^^
- Bug in :func:`to_numeric` where float precision was incorrect (:issue:`31364`)
- Bug in :meth:`DataFrame.any` with ``axis=1`` and ``bool_only=True`` ignoring the ``bool_only`` keyword (:issue:`32432`)
-
- Bug in :meth:`Series.value_counts` with ``dropna=True`` and ``normalize=True`` where the returned proportions did not sum to 1 (:issue:`25970`)


Conversion
^^^^^^^^^^
@@ -315,7 +316,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
-
- Bug in :meth:`DataFrameGroupBy.value_counts` where the result contained wrong index labels when using ``bins`` (:issue:`32471`)

Reshaping
^^^^^^^^^
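Editorial note: an illustrative repro for the ``value_counts`` entry above. The numbers are an assumption inferred from the old normalization comment in pandas/core/algorithms.py below, not output copied from the PR.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# GH 25970: the bins path normalized by the full length of the series,
# NaN included, so with dropna=True the proportions summed to < 1.
out = s.value_counts(bins=3, normalize=True)
print(out.sum())  # 1.0 after the fix; 5/6 before it
```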
37 changes: 17 additions & 20 deletions pandas/core/algorithms.py
@@ -720,17 +720,23 @@ def value_counts(
ascending : bool, default False
Sort in ascending order
normalize : bool, default False
If True then compute a relative histogram
bins : integer, optional
Rather than count values, group them into half-open bins,
convenience for pd.cut, only works with numeric data
If True, then compute a relative histogram that outputs the
proportion of each value.
bins : integer or iterable of numeric, optional
Rather than count values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as the number of bins.
If iterable of numeric, the given numbers are used as bin endpoints.
dropna : bool, default True
Don't include counts of NaN
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.
.. versionchanged:: 1.2

Returns
-------
Series
"""

from pandas.core.series import Series

name = getattr(values, "name", None)
@@ -744,39 +750,30 @@
except TypeError as err:
raise TypeError("bins argument only works with numeric data.") from err

# count, remove nulls (from the index), and but the bins
# count, remove nulls (from the index), and use the bins
result = ii.value_counts(dropna=dropna)
result = result[result.index.notna()]
result.index = result.index.astype("interval")
result = result.sort_index()

# if we are dropna and we have NO values
if dropna and (result._values == 0).all():
result = result.iloc[0:0]

# normalizing is by len of all (regardless of dropna)
counts = np.array([len(ii)])

else:

if is_extension_array_dtype(values):

# handle Categorical and sparse,
# handle Categorical and sparse data,
result = Series(values)._values.value_counts(dropna=dropna)
result.name = name
counts = result._values

else:
keys, counts = value_counts_arraylike(values, dropna)

result = Series(counts, index=keys, name=name)

if sort:
result = result.sort_values(ascending=ascending)

if normalize:
result = result / float(counts.sum())
counts = result._values
result = result / float(max(counts.sum(), 1))
Member: it looks like part of this PR is about normalize and another part is about dropna. Could these be split into independent pieces?


if sort:
result = result.sort_values(ascending=ascending)
return result


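Editorial note: the ``max(counts.sum(), 1)`` guard above covers the all-dropped case. A minimal sketch; the described behavior is an assumption read off the surrounding code, not a quoted test.

```python
import numpy as np
import pandas as pd

# With dropna=True and only NaNs, every bin count is zero and the result
# is emptied, so counts.sum() is 0; dividing by max(0, 1) = 1 avoids a
# 0/0 -> NaN and returns the empty Series unchanged.
s = pd.Series([np.nan, np.nan])
print(s.value_counts(bins=2, normalize=True))  # an empty Series
```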
29 changes: 22 additions & 7 deletions pandas/core/base.py
@@ -1174,17 +1174,20 @@ def value_counts(
Parameters
----------
normalize : bool, default False
If True then the object returned will contain the relative
frequencies of the unique values.
If True, outputs the relative frequencies of the unique values.
sort : bool, default True
Sort by frequencies.
ascending : bool, default False
Sort in ascending order.
bins : int, optional
Rather than count values, group them into half-open bins,
a convenience for ``pd.cut``, only works with numeric data.
bins : integer or iterable of numeric, optional
Rather than count individual values, group them into half-open bins.
Only works with numeric data.
If int, interpreted as the number of bins.
If iterable of numeric, the given numbers are used as bin endpoints.
dropna : bool, default True
Don't include counts of NaN.
If False and NaNs are present, NaN will be a key in the output.
.. versionchanged:: 1.1.2

Returns
-------
@@ -1221,15 +1224,26 @@ def value_counts(

Bins can be useful for going from a continuous variable to a
categorical variable; instead of counting unique
apparitions of values, divide the index in the specified
number of half-open bins.
instances of values, count the number of values that fall
into half-open intervals.

Bins can be an int.

>>> s.value_counts(bins=3)
(2.0, 3.0] 2
(0.996, 2.0] 2
(3.0, 4.0] 1
dtype: int64

Bins can also be an iterable of numbers. These numbers are treated
as endpoints for the intervals.

>>> s.value_counts(bins=[0, 2, 4, 9])
(2.0, 4.0] 3
Member: is there a space missing here?

(-0.001, 2.0] 2
(4.0, 9.0] 0
dtype: int64

**dropna**

With `dropna` set to `False` we can also see NaN index values.
@@ -1242,6 +1256,7 @@
1.0 1
dtype: int64
"""

result = value_counts(
self,
sort=sort,
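Editorial note: a usage sketch combining the two documented options (iterable ``bins`` plus ``dropna=False``). The printed output follows the docstring examples above and is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([3, 1, 2, 3, 4, np.nan])

# Explicit bin endpoints, with NaN kept as its own key.
print(s.value_counts(bins=[0, 2, 4, 9], dropna=False))
# (2.0, 4.0]       3
# (-0.001, 2.0]    2
# NaN              1
# (4.0, 9.0]       0
# dtype: int64
```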
164 changes: 82 additions & 82 deletions pandas/core/groupby/generic.py
@@ -45,7 +45,6 @@
ensure_platform_int,
is_bool,
is_integer_dtype,
is_interval_dtype,
is_numeric_dtype,
is_object_dtype,
is_scalar,
@@ -59,6 +58,7 @@
validate_func_kwargs,
)
import pandas.core.algorithms as algorithms
from pandas.core.algorithms import unique
from pandas.core.arrays import ExtensionArray
from pandas.core.base import DataError, SpecificationError
import pandas.core.common as com
@@ -79,6 +79,7 @@
import pandas.core.indexes.base as ibase
from pandas.core.internals import BlockManager
from pandas.core.series import Series
from pandas.core.sorting import compress_group_index
from pandas.core.util.numba_ import NUMBA_FUNC_CACHE, maybe_use_numba

from pandas.plotting import boxplot_frame_groupby
@@ -685,7 +686,6 @@ def value_counts(
self, normalize=False, sort=True, ascending=False, bins=None, dropna=True
):

from pandas.core.reshape.merge import get_join_indexers
from pandas.core.reshape.tile import cut

if bins is not None and not np.iterable(bins):
@@ -701,111 +701,111 @@

ids, _, _ = self.grouper.group_info
val = self.obj._values
codes = self.grouper.reconstructed_codes # this will track the groups

# groupby removes null keys from groupings
mask = ids != -1
ids, val = ids[mask], val[mask]
if dropna:
mask = ~isna(val)
if not mask.all():
ids, val = ids[mask], val[mask]
Member: codecov reports no testing here

Author: Most features are tested in test_series_groupby_value_counts, which is parameterized and includes dropna as a parameter. That should address the below case as well (since it includes seed_nans in the dataframe generation). I could add specific tests for these cases, but it seems like they are covered by the existing tests.

Member: @simonjayhawkins IIUC codecov gets its results from the travis-37-cov build, which runs with ``not slow``, so we wouldn't expect GroupBy.value_counts to show up there.


if bins is None:
lab, lev = algorithms.factorize(val, sort=True)
llab = lambda lab, inc: lab[inc]
val_lab, val_lev = algorithms.factorize(
val, sort=True, na_sentinel=(None if dropna else -1)
)
else:
# val_lab is a Categorical with categories an IntervalIndex
val_lab = cut(Series(val), bins, include_lowest=True)
val_lev = val_lab.cat.categories
val_lab = val_lab.cat.codes.values

# lab is a Categorical with categories an IntervalIndex
lab = cut(Series(val), bins, include_lowest=True)
lev = lab.cat.categories
lab = lev.take(lab.cat.codes)
llab = lambda lab, inc: lab[inc]._multiindex.codes[-1]

if is_interval_dtype(lab.dtype):
# TODO: should we do this inside II?
sorter = np.lexsort((lab.left, lab.right, ids))
else:
sorter = np.lexsort((lab, ids))
if dropna:
included = val_lab != -1
ids, val_lab = ids[included], val_lab[included]

ids, lab = ids[sorter], lab[sorter]
sorter = np.lexsort((val_lab, ids))
ids, val_lab = ids[sorter], val_lab[sorter]
used_ids = unique(ids)
if max(used_ids) >= len(codes[0]):
# this means we had something skipped from the start
used_ids = compress_group_index(used_ids)[0]
codes = [code[used_ids] for code in codes] # drop what was taken out for n/a

# group boundaries are where group ids change
idx = np.r_[0, 1 + np.nonzero(ids[1:] != ids[:-1])[0]]

# new values are where sorted labels change
lchanges = llab(lab, slice(1, None)) != llab(lab, slice(None, -1))
inc = np.r_[True, lchanges]
inc[idx] = True # group boundaries are also new values
out = np.diff(np.nonzero(np.r_[inc, True])[0]) # value counts

# num. of times each group should be repeated
rep = partial(np.repeat, repeats=np.add.reduceat(inc, idx))

# multi-index components
codes = self.grouper.reconstructed_codes
codes = [rep(level_codes) for level_codes in codes] + [llab(lab, inc)]
levels = [ping.group_index for ping in self.grouper.groupings] + [lev]
change_ids = ids[1:] != ids[:-1]
changes = np.logical_or(change_ids, (val_lab[1:] != val_lab[:-1]))
changes = np.r_[True, changes]
val_lab = val_lab[changes]
ids = ids[changes]
cts = np.diff(np.nonzero(np.r_[changes, True]))[0]
idx = np.r_[0, 1 + np.nonzero(change_ids)[0]]
# how many times each index gets repeated
rep = partial(np.repeat, repeats=np.add.reduceat(changes, idx))

if (not dropna) and (-1 in val_lab):
# in this case we need to explicitly add NaN as a level
val_lev = np.r_[Index([np.nan]), val_lev]
val_lab += 1
Member: codecov reports no testing for this
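
Editorial sketch of the sentinel-shift trick in the lines above (variable names mirror the diff; the data is illustrative):

```python
import numpy as np
import pandas as pd

val_lab = np.array([1, -1, 0])  # -1 is factorize's NaN sentinel
val_lev = np.array([1.5, 2.5])  # the non-NaN levels

levels = np.r_[pd.Index([np.nan]), val_lev]  # NaN becomes level 0
val_lab = val_lab + 1  # shifted codes now index `levels` directly
print(levels[val_lab])  # [2.5 nan 1.5]
```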


levels = [ping.group_index for ping in self.grouper.groupings] + [
Index(val_lev)
]
names = self.grouper.names + [self._selection_name]

if dropna:
mask = codes[-1] != -1
if mask.all():
dropna = False
else:
out, codes = out[mask], [level_codes[mask] for level_codes in codes]

if normalize:
out = out.astype("float")
d = np.diff(np.r_[idx, len(ids)])
if dropna:
m = ids[lab == -1]
np.add.at(d, m, -1)
acc = rep(d)[mask]
else:
acc = rep(d)
out /= acc

if sort and bins is None:
cat = ids[inc][mask] if dropna else ids[inc]
sorter = np.lexsort((out if ascending else -out, cat))
out, codes[-1] = out[sorter], codes[-1][sorter]
num_repeats = np.diff(idx, append=len(change_ids) + 1)
cts = cts.astype("float") / rep(num_repeats)
# each divisor is the number of repeats for that index

if bins is None:
codes = [rep(level_codes) for level_codes in codes] + [val_lab]

if sort:
indices = tuple(reversed(codes[:-1]))
sorter = np.lexsort(
np.r_[[val_lab], [cts if ascending else -cts], indices]
) # sorts using right columns first
cts = cts[sorter]
codes = [code[sorter] for code in codes]

mi = MultiIndex(
levels=levels, codes=codes, names=names, verify_integrity=False
)

if is_integer_dtype(out):
out = ensure_int64(out)
return self.obj._constructor(out, index=mi, name=self._selection_name)
if is_integer_dtype(cts):
cts = ensure_int64(cts)
return self.obj._constructor(cts, index=mi, name=self._selection_name)
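
Editorial note: a self-contained sketch of the change-point counting this path relies on. After lexsorting rows by (group id, value label), each run of identical pairs becomes one output row and the run length is its count (arrays are illustrative):

```python
import numpy as np

ids = np.array([0, 0, 0, 1, 1])      # group id per row, already sorted
val_lab = np.array([0, 0, 1, 1, 1])  # factorized value per row

change_ids = ids[1:] != ids[:-1]
changes = np.r_[True, np.logical_or(change_ids, val_lab[1:] != val_lab[:-1])]
cts = np.diff(np.nonzero(np.r_[changes, True])[0])
print(cts)  # [2 1 2] -> (g0, v0): 2, (g0, v1): 1, (g1, v1): 2
```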

# for compat. with libgroupby.value_counts need to ensure every
# bin is present at every index level, null filled with zeros
diff = np.zeros(len(out), dtype="bool")
for level_codes in codes[:-1]:
diff |= np.r_[True, level_codes[1:] != level_codes[:-1]]

ncat, nbin = diff.sum(), len(levels[-1])

left = [np.repeat(np.arange(ncat), nbin), np.tile(np.arange(nbin), ncat)]

right = [diff.cumsum() - 1, codes[-1]]

_, idx = get_join_indexers(left, right, sort=False, how="left")
out = np.where(idx != -1, out[idx], 0)
nbin = len(levels[-1])
ncat = len(codes[0])
fout = np.zeros((ncat * nbin), dtype=float if normalize else np.int64)
id = 0
change_ids = np.r_[ # need to update now that we removed full repeats
ids[1:] != ids[:-1], True
]
for i, ct in enumerate(cts): # fill in nonzero values of fout
fout[id * nbin + val_lab[i]] = cts[i]
id += change_ids[i]
ncodes = [np.repeat(code, nbin) for code in codes]
ncodes.append(np.tile(range(nbin), len(codes[0])))

if sort:
sorter = np.lexsort((out if ascending else -out, left[0]))
out, left[-1] = out[sorter], left[-1][sorter]

# build the multi-index w/ full levels
def build_codes(lev_codes: np.ndarray) -> np.ndarray:
return np.repeat(lev_codes[diff], nbin)

codes = [build_codes(lev_codes) for lev_codes in codes[:-1]]
codes.append(left[-1])

mi = MultiIndex(levels=levels, codes=codes, names=names, verify_integrity=False)

if is_integer_dtype(out):
out = ensure_int64(out)
return self.obj._constructor(out, index=mi, name=self._selection_name)
indices = tuple(reversed(ncodes[:-1]))
sorter = np.lexsort(
np.r_[[fout if ascending else -fout], indices]
) # sorts using right columns first
fout = fout[sorter]
ncodes = [code[sorter] for code in ncodes]
mi = MultiIndex(
levels=levels, codes=ncodes, names=names, verify_integrity=False
)
if is_integer_dtype(fout):
fout = ensure_int64(fout)
return self.obj._constructor(fout, index=mi, name=self._selection_name)

def count(self) -> Series:
"""
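Editorial note: an end-to-end call exercising the groupby-with-bins path fixed here (GH 32471). The frame and the expectation are illustrative, not taken from the PR's tests.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})

# With the fix, each (key, interval) row carries the correct interval
# label, and zero-count bins are present for every group.
print(df.groupby("key")["x"].value_counts(bins=[0, 2, 4]))
```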