value_counts unexpected behaviour - bins and dropna #25970

Open
krolikowskib opened this issue Apr 3, 2019 · 5 comments
Labels: Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff), Bug

Comments

@krolikowskib

Code Sample, a copy-pastable example if possible

1st example:

Input:

import numpy as np
import pandas as pd
series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
series.value_counts(normalize=True, bins=2)

Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
dtype: float64

Expected output:

(-0.005, 2.0]    0.625
(2.0, 4.0]       0.375
dtype: float64

2nd example:

Input:

series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
series.value_counts(normalize=True, bins=2, dropna=False)

Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
dtype: float64

Expected Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
NaN              0.2
dtype: float64

Problem description

The dropna argument of value_counts() seems to have no effect when bins is not None. The expected behaviour would be preferable: with dropna=True the shares sum to 1, and with dropna=False the share of NaN values is shown.
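A minimal sketch of how the expected numbers can be obtained today, assuming that binning the series manually with pd.cut (with include_lowest=True, which is believed to match what value_counts(bins=...) does internally) is an acceptable workaround. The names binned, shares_dropna and shares_keepna are illustrative only.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])

# Bin first, then count: the NaN entries stay NaN in the binned (categorical) series.
binned = pd.cut(series, bins=2, include_lowest=True)

# dropna=True: shares are computed over the 8 non-null values (0.625 and 0.375).
shares_dropna = binned.value_counts(normalize=True, dropna=True)

# dropna=False: NaN gets its own row and shares are computed over all 10 values.
shares_keepna = binned.value_counts(normalize=True, dropna=False)

print(shares_dropna)
print(shares_keepna)
```

Because dropna is applied after binning here, it takes effect in both directions, which is the behaviour requested above.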

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@mroeschke
Member

Here's the relevant section of code:

# count, remove nulls (from the index), and but the bins
result = ii.value_counts(dropna=dropna)
result = result[result.index.notna()]
result.index = result.index.astype('interval')
result = result.sort_index()
# if we are dropna and we have NO values
if dropna and (result.values == 0).all():
    result = result.iloc[0:0]
# normalizing is by len of all (regardless of dropna)
counts = np.array([len(ii)])

So it appears that the normalization scheme and the exclusion of nan in the index were intentional design decisions. Not sure of the original premise @jreback
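The arithmetic behind the reported and expected outputs can be sketched in plain Python. This is only an illustration of the two normalization choices for the series from the first example, not the pandas implementation; the variable names are made up for the sketch.

```python
import math

values = [1, 1, 2, 0, 1, math.nan, 4, 4, math.nan, 3]
non_null = [v for v in values if not math.isnan(v)]

# Two equal-width bins over [0, 4]: values <= 2 go to the first bin, the rest to the second.
bin_counts = [sum(v <= 2 for v in non_null), sum(v > 2 for v in non_null)]

# Dividing by the length of everything, NaN included, gives the reported 0.5 / 0.3.
by_all = [c / len(values) for c in bin_counts]

# Dividing by the number of non-null values gives the expected 0.625 / 0.375.
by_non_null = [c / len(non_null) for c in bin_counts]

print(by_all, by_non_null)
```

The gap between the two results is exactly the share of the two NaN entries (2 out of 10), which is why the reported shares sum to 0.8 rather than 1.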

@jschendel
Member

1st example
The comment on line 695 ("normalizing is by len of all (regardless of dropna)") seems to indicate that this is the expected behavior, but this is actually inconsistent with the behavior when bins is not specified:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+359.gcaad3b5e5'

In [2]: s = pd.Series([0, 0, np.nan, 1, 2])

In [3]: s
Out[3]: 
0    0.0
1    0.0
2    NaN
3    1.0
4    2.0
dtype: float64

In [4]: s.value_counts(normalize=True)
Out[4]: 
0.0    0.50
2.0    0.25
1.0    0.25
dtype: float64

It seems like this should be consistent across the board regardless of bins.

2nd example
Removing line 687 (result = result[result.index.notna()]) seems to make things work as expected: removing nan is already handled by the dropna parameter on line 686 (result = ii.value_counts(dropna=dropna)), so line 687 seems a bit overeager to remove nan. Not sure if removing 687 breaks anything else though.
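A small sketch of why the index filter discards the NaN row even when dropna=False. It assumes the binned values can be reproduced with pd.cut on the series from the issue; ii, with_nan and filtered are illustrative names that mirror the quoted snippet, not the actual pandas internals.

```python
import numpy as np
import pandas as pd

series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
ii = pd.cut(series, bins=2, include_lowest=True)

# Counting with dropna=False keeps a row for NaN (count 2) next to the two bins.
with_nan = ii.value_counts(dropna=False)

# Filtering on index.notna() afterwards removes that row again,
# so the dropna=False request is effectively undone.
filtered = with_nan[with_nan.index.notna()]

print(with_nan)
print(filtered)
```

Skipping the filter would leave the NaN row in place whenever dropna=False, while dropna=True would still exclude it at the counting step.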

mroeschke added the Algos label Apr 5, 2019
jreback added this to the 0.25.0 milestone Jun 19, 2019
jreback modified the milestones: 0.25.0, Contributions Welcome Jun 28, 2019
@DataInformer

take

@DataInformer

This does seem inconsistent and is still a problem in the latest version. I plan to alter it so that bin frequencies add up to 1 (when dropna=True).

@DataInformer

Testing this also revealed that SeriesGroupBy does value_counts inconsistently with Series. I'm not sure why these tests were passing before. I'll attempt to modify SeriesGroupBy.value_counts as well. Example:

df = pd.DataFrame([[1, 2, 3], [4, np.nan, 6]])
grouped = df.groupby(0)[1]
grouped.value_counts(dropna=False, normalize=True)
Out[12]:
0  1
1  2.0    1.0
4  NaN    1.0
Name: 1, dtype: float64

df[1].value_counts(dropna=False, normalize=True)
Out[13]:
NaN    0.5
2.0    0.5

mroeschke added the Bug label May 13, 2020
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
5 participants