value_counts unexpected behaviour - bins and dropna #25970

krolikowskib · 2019-04-03T08:44:43Z

Code Sample, a copy-pastable example if possible

1st example:

Input:

import numpy as np
import pandas as pd

series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
series.value_counts(normalize=True, bins=2)

Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
dtype: float64

Expected output:

(-0.005, 2.0]    0.625
(2.0, 4.0]       0.375
dtype: float64

2nd example:

Input:

series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
series.value_counts(normalize=True, bins=2, dropna=False)

Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
dtype: float64

Expected Output:

(-0.005, 2.0]    0.5
(2.0, 4.0]       0.3
NaN              0.2
dtype: float64

Problem description

The dropna argument of value_counts() seems to have no effect when bins is not None. The expected behaviour would be better: the shares sum to 1 when dropna=True, and NaN is shown as its own entry when dropna=False.
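For anyone hitting this in the meantime, one possible workaround is to bin explicitly with pd.cut and count the resulting categories. This is only a minimal sketch, not how pandas implements value_counts: `binned_shares` is a hypothetical helper name, and the interval labels produced by pd.cut may differ slightly from those printed by value_counts(bins=...).

```python
import numpy as np
import pandas as pd


def binned_shares(series, bins, dropna=True):
    """Hypothetical helper: normalized bin frequencies that honour dropna."""
    # Bin the values first; NaN entries stay NaN in the binned result.
    binned = pd.cut(series, bins=bins, include_lowest=True)
    # Count each bin, optionally keeping a separate NaN entry.
    counts = binned.value_counts(dropna=dropna, sort=False)
    # Divide by the number of counted entries so the shares sum to 1.
    return counts / counts.sum()


series = pd.Series([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])
print(binned_shares(series, bins=2))                # shares over the 8 non-null values
print(binned_shares(series, bins=2, dropna=False))  # shares over all 10 values, NaN included
```

With dropna=True the two bins should come out as 5/8 and 3/8; with dropna=False the denominator is 10 and NaN gets its own row.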

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


mroeschke · 2019-04-04T01:41:06Z

Here's the relevant section of code:

pandas/pandas/core/algorithms.py

Lines 685 to 696 in caad3b5

    
    # count, remove nulls (from the index), and but the bins
    result = ii.value_counts(dropna=dropna)
    result = result[result.index.notna()]
    result.index = result.index.astype('interval')
    result = result.sort_index()

    # if we are dropna and we have NO values
    if dropna and (result.values == 0).all():
        result = result.iloc[0:0]

    # normalizing is by len of all (regardless of dropna)
    counts = np.array([len(ii)])

So it appears that the normalization scheme and the exclusion of nan in the index were intentional design decisions. Not sure of the original premise @jreback
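To make the effect of that denominator concrete, here is a small NumPy-only sketch using the series from the report. It does not call the pandas internals; it just reproduces the arithmetic, and the variable names are made up for illustration.

```python
import numpy as np

values = np.array([1, 1, 2, 0, 1, np.nan, 4, 4, np.nan, 3])

# Per-bin counts; comparisons with NaN are False, so NaN lands in neither bin.
bin_counts = np.array([np.count_nonzero(values <= 2), np.count_nonzero(values > 2)])

# Denominator described by the quoted comment: length of all values, NaN included.
shares_by_total_length = bin_counts / len(values)

# Denominator the report expects with dropna=True: non-null values only.
shares_by_non_null = bin_counts / np.count_nonzero(~np.isnan(values))

print(shares_by_total_length)  # [0.5 0.3]
print(shares_by_non_null)      # [0.625 0.375]
```

The first result matches the reported output, the second matches the expected output of the 1st example.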

jschendel · 2019-04-04T03:56:58Z

1st example
The comment on line 695 seems to indicate that this is the expected behavior, but this is actually inconsistent with the behavior when bins is not specified:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+359.gcaad3b5e5'

In [2]: s = pd.Series([0, 0, np.nan, 1, 2])

In [3]: s
Out[3]: 
0    0.0
1    0.0
2    NaN
3    1.0
4    2.0
dtype: float64

In [4]: s.value_counts(normalize=True)
Out[4]: 
0.0    0.50
2.0    0.25
1.0    0.25
dtype: float64

It seems like this should be consistent across the board regardless of bins.
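For reference, a short sketch of how the no-bins case treats dropna together with normalize, using the same series as above (variable names are only for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, np.nan, 1, 2])

# dropna=True (the default): NaN is excluded and shares are over 4 values.
shares_without_nan = s.value_counts(normalize=True)

# dropna=False: NaN gets its own entry and shares are over all 5 values.
shares_with_nan = s.value_counts(normalize=True, dropna=False)

print(shares_without_nan.sum())
print(shares_with_nan.sum())
```

In both cases the shares add up to 1, which is the behaviour one would expect to carry over when bins is given.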

2nd example
Removing line 687 seems to make things work as expected, as removing nan is already handled by the dropna parameter on line 686, so 687 seems a bit overeager to remove nan. Not sure if removing 687 breaks anything else though.

DataInformer · 2020-04-17T14:10:54Z

take

DataInformer · 2020-04-17T14:11:59Z

This does seem inconsistent and is still a problem in the latest version. I plan to alter it so that bin frequencies add up to 1 (when dropna=True).

DataInformer · 2020-04-17T17:55:51Z

Testing this also revealed that SeriesGroupBy does value_counts inconsistently with Series. I'm not sure why these tests were passing before. I'll attempt to modify SeriesGroupBy.value_counts as well. Example:
`df = pd.DataFrame([[1,2,3],[4,np.nan,6]])
grouped = df.groupby(0)[1]
grouped.value_counts(dropna=False, normalize=True)
Out[12]:
0  1
1  2.0    1.0
4  NaN    1.0
Name: 1, dtype: float64

df[1].value_counts(dropna=False, normalize=True)
Out[13]:
NaN    0.5
2.0    0.5`

mroeschke added the Algos label (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) Apr 5, 2019

krsnik93 mentioned this issue Jun 18, 2019: Calculate normalized freqs in value_counts correctly when bins is not None #26923 (Closed)

jreback added this to the 0.25.0 milestone Jun 19, 2019

jreback modified the milestones: 0.25.0, Contributions Welcome Jun 28, 2019

github-actions bot assigned DataInformer Apr 17, 2020

This was referenced Apr 18, 2020: Dataframe Groupby value_counts with bins parameter #32471 (Open); Value counts normalize #33652 (Closed)

mroeschke added the Bug label May 13, 2020

jreback mentioned this issue Dec 31, 2020: BUG: GH38672 SeriesGroupBy.value_counts for categorical #38796 (Merged)

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
