DataFrame.nunique is incorrect for DataFrame with no columns #21959

Closed
streamnsight opened this issue Jul 18, 2018 · 6 comments · Fixed by #28213
Labels
Indexing: Related to indexing on series/frames, not to indexes themselves
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Regression: Functionality that used to work in a prior pandas version

Comments

@streamnsight

streamnsight commented Jul 18, 2018

(edit by @TomAugspurger)

Current output:

In [33]: pd.DataFrame(index=[0, 1]).nunique()
Out[33]:
Empty DataFrame
Columns: []
Index: [0, 1]

Expected Output is an empty series:

Out[34]: Series([], dtype: float64)

Not sure what the expected dtype of that Series should be... probably object.

original post below:


Code Sample, a copy-pastable example if possible

With Pandas 0.20.3

import pandas as pd

# create a DataFrame with 3 rows
df = pd.DataFrame({'a': ['A','B','C']})

# lookup unique values for each column, excluding 'a'
unique = df.loc[:, (df.columns != 'a')].nunique()
# this results in an empty Series, the index is also empty
unique.index.tolist()
>>> []
# and
unique[unique == 1].index.tolist()
>>> []

With pandas 0.23.3

import pandas as pd

# create a DataFrame with 3 rows
df = pd.DataFrame({'a': ['A','B','C']})

# lookup unique values for each column, excluding 'a'
unique = df.loc[:, (df.columns != 'a')].nunique()
# this results in an empty Series, but the index is not empty
unique.index.tolist()
>>> [0, 1, 2]
# also:
unique[unique == 1].index.tolist()
>>> [0, 1, 2]

Note:

# if the selection is not empty, the behavior of nunique() seems fine:
df = pd.DataFrame({'a': ['A','B','C'], 'b': [1,1,1]})
unique = df.loc[:, (df.columns != 'a')].nunique()

unique[unique == 1]
>>> b    1
>>> dtype: int64
# and
unique[unique == 1].index.tolist()
>>> ['b']

Problem description

The change of behavior is a bit disturbing, and it looks like a bug:
nunique() should return a Series indexed by the df's columns, but that is not the case here; instead, the result picks up the row index of the df.
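The expectation described above can be written down as a small check. This is only a sketch: the helper name `nunique_index_matches_columns` is made up for illustration, and the check is expected to fail for the column-less frame on the affected 0.23.x releases.

```python
import pandas as pd


def nunique_index_matches_columns(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if df.nunique() is a Series indexed by df's columns."""
    result = df.nunique()
    return isinstance(result, pd.Series) and result.index.equals(df.columns)


# A frame that keeps a column after the selection
with_columns = pd.DataFrame({"a": ["A", "B", "C"], "b": [1, 1, 1]})

# A frame with three rows and no columns, built the same way as in the report
only_a = pd.DataFrame({"a": ["A", "B", "C"]})
no_columns = only_a.loc[:, only_a.columns != "a"]

print(nunique_index_matches_columns(with_columns))
print(nunique_index_matches_columns(no_columns))
```

On a pandas version that includes the fix, both calls should print True; on 0.23.x the second one returns False because the result is an empty DataFrame carrying the row index.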

This is likely related to:

#21932
#21255

I am posting this because in my use case I use the resulting list to drop columns, but I end up with column names that do not exist in the df.
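For anyone hitting the same thing in that use case, one possible workaround is to build the list from the column labels directly instead of from the index of the nunique() result. This is a sketch, not an official recommendation: `single_valued_columns` and its `exclude` parameter are hypothetical names, and it assumes column labels are unique.

```python
import pandas as pd


def single_valued_columns(df: pd.DataFrame, exclude=()) -> list:
    """Hypothetical helper: labels of columns (outside `exclude`) with exactly one distinct value."""
    candidates = [col for col in df.columns if col not in exclude]
    # Iterating over column labels means the result can never contain row labels,
    # even when no columns are left after the exclusion.
    return [col for col in candidates if df[col].nunique() == 1]


df = pd.DataFrame({"a": ["A", "B", "C"]})
to_drop = single_valued_columns(df, exclude=("a",))
cleaned = df.drop(columns=to_drop)

df2 = pd.DataFrame({"a": ["A", "B", "C"], "b": [1, 1, 1]})
to_drop2 = single_valued_columns(df2, exclude=("a",))
cleaned2 = df2.drop(columns=to_drop2)
```

Because it calls Series.nunique() per column, it does not go through DataFrame.nunique() at all, so it should behave the same on 0.22 and 0.23.x.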

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 17.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: None
sphinx: 1.5.5
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 18, 2018
@gfyoung
Member

gfyoung commented Jul 19, 2018

@streamnsight : That indeed bothers me too. Patch is welcome!

What is the latest version where this example fails (i.e. does it break after 0.20.3)?

@streamnsight
Author

@gfyoung it seems to work until 0.22

@streamnsight streamnsight changed the title Change in behavior with empty Series / DataFrame between 0.20.3 and 0.23.3 Change in behavior with empty Series / DataFrame between 0.22 and 0.23.3 Jul 19, 2018
@gfyoung
Member

gfyoung commented Jul 19, 2018

Ah, so this broke in 0.23.0. Marking for 0.23.4 then.

@gfyoung gfyoung added this to the 0.23.4 milestone Jul 19, 2018
@TomAugspurger TomAugspurger changed the title Change in behavior with empty Series / DataFrame between 0.22 and 0.23.3 DataFrame.nunique is incorrect for DataFrame with no columns Jul 19, 2018
@streamnsight
Author

@gfyoung @TomAugspurger
As I pointed out in the original post, I believe this is wider than just a problem with nunique: when I was looking for similar issues, I found a few posts about problems with empty Series / DataFrames. So IMO it is an issue with the way the index is managed for an empty Series / DataFrame.

@TomAugspurger
Contributor

TomAugspurger commented Jul 19, 2018 via email

@jreback jreback modified the milestones: 0.23.4, 0.24.0, 0.23.5 Aug 2, 2018
@kokes
Contributor

kokes commented Aug 28, 2018

git bisect leads me to 76b35c6, but it's quite hard to tell. The underlying issue talks about indexing:

This fixes apply to work correctly when the returned shape mismatches the original. It will try to set the indices if it possible.

Might be a false lead, but it's just what git bisect gave me when running:

import pandas as pd
assert len(pd.DataFrame(index=[0, 1]).nunique().index) == 0
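If someone wants to repeat the bisect with `git bisect run`, the same check can be wrapped so that it returns explicit exit codes. This is a sketch under assumptions: the function name `bisect_exit_code` is made up, and it relies on the `git bisect run` convention that exit code 0 marks a commit as good, 1 as bad, and 125 as "cannot be tested, skip".

```python
def bisect_exit_code() -> int:
    """Hypothetical helper: classify the checked-out pandas commit for `git bisect run`."""
    try:
        import pandas as pd
    except Exception:
        # pandas could not be imported at this commit (e.g. broken build): ask git to skip it
        return 125

    result = pd.DataFrame(index=[0, 1]).nunique()
    # Good commit: the result has an empty index. Bad commit: it carries the row index [0, 1].
    return 0 if len(result.index) == 0 else 1


exit_code = bisect_exit_code()
print(exit_code)
# In an actual bisect script the last line would be: sys.exit(exit_code)
```

Returning 125 for import failures keeps commits that fail to build from being misreported as the first bad commit.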

@jreback jreback modified the milestones: 0.23.5, 0.24.0 Oct 23, 2018
@jreback jreback modified the milestones: 0.24.0, Contributions Welcome Nov 6, 2018