
pd.core.groupby.groupby.DataFrameGroupBy.nth yielding false values with Interval MultiIndex #24205

Closed
Tracked by #7
JoElfner opened this issue Dec 10, 2018 · 11 comments · Fixed by #52968
Labels: Filters (e.g. head, tail, nth), good first issue, Groupby, Needs Tests (Unit test(s) needed to prevent regressions)

Comments

@JoElfner
Contributor

JoElfner commented Dec 10, 2018

Hi,

while using a solution for my problem posted on SO, I stumbled upon this bug. Thanks @jorisvandenbossche for looking into the matter and confirming this to be a bug.

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=50))

df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(0, 100, 5)),  # categorical for temp_a
     pd.cut(df.temp_b, np.arange(0, 100, 5)),   # categorical for temp_b
     pd.cut(df.power_deg, np.arange(0, 1, 1 / 20))  # categorical for power_deg
    ])

df_grpd.nth(1).T
# Out:
temp_a       (65, 70]
temp_b       (50, 55]
power_deg (0.8, 0.85]
temp_a      36.147946
temp_b      57.832956
power_deg    0.983631
eta          0.974274

df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T

# Out:
temp_a       (65, 70]            
temp_b       (50, 55]            
power_deg (0.8, 0.85] (0.8, 0.85]
temp_a      69.038210   46.591379
temp_b      52.510666   57.173709
power_deg    0.846506    0.988345
eta          0.867026    0.885187

Problem description

When grouping with pd.cut, taking the n-th row of each group with df_grpd.nth(n) for n >= 1 returns values that lie outside the group boundaries, as shown with df_grpd.nth(1).T in my code example. Furthermore, some groups contain multiple rows even for n=0, as shown with df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T, again with values outside the interval.
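Whether a returned row really belongs to its group can be checked programmatically, since pd.Interval supports `in` membership tests. A minimal sketch on independent data (observed=True here just skips empty bins); with a correctly working nth(), the same check would hold for every returned row:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"temp": np.random.rand(20) * 50 + 20})

# Group by 5-degree bins; observed=True keeps only non-empty categories.
grouped = df.groupby(pd.cut(df["temp"], np.arange(20, 75, 5)), observed=True)

# Every row of a group must lie inside the Interval labelling that group.
for interval, group in grouped:
    assert all(v in interval for v in group["temp"])
```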

Expected Output

I guess the expected output is quite clear in this case...
For df_grpd.nth(1).T:

temp_a       (65, 70]
temp_b       (50, 55]
power_deg (0.8, 0.85]
temp_a      66.147946  # some values within the interval...
temp_b      53.832956  # some values within the interval...
power_deg    0.833631  # some values within the interval...
eta          0.974274

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Is there currently any workaround for this problem? Can I somehow access the group values in any other way?
Thanks in advance!
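A sketch of one possible workaround for the question above: iterate over the groups explicitly and pick the n-th row with iloc (for a single known key, grouped.get_group(...) should also return that group's rows directly). The names grouped and second are illustrative, not from the thread:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({"temp_a": np.random.rand(50) * 50 + 20,
                   "eta": 1 - np.random.rand(50) / 5})
grouped = df.groupby(pd.cut(df["temp_a"], np.arange(0, 100, 5)), observed=True)

# Take the second row (n=1) of every group that has one, keyed by interval.
rows = {interval: g.iloc[1] for interval, g in grouped if len(g) > 1}
second = pd.DataFrame(rows).T

# Each picked temp_a value lies inside its group's interval.
assert all(row["temp_a"] in interval for interval, row in second.iterrows())
```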

@WillAyd
Member

WillAyd commented Dec 10, 2018

Can you create a minimal sample to recreate the issue? This one is rather large, so it is harder to reason about than it needs to be.

@WillAyd WillAyd added Groupby Needs Info Clarification about behavior needed to assess issue labels Dec 10, 2018
@JoElfner
Contributor Author

JoElfner commented Dec 10, 2018

Yep, reduced it a bit:

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'temp_a': np.random.rand(15) * 50 + 20,
                   'eta': 1 - np.random.rand(15) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=15))

df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(20, 70 + 1, 25))])  # categorical for temp_a

df_grpd.nth(0)
df_grpd.nth(1)
df_grpd.nth(2)  # And so on...

@jorisvandenbossche
Member

@JoElfner your last example code produces a lot of NaNs in the intervals, though

@jorisvandenbossche
Member

Aha, and that actually also seems to be the problem. So based on that, creating a smaller example:

In [147]: idx = pd.MultiIndex([pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
     ...:                      pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
     ...:                     [[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]])

In [148]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)                                                                                                                                                    

In [149]: df                                                                                                                                                                                                        
Out[149]: 
                 col
(0, 1] (0, 10]     0
       (10, 20]    1
       (10, 20]    2
(1, 2] (0, 10]     3
       NaN         4

In [150]: df.groupby(level=[0, 1]).nth(0)                                                                                                                                                                           
Out[150]: 
                 col
(0, 1] (0, 10]     0
       (10, 20]    1
(1, 2] (0, 10]     3
       (0, 10]     4

So the result is kind of correct, apart from the fact that the NaN in the index gets filled with the previous category.

I have a vague recollection we had an issue about this before.
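For reference, the -1 in the codes above is how a MultiIndex encodes a missing label at that position; a minimal sketch (using the codes= keyword of newer pandas) showing that such a code should surface as NaN rather than as a neighbouring category:

```python
import pandas as pd

# In MultiIndex codes, -1 marks a missing (NaN) label; the reported bug
# effectively replaced it with another category when nth() rebuilt the index.
idx = pd.MultiIndex(
    levels=[pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
            pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
    codes=[[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]],
)
assert idx[0] == (pd.Interval(0, 1), pd.Interval(0, 10))
assert pd.isna(idx[-1][1])  # code -1 -> NaN, not a previous category
```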

@JoElfner
Contributor Author

JoElfner commented Dec 10, 2018

Interestingly it is not showing the NaNs in my case... But I extended the bins to not produce any NaNs.

@jorisvandenbossche
Member

But I extended the bins to not produce any NaNs.

But do you still see the bug now? (I don't with your updated example from #24205 (comment))

@JoElfner
Contributor Author

Nope, seems to be gone. Didn't check that thoroughly enough. But still strange that I don't see any NaNs, even when I reproduce joris' example.

@jorisvandenbossche
Member

But still strange that I don't see any NaNs, even when I reproduce joris' example.

Can you print the full code and output of what you did?

@JoElfner
Contributor Author

No, unfortunately not. After a restart of the console my output is the same as in your example. Should have restarted right away... But before the restart I had just copied your example into the IPython console.

@jorisvandenbossche jorisvandenbossche added Bug and removed Needs Info Clarification about behavior needed to assess issue labels Dec 10, 2018
@mroeschke
Member

The example in #24205 (comment) looks fixed. Could use a test.

In [6]: df.groupby(level=[0, 1]).nth(0)
<ipython-input-6-22766b7e6d27>:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(level=[0, 1]).nth(0)
Out[6]: 
                 col
(0, 1] (0, 10]     0
       (10, 20]    1
(1, 2] (0, 10]     3
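A sketch of the kind of regression test being asked for here (the function name is illustrative; an actual pandas test would likely compare full frames with tm.assert_frame_equal). It assumes nth() drops the NaN-labelled row instead of relabelling it, and only compares values so it does not depend on whether nth() returns the original index (pandas >= 2.0) or the group-key index:

```python
import pandas as pd

def test_nth_interval_multiindex_nan():
    # Reconstruct the example from this thread: the last row has a
    # missing (code -1) label in the second level.
    idx = pd.MultiIndex(
        levels=[pd.CategoricalIndex([pd.Interval(0, 1), pd.Interval(1, 2)]),
                pd.CategoricalIndex([pd.Interval(0, 10), pd.Interval(10, 20)])],
        codes=[[0, 0, 0, 1, 1], [0, 1, 1, 0, -1]],
    )
    df = pd.DataFrame({"col": range(5)}, index=idx)

    result = df.groupby(level=[0, 1], observed=True).nth(0)

    # First row of each real group; the NaN-labelled row 4 must be dropped.
    assert list(result["col"]) == [0, 1, 3]

test_nth_interval_multiindex_nan()
```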

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Apr 5, 2023
@ltartaro
Contributor

take
