Indexing broken inside groupby - apply #33058

Closed
diegodlh opened this issue Mar 27, 2020 · 5 comments
Labels
Groupby Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version

@diegodlh

Code Sample, a copy-pastable example if possible

import pandas as pd
import pdb

df = pd.DataFrame(
	{
		'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
		'col2': [1, 2, 3, 4, 5, 6],
	}
)

def fn(x):
	pdb.set_trace()
	x.col2[x.index[-1]] = 0
	return x.col2

result = df.groupby(['col1'], as_index=False).apply(fn)
print(result)

Problem description

The expected output is:

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    0

Instead, I get a Series one row longer than expected:

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    6
   5    0

The problem seems to come from processing the second group (col1 == 'B'), where the index labels do not match the row positions. Paused at the breakpoint (pdb.set_trace()) inside that group, I get the following:

-> x.col2[x.index[-1]] = 0
(Pdb) x.col2     
3    4
4    5
5    6
Name: col2, dtype: int64
(Pdb) x.col2[5]
*** KeyError: 5
(Pdb) x.col2[5] = 0
(Pdb) x.col2
3    4
4    5
5    6
5    0
Name: col2, dtype: int64
(Pdb) x.col2[5]
5    6
5    0
Name: col2, dtype: int64
(Pdb) x.col2[5] = 0
(Pdb) x.col2
3    4
4    5
5    0
5    0
Name: col2, dtype: int64
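
For comparison, here is a minimal sketch of what the same lookup and assignment look like on a standalone Series carrying the labels of group 'B'. It is not part of the original report: the Series is built by hand, and `.loc` is used instead of the plain brackets from the session above to make the label-based intent explicit. The label is found, and assigning to it overwrites the existing row instead of appending a second row labelled 5.

```python
import pandas as pd

# Hand-built stand-in for group 'B' (an assumption for illustration,
# not the object that groupby passes to the applied function).
s = pd.Series([4, 5, 6], index=[3, 4, 5], name="col2")

label = s.index[-1]           # last label of the group, here 5
found = label in s.index      # the label exists, so lookup should not fail
value_before = s.loc[label]   # value stored under that label

# Assigning to an existing label overwrites it in place.
s.loc[label] = 0
```

Inside the apply call on pandas 1.0.x the session above shows the opposite: the lookup raises KeyError and the assignment appends a duplicate label.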

Expected output

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    0

This used to work. Unfortunately, I do not know in which pandas version.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.5.13-050513-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.6.2
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1

@jorisvandenbossche
Member

I see the expected output on pandas 0.25.3, and the wrong output on pandas 1.0.1. But it seems this is working again on master:

In [62]: pd.__version__ 
Out[62]: '1.1.0.dev0+1005.g476f7685e.dirty'

In [63]: print(result) 
0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    0
Name: col2, dtype: int64

So it would be good to add a test case for this, to ensure it keeps working in the future. If we can figure out which patch fixed it, we could also consider backporting that to 1.0.x.
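
A rough sketch of what such a regression test might look like is below. This is not the test that was eventually added to pandas. The names `zero_last` and `test_apply_setitem_on_last_label_keeps_length` are made up for illustration, and the applied function works on an explicit copy with `.loc` rather than the chained assignment from the report, so it may not exercise exactly the same internal code path.

```python
import pandas as pd


def zero_last(group):
    # Hypothetical helper: take an explicit copy so the sketch does not
    # depend on chained-assignment behaviour, which varies across pandas
    # versions, then set the value under the last label to 0.
    col = group["col2"].copy()
    col.loc[col.index[-1]] = 0
    return col


def test_apply_setitem_on_last_label_keeps_length():
    df = pd.DataFrame(
        {
            "col1": ["A", "A", "A", "B", "B", "B"],
            "col2": [1, 2, 3, 4, 5, 6],
        }
    )
    result = df.groupby("col1", as_index=False).apply(zero_last)

    # The buggy output had one extra row with a duplicated label.
    assert len(result) == len(df)
    assert result.index.is_unique
    assert result.tolist() == [1, 2, 0, 4, 5, 0]
```

The assertions check the length, the uniqueness of the index, and the values, since the reported failure showed up as an extra row with a duplicated label.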

@jorisvandenbossche jorisvandenbossche added Groupby Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version labels Mar 27, 2020
@simonjayhawkins
Member

If we can figure out which patch fixed it, we could also consider backporting that to 1.0.x

fa48f5f is the first new commit
commit fa48f5f
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Wed Mar 11 21:30:02 2020 -0700

REF: implement _get_engine_target (#32611)

@simonjayhawkins
Member

I see the expected output on pandas 0.25.3, and the wrong on pandas 1.0.1.

8b6942f is the first bad commit
commit 8b6942f
Author: Marco Neumann <marco@crepererum.net>
Date: Thu Aug 8 22:43:25 2019 +0200

PERF: break reference cycle in Index._engine (#27607)

Fixes #27585
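
For anyone who wants to repeat this kind of search, below is a sketch of a check that could be driven by `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and other codes from 1 to 127 as bad. It is an assumption about how one might automate it, not the procedure used in this thread. The names `run_repro` and `bisect_exit_code` are hypothetical, and the reproducer uses an explicit copy with `.loc` instead of the chained assignment from the report, so it may not hit the same internal path on every commit.

```python
import pandas as pd

EXPECTED = [1, 2, 0, 4, 5, 0]


def zero_last(group):
    # Hypothetical variant of the reporter's fn, written with an explicit
    # copy so it behaves the same across pandas versions.
    col = group["col2"].copy()
    col.loc[col.index[-1]] = 0
    return col


def run_repro():
    df = pd.DataFrame(
        {
            "col1": ["A", "A", "A", "B", "B", "B"],
            "col2": [1, 2, 3, 4, 5, 6],
        }
    )
    result = df.groupby("col1", as_index=False).apply(zero_last)
    return result.tolist()


def bisect_exit_code():
    # 0 = good commit, 1 = bad commit, 125 = commit cannot be tested.
    try:
        values = run_repro()
    except Exception:
        return 125
    return 0 if values == EXPECTED else 1


# When saved as a standalone script for `git bisect run`, end the file
# with: raise SystemExit(bisect_exit_code())
```

To search for the commit that fixed the problem instead of the one that introduced it, the 0 and 1 return values would need to be swapped.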

@simonjayhawkins
Member

If we can figure out which patch fixed it, we could also consider backporting that to 1.0.x

fa48f5f is the first new commit
commit fa48f5f
Author: jbrockmendel <jbrockmendel@gmail.com>
Date: Wed Mar 11 21:30:02 2020 -0700

REF: implement _get_engine_target (#32611)

cc @jbrockmendel

@simonjayhawkins
Member

So it would be good to add a test case for this, to ensure it keeps working in the future. If we can figure out which patch fixed it, we could also consider backporting that to 1.0.x

#33072 added the test; xref #33300 to link/track the potential backport, so we can close this issue.

4 participants