BUG: groupby.transform with execution engine numba does not work in multiindex case since 1.3 #46867

Closed
2 of 3 tasks
derHeinzer opened this issue Apr 25, 2022 · 5 comments · Fixed by #47057
Labels: Bug; numba (numba-accelerated operations); Regression (functionality that used to work in a prior pandas version)

@derHeinzer

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd


df = pd.DataFrame([{'A': 1, 'B': 2, 'C': 3}]).set_index(["A", "B"])


def numba_func(values, index):
    return 1


res = df.groupby('A').transform(
    numba_func, engine="numba"
)

Issue Description

Traceback (most recent call last):
  File "/home/ninja/workspace/lumen_psf/python/psf/script/dev_tst.py", line 11, in <module>
    res = df.groupby('A').transform(
  File "/home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/pandas/core/groupby/generic.py", line 1357, in transform
    return self._transform(
  File "/home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1430, in _transform
    result = self._transform_with_numba(
  File "/home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/pandas/core/groupby/groupby.py", line 1187, in _transform_with_numba
    result = numba_transform_func(
  File "/home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/numba/core/dispatcher.py", line 468, in compile_for_args
    error_rewrite(e, 'typing')
  File "/home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
    raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, C)
During: typing of argument at /home/ninja/miniconda3/envs/py39pandas13/lib/python3.9/site-packages/pandas/core/groupby/numba_.py (167)

File "../../../../../miniconda3/envs/py39pandas13/lib/python3.9/site-packages/pandas/core/groupby/numba_.py", line 167:
def group_transform(
    <source elided>
) -> np.ndarray:
    result = np.empty((len(values), num_columns))
    ^

Expected Behavior

Up to pandas version 1.2 this worked like a charm. Apparently the redesign of the grouping code did not account for MultiIndex cases when using the numba execution engine.
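
The `array(pyobject, 1d, C)` in the error message suggests that an object-dtype array reached numba's nopython compiler. One plausible source is the MultiIndex itself, which converts to a NumPy array of tuples; this is an assumption and is not confirmed in the report. A minimal sketch that only inspects dtypes and does not call numba:

```python
import pandas as pd

df = pd.DataFrame([{"A": 1, "B": 2, "C": 3}]).set_index(["A", "B"])

# A MultiIndex converts to a 1-D object array of tuples, a type that
# numba's nopython mode cannot type precisely.
full_index = df.index.to_numpy()
print(full_index.dtype)

# A single level converts to a plain integer array instead.
level_a = df.index.get_level_values("A").to_numpy()
print(level_a.dtype)
```

If the assumption holds, the contrast between the two dtypes would explain why the same call works with a flat integer index but fails with a MultiIndex.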

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.9.12.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-40-generic
Version : #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.6
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.4
setuptools : 59.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.55.1

@derHeinzer derHeinzer added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 25, 2022
@simonjayhawkins simonjayhawkins added the numba numba-accelerated operations label May 18, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 18, 2022
@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label May 18, 2022
@simonjayhawkins
Member

Thanks @derHeinzer for the report.

> Up to pandas version 1.2 this worked like a charm.

first bad commit: [5512bc4] Backport PR #43172: BUG: Pass index data correctly in groupby.transform/agg w/ engine=numba (#43250)

cc @mroeschke

@simonjayhawkins simonjayhawkins removed the Needs Triage Issue that has not been reviewed by a pandas team member label May 18, 2022
@jreback jreback added this to the 1.4.3 milestone May 19, 2022
@derHeinzer
Author

This bug still arises in some situations in pandas 1.4.4.

I ran into it in a situation where the first index level did not have a NumPy dtype but the pandas extension dtype pd.core.arrays.integer.Int64Dtype. This is the case, for example, when the DataFrame is loaded via Google Cloud BigQuery's Python library.

What are these pandas dtypes good for anyways?!

Minimal reproducible example:

import pandas as pd
p = 10
i = 20
id_name = "id"
date_name = "date"
idx_names = [id_name, date_name]
dr = pd.date_range("2020-01-01", periods=p)
idx1 = pd.MultiIndex.from_product([range(i), dr], names=idx_names)
idx2 = pd.MultiIndex.from_product([range(i), dr], names=idx_names)

s1 = pd.Series(range(p*i), index=idx1)
s2 = pd.Series(range(p*i), index=idx2)

# set the dtype of the first index level to a pandas nullable (extension) dtype
s2 = s2.reset_index()
s2[id_name] = s2[id_name].astype(pd.core.arrays.integer.Int64Dtype())
s2 = s2.set_index(idx_names)

s1.groupby(id_name).transform(
    lambda values, index: 1, engine="numba"
)  # works
s2.groupby(id_name).transform(
    lambda values, index: 1, engine="numba"
)  # fails
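
A possible workaround, sketched under the assumption that the nullable `Int64` index level is what triggers the failure, is to cast that level back to NumPy `int64` before calling `transform`. The helper `with_numpy_int_level` is hypothetical and not part of pandas. The sketch only checks the resulting dtype; it does not run the numba engine, so whether it avoids the error on a given pandas version is untested here.

```python
import pandas as pd


def with_numpy_int_level(obj, level):
    """Hypothetical helper: copy obj, casting one index level to NumPy int64.

    Assumes obj has a MultiIndex and that the named level holds integers.
    """
    position = obj.index.names.index(level)
    numpy_level = obj.index.levels[position].astype("int64")
    out = obj.copy()
    out.index = obj.index.set_levels(numpy_level, level=level)
    return out


dates = pd.date_range("2020-01-01", periods=2)
ids = pd.array([0, 1, 2], dtype="Int64")  # nullable extension dtype
index = pd.MultiIndex.from_product([ids, dates], names=["id", "date"])
s = pd.Series(range(len(index)), index=index)

converted = with_numpy_int_level(s, "id")
print(converted.index.get_level_values("id").dtype)
```

The converted object could then be passed to `groupby("id").transform(..., engine="numba")` in place of the original one.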

@mroeschke
Member

The numba functionality has not been adapted to work with pandas ExtensionTypes yet, but a pull request would be most welcome!

> What are these pandas dtypes good for anyways?!

pandas nullable types are still relatively new (since 1.0), and a lot of effort has been made to extend their compatibility throughout the large API. Please be mindful of the tone as pandas development is still largely supported by volunteer contributions.
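
As a brief illustration of what the nullable types provide (a minimal sketch, not taken from the thread): with NumPy-backed dtypes a missing value forces an integer column to `float64`, whereas the nullable `Int64` dtype keeps the integers and marks the gap as missing.

```python
import pandas as pd

# With NumPy-backed dtypes, a missing value forces integers to float64.
plain = pd.Series([1, 2, None])
print(plain.dtype)

# The nullable Int64 extension dtype keeps integers and stores pd.NA.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)
print(nullable.isna().tolist())
```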

@derHeinzer
Author

Sorry, I did not mean to offend anybody. I just wanted to report this bug and to understand what the extension dtypes are all about. I read about the nullability, but did not have that in mind right away.

If I used inappropriate language, I want to apologize. English is not my first language, though :)

@mroeschke
Member

No problem. I opened #48447 to track numba support for pandas nullable types.
