BUG (2.0rc0): groupby apply with a UDF changes dtype unexpectedly from double[pyarrow] to float64 #51991

Open
2 of 3 tasks
dylan-lee94 opened this issue Mar 15, 2023 · 2 comments
Labels
  • Apply: Apply, Aggregate, Transform, Map
  • Arrow: pyarrow functionality
  • Bug
  • Dtype Conversions: Unexpected or buggy dtype conversions
  • Groupby
  • pyarrow dtype retention: op with pyarrow dtype -> expect pyarrow result

Comments

dylan-lee94 commented Mar 15, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
pd.options.mode.dtype_backend = 'pyarrow'

df = pd.DataFrame({
    'tags': pd.Series([1,1,1,2,2,2,3,3,3,4,4,4,5,5,5],dtype='int64[pyarrow]'),
    'value': pd.Series(np.random.rand(15),dtype='double[pyarrow]')
    })

result1 = df.groupby('tags')['value'].transform(lambda x: x.sum())
print(result1.dtype)
result2 = df.groupby('tags')['value'].transform('sum')
print(result2.dtype)

result3 = df.groupby('tags')['value'].apply(lambda x: x.sum())
print(result3.dtype)
result4 = df.groupby('tags')['value'].apply('sum')
print(result4.dtype)

Issue Description

I am currently evaluating the release candidate for a work project. I noticed that when a lambda function is passed to groupby together with apply or transform, the dtype changes from double[pyarrow] back to float64.

Expected Behavior

I'd expect all four calls to consistently return the same dtype.

Installed Versions

2.0.0rc0

dylan-lee94 added the Bug and Needs Triage labels on Mar 15, 2023
@mroeschke
Member

Thanks for the report. For a UDF (e.g. a custom lambda function) passed into transform/apply/agg, if the dtype isn't specified by the UDF, the dtype needs to be inferred, and numpy dtypes are still the default returned from inference.

This will be solved once pandas DataFrame/Series constructors can return pyarrow dtypes from inference. We plan to add this as a keyword in a future version, but for this case a global option will probably be needed.

mroeschke added the Groupby, Apply, and Arrow labels and removed the Needs Triage label on Mar 15, 2023
mroeschke changed the title from "BUG (2.0rc0): groupby changes dtype unexpectedly from double[pyarrow] to float64" to "BUG (2.0rc0): groupby apply with a UDF changes dtype unexpectedly from double[pyarrow] to float64" on Mar 15, 2023
mroeschke added the Dtype Conversions label on Mar 15, 2023
@jbrockmendel
Member

There's a lot going on here. apply, transform, and agg have different expectations about what kind of output they see, and correspondingly different paths for inference, but they don't enforce those expectations (xref #51445). In this case, transform doesn't do any inference internally because it expects UDFs to return results that already have a dtype, whereas the UDF here returns a scalar.

-1 on a constructor keyword, xref #51846. The idiomatic way to do this is with maybe_cast_pointwise_result as close as possible to the UDF call.

xref #49163 for pyarrow dtype inference on UDF
