BUG: groupby.agg with UDF changing pyarrow dtypes #59601

Draft: wants to merge 40 commits into base: main


@rhshadrach rhshadrach (Member) commented Aug 25, 2024

Continuation of #58129

Root cause:

  • `agg_series` always forces the output dtype to match the input dtype, but depending on the UDF (for example, a lambda), the correct output dtype can differ from the input dtype

Fix:

  • replace all NA with nan
  • convert the `results` to the corresponding pyarrow extension array, using pyarrow library methods
  • pyarrow library methods are used instead of `maybe_convert_objects` because `maybe_convert_objects` does not check for NA and forces the dtype to float if NA is present (NA is not a float in pyarrow)

Kei added 30 commits April 1, 2024 19:04
@rhshadrach rhshadrach marked this pull request as draft August 25, 2024 13:02
@rhshadrach rhshadrach added the Groupby, Arrow (pyarrow functionality), pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result), and Bug labels and removed the Arrow (pyarrow functionality) label Aug 25, 2024
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 25, 2024
@rhshadrach rhshadrach changed the title from "Fix/group by agg pyarrow bool numpy same type" to "BUG: groupby.agg with UDF changing pyarrow dtypes" Oct 6, 2024
Labels
Bug, Groupby, pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result), Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy
2 participants