ENH: Improve performance for arrow dtypes in monotonic join #51365

Merged
phofl merged 3 commits into pandas-dev:main from pyarrow_monotonic_join
Feb 16, 2023

Conversation

phofl
Member

@phofl phofl commented Feb 13, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
from pandas import Index

idx = Index(list(range(1, 1_000_000)), dtype="int64[pyarrow]")
idx2 = Index(list(range(100_000, 1_100_000)), dtype="int64[pyarrow]")
idx.union(idx2)

# main
# %timeit idx.union(idx2)
# 327 ms ± 72.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pr
# %timeit idx.union(idx2)
# 2.79 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@jbrockmendel
Member

no objection here, but eventually we ought to find a way to do this dispatch without special-casing inside the index code (i.e. implement something at the EA level)

how is perf affected on multi-chunk pyarrow objs?

@phofl
Member Author

phofl commented Feb 13, 2023

With 2 chunks and arrays of 2 million entries, the initial performance is 380 ms. On this PR:

%timeit idx.union(idx2)
11.5 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

I totally agree with your point regarding the EA interface; this is a short-term solution for 2.0.

@phofl
Member Author

phofl commented Feb 15, 2023

@jbrockmendel ok to merge?

@phofl phofl added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Arrow (pyarrow functionality) labels Feb 15, 2023
@phofl phofl added this to the 2.0 milestone Feb 15, 2023
elif isinstance(self.values, ArrowExtensionArray):
    import pyarrow as pa

    return type(self.values)(pa.array(result))
Member

_from_sequence?

Member Author

good point, changed

Member

@jbrockmendel jbrockmendel left a comment

LGTM

@phofl phofl merged commit d82f9dd into pandas-dev:main Feb 16, 2023
@phofl phofl deleted the pyarrow_monotonic_join branch February 16, 2023 10:47