ENH: Improve performance for arrow dtypes in monotonic join #51365

phofl · 2023-02-13T20:56:40Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

from pandas import Index

idx = Index(list(range(1, 1_000_000)), dtype="int64[pyarrow]")
idx2 = Index(list(range(100_000, 1_100_000)), dtype="int64[pyarrow]")
idx.union(idx2)

# main
# %timeit idx.union(idx2)
# 327 ms ± 72.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pr
# %timeit idx.union(idx2)
# 2.79 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
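
For readers unfamiliar with the term, a "monotonic join" relies on both inputs already being sorted, so a union can be computed in one linear pass instead of through hashing. The following is a minimal pure-Python sketch of that idea only; `sorted_union` is a hypothetical name used for illustration and is not the pandas implementation, which uses compiled routines.

```python
def sorted_union(left, right):
    """Union of two sorted, duplicate-free lists via a single linear merge."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            # Smaller value only seen on the left so far.
            result.append(left[i])
            i += 1
        elif left[i] > right[j]:
            # Smaller value only seen on the right so far.
            result.append(right[j])
            j += 1
        else:
            # Value present in both inputs: emit once, advance both.
            result.append(left[i])
            i += 1
            j += 1
    # At most one of these tails is non-empty.
    result.extend(left[i:])
    result.extend(right[j:])
    return result


small = sorted_union(list(range(1, 10)), list(range(5, 15)))
```

Each element is visited once, so the cost grows linearly with the combined length of the two inputs.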

jbrockmendel · 2023-02-13T21:07:25Z

no objection here, but eventually we ought to find a way to do this dispatch without special-casing inside the index code (i.e. implement something at the EA level)

how is perf affected on multi-chunk pyarrow objs?

phofl · 2023-02-13T21:13:13Z

With 2 chunks and arrays of 2 million entries, initial performance was 380 ms. On this PR:

%timeit idx.union(idx2)
11.5 ms ± 477 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

I totally agree with your point regarding the EA interface; this is a short-term solution for 2.0.

phofl · 2023-02-15T22:44:53Z

@jbrockmendel ok to merge?

jbrockmendel · 2023-02-15T23:49:27Z

pandas/core/indexes/base.py

+        elif isinstance(self.values, ArrowExtensionArray):
+            import pyarrow as pa
+
+            return type(self.values)(pa.array(result))


_from_sequence?

good point, changed

jbrockmendel

LGTM

phofl added 2 commits February 13, 2023 21:55

ENH: Improve performance for arrow dtypes in monotonic join

9bcad62

Add gh ref

ee3b59e

phofl added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Arrow (pyarrow functionality) labels Feb 15, 2023

phofl added this to the 2.0 milestone Feb 15, 2023

jbrockmendel reviewed Feb 15, 2023

jbrockmendel approved these changes Feb 15, 2023

Change

df057ad

phofl merged commit d82f9dd into pandas-dev:main Feb 16, 2023

phofl deleted the pyarrow_monotonic_join branch February 16, 2023 10:47
