BUG (string): ArrowStringArray.find corner cases #59562

jbrockmendel · 2024-08-20T18:12:35Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jorisvandenbossche · 2024-08-20T19:48:18Z

This will only work if we also backport #56792

pandas/tests/strings/test_find_replace.py

jorisvandenbossche · 2024-08-21T20:04:52Z

This will only work if we also backport #56792

Backporting that specific PR is certainly possible, but if we want to avoid having to backport older fixes in general, we could also copy the _str_find implementation of ArrowEA to ArrowStringArray (i.e. actually duplicate it), and only do the de-duplication later.

jorisvandenbossche · 2024-08-21T20:56:37Z

(moving the most up-to-date implementation directly to the shared mixin (or to ArrowStringArray, depending on questions in #59555) would also avoid having to backport the older PR)

jbrockmendel · 2024-08-22T16:19:06Z

@jorisvandenbossche I've lost track of all your comments across several places. Please advise on the preferred course of action.

jorisvandenbossche · 2024-08-27T08:18:17Z

I think moving the shared _str_find method to the mixin would be best, if possible: that's consistent with how we are sharing code in other PRs now, and that also ensures that we don't have to care about the older backport, because this PR will simply put the fixed _str_find code in a new place, ensuring we get the older fix with it, also for StringDtype.

pandas/core/arrays/_arrow_string_mixins.py

jorisvandenbossche · 2024-08-27T17:08:40Z

pandas/core/arrays/string_arrow.py

+            and not (start != 0 and end is not None)
+            and not (start == 0 and end is None)
+        ):
+            # https://github.com/pandas-dev/pandas/pull/59562/files#r1725688888


This link points to this PR, but it doesn't seem to actually link to a specific comment

Ah, I suppose it points to #59562 (comment)

Now, can't we move that into the mixin as well? (avoiding this override) This is something that was just buggy in pyarrow before that version AFAIU, so I think there is no harm in also doing object-dtype fallback for ArrowDtype, since it otherwise just errors wrongly

huh, I could've sworn I copy/pasted it from somewhere, but now I can't find it. will update

Can you check my question in my second comment above?

About moving the implementation to the mixin? This PR now does that.

It's about the if block that this thread is commenting on, that is not in the mixin and my question is whether it shouldn't be moved as well? (it's a bug that affects ArrowEA as well AFAIU)

thanks for clarifying. will give it a try.
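
For readers following this thread, the guard under discussion can be read as a small predicate. The sketch below is only an illustration, not the pandas implementation: `needs_object_fallback` and `find_reference` are hypothetical names, and `pa_version_under13p0` is passed in as a plain boolean instead of being imported from pandas.

```python
from __future__ import annotations


def needs_object_fallback(
    pa_version_under13p0: bool, start: int, end: int | None
) -> bool:
    """Hypothetical helper mirroring the condition in the diff above."""
    return (
        pa_version_under13p0
        and not (start != 0 and end is not None)
        and not (start == 0 and end is None)
    )


def find_reference(values, sub, start=0, end=None):
    """Object-dtype style fallback: plain str.find per element.

    None is kept as a missing value (an assumption of this sketch).
    """
    return [None if v is None else v.find(sub, start, end) for v in values]


# Show which (start, end) combinations would take the fallback on old pyarrow.
for start, end in [(0, None), (2, 5), (0, 5), (2, None)]:
    print(start, end, needs_object_fallback(True, start, end))
```

Under these assumptions, the fallback is taken only when exactly one of `start`/`end` differs from its default; the cases where both or neither are given do not hit the guard.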

jorisvandenbossche · 2024-08-27T17:12:22Z

pandas/core/arrays/_arrow_string_mixins.py

+        # Convert an int-dtype arrow result to an appropriate output type.
+        raise NotImplementedError
+
+    def _apply_elementwise(self, func: Callable) -> list[list[Any]]:


This needs to be implemented for ArrowStringArray as well then?

this is only used for the ArrowEA version. The ArrowStringArray goes through _str_map, which ArrowEA doesn't have. eventually I'd like to align the names, but there are too many branches/PRs as it is.

Not sure what I am missing, but _apply_elementwise is called from the now-shared _str_find method just below, and so I would think that you can also get there from ArrowStringArray._str_find ?

yep, my bad. ArrowStringArray inherits from ArrowEA, so it gets its _apply_elementwise from there. putting it here just prevents mypy from complaining
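
The pattern described in this thread (a stub on the mixin so the type checker knows the method exists, with the real implementation inherited from elsewhere) can be sketched as follows. All class names here are hypothetical stand-ins, and the element-wise helper is simplified to return a flat list; it is not the pandas code.

```python
from typing import Any, Callable, List, Optional


class StringMixinSketch:
    """Stand-in for the shared mixin: declares what it needs, implements _str_find."""

    def _apply_elementwise(self, func: Callable) -> List[Any]:
        # Stub only: lets the shared method below type-check.
        raise NotImplementedError

    def _str_find(self, sub: str, start: int = 0, end: Optional[int] = None):
        return self._apply_elementwise(lambda val: val.find(sub, start, end))


class ArrowEASketch(StringMixinSketch):
    """Stand-in for the array class that provides the real element-wise helper."""

    def __init__(self, values):
        self._values = list(values)

    def _apply_elementwise(self, func: Callable) -> List[Any]:
        # Missing values (None) are passed through untouched.
        return [None if v is None else func(v) for v in self._values]


class ArrowStringArraySketch(ArrowEASketch):
    """Inherits both _str_find and the concrete _apply_elementwise."""


arr = ArrowStringArraySketch(["abc", None, "cab"])
print(arr._str_find("b"))
```

In this sketch the subclass never reaches the stub, because the concrete `_apply_elementwise` sits earlier in the method resolution order than the mixin's placeholder.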

jbrockmendel · 2024-08-28T20:01:08Z

I think comments have been addressed here

jorisvandenbossche · 2024-08-30T14:14:20Z

pandas/core/arrays/string_arrow.py

+        if (
+            pa_version_under13p0
+            and not (start != 0 and end is not None)
+            and not (start == 0 and end is None)
+        ):
+            # GH#59562
            return super()._str_find(sub, start, end)
-        return self._convert_int_result(result)
+        return ArrowStringArrayMixin._str_find(self, sub, start, end)


Now that this special case is moved in the mixin method, I would expect this can be removed entirely? (and replaced with a _str_find = ArrowStringArrayMixin._str_find)

this goes through a cython path instead of iterating in python

Ah, through _str_map using lib.map_infer_mask, I suppose. But if there is a cython implementation that is presumably faster, shouldn't we use that for the ArrowDtype as well?
I saw that in the center PR at https://github.com/pandas-dev/pandas/pull/59624/files#diff-ca6e5560b2fc1721e129b85f10882df8a1f20b5f1ef4dff547170fa35898dfa6R62 you didn't use _apply_elementwise but also explicitly went through object dtype. That's for the same reason? Can we use the same pattern?

Sure, changed.

looks like doing this broke the min_versions build, so reverted
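
The two options discussed in this thread (an explicit override that delegates to the mixin, versus a class-level alias) can be compared in isolation. This is a minimal sketch with hypothetical stand-in classes, assuming the object-dtype base comes first in the method resolution order; it says nothing about which path is faster or why the min_versions build failed.

```python
class ObjectPathSketch:
    """Stand-in for the object-dtype string methods."""

    def _str_find(self, sub):
        return "object path"


class ArrowPathSketch:
    """Stand-in for the shared Arrow mixin."""

    def _str_find(self, sub):
        return "arrow path"


class Plain(ObjectPathSketch, ArrowPathSketch):
    # No override: the MRO picks the first base that defines _str_find.
    pass


class Overriding(ObjectPathSketch, ArrowPathSketch):
    # Explicit override, leaving room for a conditional fallback via super().
    def _str_find(self, sub, use_fallback=False):
        if use_fallback:
            return super()._str_find(sub)
        return ArrowPathSketch._str_find(self, sub)


class Aliasing(ObjectPathSketch, ArrowPathSketch):
    # Class-level alias: always use the mixin's implementation.
    _str_find = ArrowPathSketch._str_find


print(Plain()._str_find("x"))
print(Overriding()._str_find("x"))
print(Aliasing()._str_find("x"))
```

The alias is shorter, but only the override can route selected cases back to `super()`, which is what the version-gated fallback in the diff relies on.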

jbrockmendel · 2024-08-31T16:22:36Z

_apply_elementwise is the existing pattern that ArrowEA uses instead of _str_map. At some point I'd like to align/optimize it, but I'm not there yet.


jorisvandenbossche · 2024-08-31T16:36:20Z

Yes, but I am talking about two open PRs. Can't we already align those? (I am not speaking about other code in arrow/array.py, but only about things we now refactor and move to _arrow_string_mixins.py)

jorisvandenbossche · 2024-09-06T15:06:01Z

pandas/tests/extension/test_arrow.py

+    arrow_str_series = s.astype(pd.StringDtype(storage="pyarrow"))
+    result2 = arrow_str_series.str.find(sub, start, end).astype(result.dtype)
+    tm.assert_series_equal(result2, expected)


For future PRs, we should add such tests to pandas/tests/strings, I think (because now it is testing StringDtype in tests specifically for ArrowDtype ..)

jorisvandenbossche · 2024-09-06T15:06:23Z

Thanks!

jorisvandenbossche added the Strings String extension data type and string data label Aug 20, 2024

jorisvandenbossche reviewed Aug 21, 2024


jbrockmendel force-pushed the bug-str-find branch 2 times, most recently from e4c2157 to 3433cec Compare August 26, 2024 21:24

jbrockmendel mentioned this pull request Aug 27, 2024

REF (string): de-duplicate ArrowStringArray methods #59555

Merged

5 tasks

jbrockmendel force-pushed the bug-str-find branch from 3433cec to f79b072 Compare August 27, 2024 14:52

jorisvandenbossche mentioned this pull request Aug 27, 2024

REF (string): de-duplicate str_endswith, startswith #59568

Merged

jorisvandenbossche reviewed Aug 27, 2024


jorisvandenbossche reviewed Aug 27, 2024


mroeschke added this to the 2.3 milestone Aug 28, 2024

jbrockmendel force-pushed the bug-str-find branch from 78af1b9 to 84edcb5 Compare August 28, 2024 18:15

jbrockmendel force-pushed the bug-str-find branch 2 times, most recently from a33325c to 6e8e2ce Compare August 29, 2024 22:16

jorisvandenbossche reviewed Aug 30, 2024


jbrockmendel force-pushed the bug-str-find branch 3 times, most recently from 4b878ce to 28aa96b Compare September 4, 2024 21:58

jbrockmendel added 4 commits September 5, 2024 18:05

BUG (string): ArrowStringArray.find corner cases

7a99bdb

xfail on old pyarrow

f7f19d3

fallback with older pyarrow

d9f0aa7

REF: move implementation to ArrowStringArrayMixin

f11921e

jbrockmendel added 6 commits September 5, 2024 18:06

mypy fixup

c34ae46

trim bad link

86ef129

fallback to pointwise for ArrowEA

472f17a

mypy fixup

e4c782c

TST: un-xfail

e1b7913

fixup post-rebase

8f07638

jbrockmendel force-pushed the bug-str-find branch from 28aa96b to 8f07638 Compare September 6, 2024 01:06

jorisvandenbossche approved these changes Sep 6, 2024


jorisvandenbossche merged commit 3f8d3e4 into pandas-dev:main Sep 6, 2024
46 of 47 checks passed

jbrockmendel deleted the bug-str-find branch September 6, 2024 15:57

jorisvandenbossche added the backported label Oct 10, 2024

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024

BUG (string): ArrowStringArray.find corner cases (pandas-dev#59562)

44325c1

jorisvandenbossche pushed a commit that referenced this pull request Oct 10, 2024

BUG (string): ArrowStringArray.find corner cases (#59562)

a065afb
