-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG (string): ArrowStringArray.find corner cases #59562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG (string): ArrowStringArray.find corner cases #59562
Conversation
This will only work if we also backport #56792 |
Backporting that specific PR is of course certainly possible, but if we want to avoid in general to have to backport older fixes, we could also copy the |
(moving the most up to date implementation directly to the shared mixin (or to ArrowStringArray, depending on questions in #59555), would actually also avoid requiring to backport the older PR) |
@jorisvandenbossche ive lost track of all your comments across several places. pls advise on the preferred course of action. |
e4c2157
to
3433cec
Compare
I think moving the shared |
3433cec
to
f79b072
Compare
pandas/core/arrays/string_arrow.py
Outdated
and not (start != 0 and end is not None) | ||
and not (start == 0 and end is None) | ||
): | ||
# https://github.com/pandas-dev/pandas/pull/59562/files#r1725688888 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link points to this PR, but it doesn't seem to work to actually link to a specific comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I suppose it points to #59562 (comment)
Now, can't we move that into the mixin as well? (avoiding this override) This is something that was just buggy in pyarrow before that version AFAIU, so I think there is no harm in also doing object-dtype fallback for ArrowDtype, since it otherwise just errors wrongly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh i couldve sworn i copy/pasted it from somewhere, but now i cant find it. will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you check my question in my second comment above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About moving the implementation to the mixin? This PR now does that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's about the if
block that this thread is commenting on, that is not in the mixin and my question is whether it shouldn't be moved as well? (it's a bug that affects ArrowEA as well AFAIU)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying. will give it a try.
# Convert an int-dtype arrow result to an appropriate output type. | ||
raise NotImplementedError | ||
|
||
def _apply_elementwise(self, func: Callable) -> list[list[Any]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be implemented for ArrowStringArray as well then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only used for the ArrowEA version. The ArrowStringArray goes through _str_map, which ArrowEA doesn't have. eventually id like to align the names, but there are too many branches/PRs as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what I am missing, but _apply_elementwise
is called from the now-shared _str_find
method just below, and so I would think that you can also get there from ArrowStringArray._str_find
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep my bad. ArrowStringArray inherits ArrowEA so gets its apply_elementwise from there. putting it here just prevents mypy from complaining
78af1b9
to
84edcb5
Compare
I think comments have been addressed here |
a33325c
to
6e8e2ce
Compare
if ( | ||
pa_version_under13p0 | ||
and not (start != 0 and end is not None) | ||
and not (start == 0 and end is None) | ||
): | ||
# GH#59562 | ||
return super()._str_find(sub, start, end) | ||
return self._convert_int_result(result) | ||
return ArrowStringArrayMixin._str_find(self, sub, start, end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that this special case is moved in the mixin method, I would expect this can be removed entirely? (and replaced with a _str_find = ArrowStringArrayMixin._str_find
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this goes through a cython path instead of iterating in python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, through _str_map
using lib.map_infer_mask
, I suppose. But if there is a cython implementation that is presumably faster, shouldn't we use that for the ArrowDtype as well?
I saw that in the center PR at https://github.com/pandas-dev/pandas/pull/59624/files#diff-ca6e5560b2fc1721e129b85f10882df8a1f20b5f1ef4dff547170fa35898dfa6R62 you didn't use _apply_elementwise
but also explicitly went through object dtype. That's for the same reason? Can we use the same pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like doing this broke the min_versions build, so reverted
Apply element wise is the existing pattern the ArrowEA uses instead of
strmap. At some point I’d like to align/optimize it, but am not there yet.
…On Sat, Aug 31, 2024 at 9:13 AM Joris Van den Bossche < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/arrays/string_arrow.py
<#59562 (comment)>:
> + if (
+ pa_version_under13p0
+ and not (start != 0 and end is not None)
+ and not (start == 0 and end is None)
+ ):
+ # GH#59562
return super()._str_find(sub, start, end)
- return self._convert_int_result(result)
+ return ArrowStringArrayMixin._str_find(self, sub, start, end)
Ah, through _str_map using lib.map_infer_mask, I suppose. But if there is
a cython implementation that is presumably faster, shouldn't we use that
for the ArrowDtype as well?
I saw that in the center PR at
https://github.com/pandas-dev/pandas/pull/59624/files#diff-ca6e5560b2fc1721e129b85f10882df8a1f20b5f1ef4dff547170fa35898dfa6R62
you didn't use _apply_elementwise but also explicitly went through object
dtype. That's for the same reason? Can we use the same pattern?
—
Reply to this email directly, view it on GitHub
<#59562 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6ENW65K2JXBVQPIWG3ZUHTS5AVCNFSM6AAAAABM2N2WDSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDENZTHEZTMNBZGE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Yes, but I am talking about two open PRs. Can't we already align in that? (I am not speaking about other code in |
4b878ce
to
28aa96b
Compare
28aa96b
to
8f07638
Compare
arrow_str_series = s.astype(pd.StringDtype(storage="pyarrow")) | ||
result2 = arrow_str_series.str.find(sub, start, end).astype(result.dtype) | ||
tm.assert_series_equal(result2, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For future PRs, we should add such tests to pandas/tests/strings
, I think (because now it is testing StringDtype in tests specifically for ArrowDtype ..)
Thanks! |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.