Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418
pandas/core/arrays/arrow/array.py
Outdated
@@ -2579,7 +2579,7 @@ def _str_split(
             n = None
         if pat is None:
             split_func = pc.utf8_split_whitespace
-        elif regex:
+        elif regex or (regex is None and len(pat) != 1):
Thanks for the PR - I'm not sure this is the right fix though. Do you see where the behavior deviates between the different string types? This current fix seems like it would apply a behavior change to all types
Thanks for the review. The behavior deviates here. string[pyarrow] goes through
pandas/pandas/core/strings/object_array.py
Line 327 in a1fc8e8
def _str_split(
while pd.ArrowDtype(pa.string()) goes through
pandas/pandas/core/arrays/arrow/array.py
Line 2571 in a1fc8e8
def _str_split(
The docstring of str.split says this about regex: "If None and pat length is not 1, treats pat as a regular expression." This behavior has been implemented in the first _str_split, but not in the second _str_split. So I added this condition to the second _str_split to fix the issue.
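To make the docstring rule concrete, here is a minimal standard-library sketch of how the three-state regex flag could be resolved. The helpers treat_as_regex and split_one are hypothetical names for illustration, not pandas functions, and the per-string split only stands in for what the real array-level implementations do.

```python
import re
from typing import List, Optional


def treat_as_regex(pat: str, regex: Optional[bool]) -> bool:
    """Hypothetical helper: resolve the three-state ``regex`` flag."""
    if regex is None:
        # Documented rule: a pattern whose length is not 1 is a regex.
        return len(pat) != 1
    return regex


def split_one(value: str, pat: str, regex: Optional[bool] = None) -> List[str]:
    """Split a single string; a stand-in for the per-element work."""
    if treat_as_regex(pat, regex):
        return re.split(pat, value)
    return value.split(pat)


print(split_one("a.b.c", "."))                   # length 1: literal split
print(split_one("a1b22c", r"\d+"))               # length != 1: regex split
print(split_one("a1b22c", r"\d+", regex=False))  # forced literal: no match
```

With regex=None, a single-character pattern such as "." is taken literally, while a longer pattern such as r"\d+" is compiled as a regular expression. The reported bug is that one of the two code paths skipped this resolution step.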
Ah OK thanks, that is helpful. Is there a way to make these implementations look more alike? I see what you are trying to accomplish here, but it's hard to tell the corner cases where these may still diverge. Is there a reason why the implementations need to differ at all?
My initial intention was to make as few changes as possible.
To make it more coherent, I would rather set regex=True for the corner case before calling _str_split in the code below. Do you think it's OK?
pandas/pandas/core/strings/accessor.py
Lines 911 to 913 in a1fc8e8
if is_re(pat):
    regex = True
result = self._data.array._str_split(pat, n, expand, regex)
I moved the logic that determines whether pat is a regex outside, so that the two _str_split implementations look more alike. Could you review again?
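The idea of resolving the flag once, before dispatching, can be sketched as follows. This is an illustration under stated assumptions rather than the PR's actual code: normalize_regex, object_backend, arrow_like_backend and split are hypothetical names, and both backends are plain-Python stand-ins for the real object-array and Arrow implementations.

```python
import re
from typing import Callable, List, Optional, Union

Pat = Union[str, re.Pattern]


def normalize_regex(pat: Pat, regex: Optional[bool]) -> bool:
    """Hypothetical: decide once whether ``pat`` is a regex."""
    if isinstance(pat, re.Pattern):
        # Mirrors the ``is_re(pat)`` check quoted above.
        return True
    if regex is None:
        # Docstring rule: a pattern whose length is not 1 is a regex.
        return len(pat) != 1
    return regex


def object_backend(value: str, pat: Pat, regex: bool) -> List[str]:
    """Stand-in for the object-array path; ``regex`` is already a bool."""
    return re.split(pat, value) if regex else value.split(pat)


def arrow_like_backend(value: str, pat: Pat, regex: bool) -> List[str]:
    """Stand-in for the Arrow path: different mechanics, same contract."""
    pattern = pat if regex else re.escape(pat)
    return re.split(pattern, value)


def split(
    value: str,
    pat: Pat,
    regex: Optional[bool] = None,
    backend: Callable[[str, Pat, bool], List[str]] = object_backend,
) -> List[str]:
    # The backends never see ``regex=None``, so they cannot disagree on it.
    return backend(value, pat, normalize_regex(pat, regex))


# Every combination of pattern and flag gives the same result on both paths.
for pat in ["-", "--", r"\s+"]:
    for flag in (None, True, False):
        a = split("a-b--c  d", pat, flag, object_backend)
        b = split("a-b--c  d", pat, flag, arrow_like_backend)
        assert a == b, (pat, flag)
```

Because each backend receives a plain boolean, the corner case lives in exactly one place and the two implementations only have to agree on literal versus regex splitting.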
pandas/tests/extension/test_arrow.py
Outdated
@@ -2296,6 +2296,16 @@ def test_str_split_pat_none(method):
     tm.assert_series_equal(result, expected)


+def test_str_split_regex_none():
Can you move this test to pandas/tests/strings/test_split_partition.py, so we can parametrize this with all the different string dtype implementations, ensuring the different ones all behave the same?
Done in the new commit.
But the tests look a bit ugly to me, because the expected output for pd.ArrowDtype(pa.string()) has a different array dtype from the expected output for the other string dtypes. Maybe it's better to keep the test separate in test_arrow.py?
5031be7 to 6f93a8d
pandas/conftest.py
Outdated
"string[python]",
pytest.param("string[pyarrow]", marks=td.skip_if_no("pyarrow")),
pytest.param("string[pyarrow_numpy]", marks=td.skip_if_no("pyarrow")),
pytest.param(pd.ArrowDtype(pa.string()), marks=td.skip_if_no("pyarrow")),
I would actually prefer to just add the pd.ArrowDtype(pa.string()) to the existing string dtypes instead of copying and creating a new fixture. Guessing that causes a lot of other test failures?
Ah, I forgot that pd.ArrowDtype(pa.string()) was not actually in the fixture, so my suggestion led you a bit in the wrong direction. Sorry!
Right now adding this to the main any_string_dtype fixture will indeed give quite some failures, yes. I agree that it might be better to actually do that (and it would be interesting to see which tests actually fail), but that's for another PR / out of scope for this bug fix (doing so would also require removing some tests that now only exist for the arrow string dtype, to not keep things duplicated).
Yes, a lot of other tests would need to be adjusted if ArrowDtype were added to the fixture.
So for this PR, should I just add the test in pandas/tests/extension/test_arrow.py?
Opened #58495 so we can track the larger issue
pandas/core/strings/object_array.py
Outdated
elif regex is False:
    new_pat = pat
# regex is None so link to old behavior #43563
else:
    if len(pat) == 1:
        new_pat = pat
    else:
        new_pat = re.compile(pat)
    new_pat = pat
I think we want to keep this, otherwise it would not be a bugfix for pd.ArrowDtype(pa.string()) but changing behaviour for all other string dtypes?
As discussed above in #58418 (comment), the bug arises because this logic is not implemented in the _str_split method of pd.ArrowDtype(pa.string()). To fix it, and to make the two _str_split implementations look more alike, I moved this logic outside, before the call to _str_split. So I think the behavior for the other dtypes has not changed.
Moreover, I think the existing tests cover all combinations of parameters, and as long as they all pass, the behavior should still be the same.
This reverts commit 6f93a8d.
I've done some cleaning up and kept the original code. Could you have a further review? @jorisvandenbossche @WillAyd
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.
According to the docs, if regex is None and the length of pat is not 1, pat is treated as a regular expression.
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.