-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: inconsistency in dtype of replace() #44897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This broke lots of tests
pandas/core/internals/blocks.py
Outdated
@@ -663,7 +663,9 @@ def replace( | |||
regex = should_use_regex(regex, to_replace) | |||
|
|||
if regex: | |||
return self._replace_regex(to_replace, value, inplace=inplace) | |||
self.values = np.asarray( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that this is the right place. Please try doing this deeper when operating with actual arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ensure that ci passes before requesting reviews. And please do thid inside replace regex
pandas/core/internals/blocks.py
Outdated
@@ -739,7 +739,10 @@ def _replace_regex( | |||
replace_regex(new_values, rx, value, mask) | |||
|
|||
block = self.make_block(new_values) | |||
return [block] | |||
if self.ndim == 1 or self.shape[0] == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we use this pattern the 'else' path usually involves split_and_operate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the previous case where the dimensions are as follows split and operate don't operate on these dimensions, for split and operate you need
assert self.ndim == 2 and self.shape[0] != 1
, which we do not use here.
Since we have a single element here, we should not split and upcast, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i understand. im saying its weird to see this check on L734 and not see a split_and_operate on L737.
e.g. in the test you implemented, what happens if you use pd.DataFrame({"A": ["0"], "B": ["0"]})
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In replace_regex
, we have aldready applied the replacement with the only issue of the type cast. So, if I apply split_and_operate again, it won't give any specific result except for converting dtype before breaking a lot of tests. The solution is correct with the dtype being wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.DataFrame({"A": ["0"], "B: ["0"]})
is giving a dtype assertion error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is giving a dtype assertion error
can you paste it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rectified it, the testcase has been added.
def test_replace_regex_dtype(self): | ||
# GH-48644 | ||
s = pd.Series(["0"]) | ||
exp = s.replace(to_replace="0", value=1, regex=False).dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explictly create the expected and use tm.assert_series_equal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you should create expected = Series(...) not using replace. And please don't use one letter variable names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments
looks like this is a more limited form of #44940 cc @jbrockmendel |
Related, but that focuses on the non-regex paths whereas this is only-regex paths. |
result_df1 = df1.replace(to_replace="0", value=1, regex=regex) | ||
tm.assert_frame_equal(result_df1, expected_df1) | ||
|
||
df2 = DataFrame({"A": ["0"], "B": [np.NaN]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was unclear in the request: the "B" column should also be strings, just not "0"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Looks good. Could use a whatsnew note for the bugfix. |
Whatsnew note is added @jbrockmendel |
@shubham11941140 can you merge master (conflicts in the tests) and ping on green |
Merged master @jreback |
I am getting very weird errors which are not related to the big fix, request you to merge so I can move ahead with the next bug fix. |
@jbrockmendel @jbrockmendel any update? |
@jreback , any update? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls rebase as well, ping on green.
doc/source/whatsnew/v1.4.0.rst
Outdated
@@ -824,7 +824,7 @@ Reshaping | |||
- Bug in :meth:`DataFrame.stack` with ``ExtensionDtype`` columns incorrectly raising (:issue:`43561`) | |||
- Bug in :meth:`Series.unstack` with object doing unwanted type inference on resulting columns (:issue:`44595`) | |||
- Bug in :class:`MultiIndex` failing join operations with overlapping ``IntervalIndex`` levels (:issue:`44096`) | |||
- | |||
- Bug in :func:`replace` results is different ``dtype`` based on ``regex`` parameter (:issue:`44864`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use :meth:`DataFrame.replace`
and for Series as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jreback is is green, request you to merge. |
thanks @shubham11941140 |
Prevented direct return of Regex=True allowing the dtype conversion to happen.