Skip to content

BUG: inconsistency in dtype of replace() #44897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Dec 22, 2021
Merged

Conversation

shubham11941140
Copy link
Contributor

@shubham11941140 shubham11941140 commented Dec 15, 2021

Prevented direct return of Regex=True allowing the dtype conversion to happen.

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This broke lots of tests

@@ -663,7 +663,9 @@ def replace(
regex = should_use_regex(regex, to_replace)

if regex:
return self._replace_regex(to_replace, value, inplace=inplace)
self.values = np.asarray(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this is the right place. Please try doing this deeper when operating with actual arrays

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure that ci passes before requesting reviews. And please do thid inside replace regex

@@ -739,7 +739,10 @@ def _replace_regex(
replace_regex(new_values, rx, value, mask)

block = self.make_block(new_values)
return [block]
if self.ndim == 1 or self.shape[0] == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we use this pattern the 'else' path usually involves split_and_operate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the previous case where the dimensions are as follows split and operate don't operate on these dimensions, for split and operate you need

assert self.ndim == 2 and self.shape[0] != 1, which we do not use here.

Since we have a single element here, we should not split and upcast, right?

Copy link
Member

@jbrockmendel jbrockmendel Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i understand. im saying its weird to see this check on L734 and not see a split_and_operate on L737.

e.g. in the test you implemented, what happens if you use pd.DataFrame({"A": ["0"], "B": ["0"]})?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In replace_regex, we have aldready applied the replacement with the only issue of the type cast. So, if I apply split_and_operate again, it won't give any specific result except for converting dtype before breaking a lot of tests. The solution is correct with the dtype being wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pd.DataFrame({"A": ["0"], "B: ["0"]}) is giving a dtype assertion error

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is giving a dtype assertion error

can you paste it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rectified it, the testcase has been added.

def test_replace_regex_dtype(self):
# GH-48644
s = pd.Series(["0"])
exp = s.replace(to_replace="0", value=1, regex=False).dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explictly create the expected and use tm.assert_series_equal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you should create expected = Series(...) not using replace. And please don't use one letter variable names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 16, 2021
@jreback
Copy link
Contributor

jreback commented Dec 17, 2021

looks like this is a more limited form of #44940 cc @jbrockmendel

@jbrockmendel
Copy link
Member

looks like this is a more limited form of #44940 cc @jbrockmendel

Related, but that focuses on the non-regex paths whereas this is only-regex paths.

result_df1 = df1.replace(to_replace="0", value=1, regex=regex)
tm.assert_frame_equal(result_df1, expected_df1)

df2 = DataFrame({"A": ["0"], "B": [np.NaN]})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was unclear in the request: the "B" column should also be strings, just not "0"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jbrockmendel
Copy link
Member

Looks good. Could use a whatsnew note for the bugfix.

@shubham11941140
Copy link
Contributor Author

Whatsnew note is added @jbrockmendel

@jreback
Copy link
Contributor

jreback commented Dec 19, 2021

@shubham11941140 can you merge master (conflicts in the tests) and ping on green

@shubham11941140
Copy link
Contributor Author

Merged master @jreback

@shubham11941140
Copy link
Contributor Author

I am getting very weird errors which are not related to the big fix, request you to merge so I can move ahead with the next bug fix.

@shubham11941140
Copy link
Contributor Author

@jbrockmendel @jbrockmendel any update?

@shubham11941140
Copy link
Contributor Author

@jreback , any update?

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls rebase as well, ping on green.

@@ -824,7 +824,7 @@ Reshaping
- Bug in :meth:`DataFrame.stack` with ``ExtensionDtype`` columns incorrectly raising (:issue:`43561`)
- Bug in :meth:`Series.unstack` with object doing unwanted type inference on resulting columns (:issue:`44595`)
- Bug in :class:`MultiIndex` failing join operations with overlapping ``IntervalIndex`` levels (:issue:`44096`)
-
- Bug in :func:`replace` results is different ``dtype`` based on ``regex`` parameter (:issue:`44864`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use :meth:`DataFrame.replace` and for Series as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback jreback added this to the 1.4 milestone Dec 22, 2021
@shubham11941140
Copy link
Contributor Author

@jreback is is green, request you to merge.

@jreback jreback merged commit b1a2f48 into pandas-dev:master Dec 22, 2021
@jreback
Copy link
Contributor

jreback commented Dec 22, 2021

thanks @shubham11941140

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: inconsistency in dtype of replace()
4 participants