TST: test for inconsistency due to dtype=string #46512 #47793

Closed
wants to merge 40 commits into from

Conversation

Shadimrad
Contributor

@Shadimrad Shadimrad changed the title Issue2 BUG Inconsistency due to dtype=string #46512 Jul 19, 2022
@pep8speaks

pep8speaks commented Jul 20, 2022

Hello @Shadimrad! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-08-18 23:51:54 UTC

@@ -9500,6 +9500,11 @@ def _where(
            self._check_inplace_setting(other)
            new_data = self._mgr.putmask(mask=cond, new=other, align=align)
            result = self._constructor(new_data)
            for i in range(len(result.dtypes)):
Member

This is the wrong place for this. We cannot special-case this here.

Contributor Author

Would you mind elaborating on what you mean? I don't believe it is a special case, since it only affects the type when inplace is True. Do you mean I should put it inside putmask? @phofl

Member

Probably, but this looks already correct on main, so no need to fix I think.

1.5.0.dev0+1180.g8c3a2f2ba7
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
      A
0     1
1  <NA>
2     3
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>

Could you simply add a test?

cc @simonjayhawkins
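
The script behind the output above is not shown in the thread. A minimal sketch of an equivalent check follows, assuming a three-row frame with an empty string in the middle; the column name A and the values are inferred from the printed output, not taken from the reviewer's script.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Assumed input, inferred from the printed output: the empty string is the
# value that `where` replaces with a missing value.
df_inplace = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
df_copy = df_inplace.copy()

# In-place variant: mutates df_inplace.
df_inplace.where(df_inplace != "", np.nan, inplace=True)

# Non-in-place variant: returns a new frame and leaves df_copy untouched.
df_not_inplace = df_copy.where(df_copy != "", np.nan)

print(pd.__version__)
print(df_inplace["A"].array)
print(type(df_inplace.loc[1, "A"]))
print(df_not_inplace["A"].array)
print(type(df_not_inplace.loc[1, "A"]))

# Both variants should agree on values and dtype.
assert_frame_equal(df_inplace, df_not_inplace)
```

If the two variants disagreed on the type of the missing value, the final assertion or the printed types would be expected to expose it.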

Contributor Author

oh! sure.

Member

> Probably, but this looks already correct on main, so no need to fix I think.

see #46512 (comment)

I'll confirm the commit where the fix occurred and if we agree that this is the correct behavior, then indeed we just need a test.

Member

I think this is essentially the same as #47628.

Member

yes from #47628 (comment)

> But the strange thing is that doing this on the Series level doesn't end up calling StringArray.__setitem__; it seems to go through Series._where and eventually BlockManager.putmask and ExtensionArray._putmask, and that last one is not correctly implemented for StringArray.

But we should probably also have tests for DataFrame.where, in case the implementation of __setitem__ changes to no longer go through Series._where.
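
The two entry points discussed here, boolean-mask assignment through __setitem__ and where, can be exercised side by side. The sketch below uses illustrative input (a string column containing one empty string) and only compares the observable results; which internal path each call takes is as described in the comment above, not something this example verifies.

```python
import numpy as np
import pandas as pd

# Series-level boolean-mask assignment.
ser = pd.Series(["1", "", "3"], dtype="string")
ser[ser == ""] = np.nan

# DataFrame-level boolean-mask assignment.
df_setitem = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
df_setitem[df_setitem == ""] = np.nan

# DataFrame.where with inplace=True.
df_where = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
df_where.where(df_where != "", np.nan, inplace=True)

# Report the dtype and the type of the replaced element for each route.
results = {
    "Series.__setitem__": (ser.dtype, ser.iloc[1]),
    "DataFrame.__setitem__": (df_setitem["A"].dtype, df_setitem.loc[1, "A"]),
    "DataFrame.where(inplace=True)": (df_where["A"].dtype, df_where.loc[1, "A"]),
}
for route, (dtype, value) in results.items():
    print(f"{route}: dtype={dtype}, missing value type={type(value).__name__}")
```

Covering each route in its own test keeps the where behaviour checked even if the assignment path is later reimplemented.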

Contributor Author

Just out of curiosity, how would it be implemented if we should not special-case it? Is there a link to the change that fixed it on main, by any chance? @phofl

@Shadimrad Shadimrad changed the title BUG Inconsistency due to dtype=string #46512 TST test for inconsistency due to dtype=string #46512 Jul 20, 2022
@Shadimrad Shadimrad changed the title TST test for inconsistency due to dtype=string #46512 TST: test for inconsistency due to dtype=string #46512 Jul 20, 2022
@mroeschke mroeschke added the Strings String extension data type and string data label Jul 22, 2022
@Shadimrad
Contributor Author

take

@Shadimrad Shadimrad marked this pull request as draft August 3, 2022 17:55
@Shadimrad Shadimrad marked this pull request as ready for review August 4, 2022 08:03
def test_consitency_inplace():
    df = pd.DataFrame({"M": [""]}, dtype="string")
    df2 = pd.DataFrame({"M": [""]}, dtype="string")
    df2.where(df2 != "", np.nan, inplace=True)
Member

Would be good to compare df and df2 to a separately created DataFrame e.g.

expected = pd.DataFrame(...)
tm.assert_frame_equal(df, expected)
tm.assert_frame_equal(df2, expected)

@Shadimrad
Contributor Author

Shadimrad commented Aug 17, 2022 via email

df = pd.DataFrame({"M": [""]}, dtype="string")
df.where(df != "", np.nan, inplace=True)
expected = expected.where(expected != "", np.nan)
tm.assert_frame_equal(expected, df)
Member

In my prior comment, I meant that there should be two tm.assert_frame_equal calls so that inplace=True and inplace=False are tested separately, e.g.

expected = pd.DataFrame({"M": [""]}, dtype="string")
df_inplace = ...
tm.assert_frame_equal(df_inplace, expected)
df_not_inplace = ...
tm.assert_frame_equal(df_not_inplace, expected)
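
Filled in, the suggested structure might look like the sketch below. The test name is hypothetical, the public pandas.testing.assert_frame_equal stands in for tm.assert_frame_equal so the snippet runs outside the pandas test suite, and the expected frame holding pd.NA is an assumption based on the behaviour reported on main earlier in this thread.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_where_string_dtype_inplace_and_not_inplace():
    # Hypothetical test name. The expected frame is built independently of
    # `where`, so neither code path is used to validate itself.
    # Assumption: the empty string should become pd.NA with string dtype.
    expected = pd.DataFrame({"M": [pd.NA]}, dtype="string")

    # inplace=True: the frame is modified directly.
    df_inplace = pd.DataFrame({"M": [""]}, dtype="string")
    df_inplace.where(df_inplace != "", np.nan, inplace=True)
    assert_frame_equal(df_inplace, expected)

    # inplace=False (the default): a new frame is returned.
    df = pd.DataFrame({"M": [""]}, dtype="string")
    df_not_inplace = df.where(df != "", np.nan)
    assert_frame_equal(df_not_inplace, expected)


test_where_string_dtype_inplace_and_not_inplace()
```

Because each result is compared to the same independently built frame, a regression in either variant fails on its own assertion rather than being masked by the other.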

@mroeschke
Member

mroeschke commented Aug 23, 2022

I don't think this test is entirely required for 1.5 so removing that milestone. Once ready, we can scope for the 1.6 branch.

@mroeschke mroeschke removed this from the 1.5 milestone Aug 23, 2022
@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Sep 23, 2022
@mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Oct 4, 2022
Labels: Stale; Strings (String extension data type and string data); Testing (pandas testing functions or related to the test suite)

Successfully merging this pull request may close these issues.
BUG: Inconsistency in DataFrame.where between inplace and not inplace with na like value for StringArray
5 participants