BUG: Inconsistency in DataFrame.where between inplace and not inplace with na like value for StringArray #46512

Open
3 tasks done
simonjayhawkins opened this issue Mar 25, 2022 · 2 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Strings String extension data type and string data

Comments

@simonjayhawkins
Member

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

print(pd.__version__)
df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")

# inplace=False: raises on 1.4.1, so catch and print the error
try:
    result = df.where(df != "", np.nan)
    arr = result["A"]._values
    print(arr)
    print(type(arr[1]))
except Exception as e:
    print(e)

# inplace=True: modifies df directly
df.where(df != "", np.nan, inplace=True)
print(df)
arr = df["A"]._values
print(arr)
print(type(arr[1]))

Issue Description

Code sample based on #46366. Output on 1.4.1, followed by output on main:

1.4.1
StringArray requires a sequence of strings or pandas.NA
     A
0    1
1  NaN
2    3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>
1.5.0.dev0+595.gf99ec8bf80
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
     A
0    1
1  NaN
2    3
<StringArray>
['1', nan, '3']
Length: 3, dtype: string
<class 'float'>

Expected Behavior

The behavior for the inplace=False case has changed from 1.4.1 to main, since #45168 allows other NA values in the StringArray constructor.

Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.
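The consistency expectation above can be written as a small check. This is a minimal sketch, not part of the pandas test suite: `where_results` and `element_types` are hypothetical helper names, and the public `.array` accessor is used instead of `_values`.

```python
import numpy as np
import pandas as pd


def where_results(other=np.nan):
    """Return (non-inplace result, inplace result) for the example frame."""
    df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
    mask = df != ""
    # inplace=False returns a new frame and leaves df untouched
    out_of_place = df.where(mask, other)
    # inplace=True modifies the frame it is called on, so work on a copy
    in_place = df.copy()
    in_place.where(mask, other, inplace=True)
    return out_of_place, in_place


def element_types(frame):
    """Names of the Python types stored in column "A" of the frame."""
    return [type(x).__name__ for x in frame["A"].array]


out_of_place, in_place = where_results()
print(element_types(out_of_place))
print(element_types(in_place))
```

If the two printed lists differ, the inplace and non-inplace paths store different missing-value objects, which is the inconsistency reported here.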

Installed Versions

.

@simonjayhawkins simonjayhawkins added Bug Needs Triage Issue that has not been reviewed by a pandas team member Strings String extension data type and string data ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 25, 2022
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jul 6, 2022
@simonjayhawkins
Member Author

simonjayhawkins commented Jul 20, 2022

The behavior on main has changed since this issue was opened; see #47793 (comment).

1.5.0.dev0+1176.gf7e0e68f34
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>
      A
0     1
1  <NA>
2     3
<StringArray>
['1', <NA>, '3']
Length: 3, dtype: string
<class 'pandas._libs.missing.NAType'>

The underlying StringArray is now correct in the sense that the array elements are only string values or pd.NA.
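The invariant described here (every element is either a `str` or `pd.NA`) can be tested directly. A minimal sketch, assuming a hypothetical helper named `only_str_or_na`; the object-dtype NumPy array is included only as a contrast case that holds a float `nan`.

```python
import numpy as np
import pandas as pd


def only_str_or_na(values):
    """True if every element is a str or pd.NA (hypothetical helper)."""
    return all(isinstance(x, str) or x is pd.NA for x in values)


# The string-dtype constructor converts np.nan to pd.NA
string_arr = pd.array(["1", np.nan, "3"], dtype="string")
# An object-dtype NumPy array keeps np.nan as a plain float
object_arr = np.array(["1", np.nan, "3"], dtype=object)

print(only_str_or_na(string_arr))
print(only_str_or_na(object_arr))
```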

I'll bisect to confirm where fixed, but assuming #47763

> Whether this is correct for the DataFrame.where case may need discussion. Either way, the results for the inplace=True case look incorrect to me and should be consistent with the inplace=False case.

So we just need to confirm here whether DataFrame.where should treat np.nan as a missing value indicator (the current behavior on main), or whether np.nan should be considered an explicit assignment, in which case the result should be object dtype (a StringArray cannot hold float values, and np.nan is a float).
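The two options can be compared side by side. This is only an illustration: the second option is emulated by casting to object dtype before calling `where`, which is an assumption made for this sketch and not a proposal for how pandas would implement it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["1", "", "3"]}, dtype="string")
mask = df != ""

# Option 1: np.nan is treated as a missing-value indicator,
# so the column keeps its string dtype
as_missing = df.where(mask, np.nan)

# Option 2 (emulated): np.nan is an explicit float value, which a
# StringArray cannot hold, so the data is cast to object dtype first
as_explicit = df.astype(object).where(mask, np.nan)

print(as_missing["A"].dtype, type(as_missing["A"].iloc[1]))
print(as_explicit["A"].dtype, type(as_explicit["A"].iloc[1]))
```

Under the first option the replaced element is `pd.NA`; under the emulated second option it is a Python float stored in an object column.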

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 20, 2022
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jul 20, 2022
@simonjayhawkins
Member Author

> I'll bisect to confirm where fixed, but assuming #47763

Can confirm: fixed in commit [1b1dd36] BUG: fix regression in Series[string] setitem setting a scalar with a mask (#47763)
