
BUG: groupby sum on string column with NA returns results with inconsistent types #53568


Open
3 tasks done
ngoldbaum opened this issue Jun 8, 2023 · 4 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action

Comments

@ngoldbaum
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2, 3, 3, 1, 4],
        "B": ['B', 'B', pd.NA, pd.NA, 'A', 'A', 'B', 'C'],
        "C": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
print(df.groupby("A").sum()['B'][2])
print(type(df.groupby("A").sum()['B'][2]))
print(type(df.groupby("A").sum()['B'][3]))

Issue Description

Currently this prints:

0
<class 'int'>
<class 'str'>

Notably if you create the dataframe like this:

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2, 3, 3, 1, 4],
        "B": pd.array(['B', 'B', pd.NA, pd.NA, 'A', 'A', 'B', 'C'], dtype='string[python]'),
        "C": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)

I get back '0' as a string when I do the groupby('A').sum(), as expected.

Expected Behavior

I would expect the test script to print:

'0'
<class 'str'>
<class 'str'>

Installed Versions

INSTALLED VERSIONS

commit : 965ceca
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-42-generic
Version : #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.2
numpy : 1.25.0rc1+93.g95343a3e6
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.1.2
Cython : 0.29.35
pytest : 7.3.0
hypothesis : 6.68.2
sphinx : 6.1.3
blosc : 1.11.1
feather : None
xlsxwriter : 3.0.9
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.2.0
fsspec : 2023.4.0
gcsfs : 2023.4.0
matplotlib : 3.6.3
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.4.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.7
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.4.2
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2022.7
qtpy : None
pyqt5 : None

@ngoldbaum ngoldbaum added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 8, 2023
@jorisvandenbossche
Member

I think it is actually the other way around? For me, we shouldn't cast the 0 to a string, as the string "0" doesn't make much sense (if we want to have a string here, at least it should be an empty string instead of "0"?)

The 0 result is mostly a side-effect of how sum is implemented for numeric data: 0 as the starting value, and then if the array is empty (after skipping all NAs), you return this starting value.
But this gets a bit strange for object dtype / strings. If you know you have strings, an empty string is probably the better starting value. Or for object dtype (which can contain anything), maybe a better return value would be NA? (although that would be inconsistent with the default min_count).

Note that this is consistent with the non-grouped (numpy-based) sum:

In [4]: pd.Series(["a", pd.NA])[1:].sum()
Out[4]: 0

In [5]: type(_)
Out[5]: int

(for the string extension dtype, we raise an error for "sum")
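The starting-value behavior described above also interacts with min_count on the non-grouped sum; a minimal sketch (not from the original thread, assuming pandas 2.x behavior):

```python
import pandas as pd

# An object-dtype slice containing only NA values sums to the numeric
# starting value 0 by default.
s = pd.Series(["a", pd.NA])  # inferred dtype: object
print(s[1:].sum())

# min_count is the required number of valid values; with min_count=1 the
# all-NA sum becomes NaN instead of 0.
print(s[1:].sum(min_count=1))
```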

@ngoldbaum
Contributor Author

Ah, I think my misunderstanding originates from assuming that pandas coerced the 'B' column to a string array, but indeed:

In [2]: df['B']
Out[2]: 
0       B
1       B
2    <NA>
3    <NA>
4       A
5       A
6       B
7       C
Name: B, dtype: object

So the types not matching isn't a bug after all, since the two dataframes aren't equivalent.

But this gets a bit strange for object dtype / strings. If you know you have strings, an empty string is probably the better starting value. Or for object dtype (which can contain anything), maybe a better return value would be NA? (although that would be inconsistent with the default min_count).

I agree that for strings an empty string probably makes more sense than '0'. I could also see making both object and string consistent and using NA.
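The dtype difference behind the confusion above can be checked directly: a plain Python list mixing str and pd.NA is inferred as object dtype, and the string extension dtype is only used when requested explicitly. A minimal sketch (not from the original thread):

```python
import pandas as pd

# A list of str mixed with pd.NA infers object dtype, not the string
# extension dtype, so the two repro DataFrames hold different dtypes.
s_obj = pd.Series(["B", pd.NA])
s_str = pd.Series(["B", pd.NA], dtype="string[python]")
print(s_obj.dtype)
print(s_str.dtype)
```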

@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 12, 2023
@rhshadrach
Member

I think this is definitively a bug for string EAs; e.g.

df = pd.DataFrame(
    {
        "A": [1, 1, 2, 2, 3, 3, 1, 4],
        "B": pd.Series(['B', 'B', pd.NA, pd.NA, 'A', 'A', 'B', 'C'], dtype="string[python]"),
        "C": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
print(df.groupby("A").sum())
#      B  C
# A        
# 1  BBB  3
# 2    0  2
# 3   AA  2
# 4    C  1

The 0 above should be the empty string.

For object dtype, there doesn't seem to me to be a good value. If we agree pd.NA is not appropriate because of min_count, I don't see any better alternative than 0.

@sfc-gh-mvashishtha

Or for object dtype (which can contain anything), maybe a better return value would be NA? (although that would be inconsistent with the default min_count).

min_count is documented as "The required number of valid values to perform the operation." I don't think it's appropriate to claim that 0 is the sum of an empty list of objects of object dtype, regardless of min_count. In the most common case, I think the column would actually contain str objects, so the user ends up with a perplexing 0 in the groupby result for rows where the values are all null, and some kind of string in the other rows. I think the only appropriate sum is a null of some kind. Maybe we could change the documentation of min_count to say that it only applies to numeric dtypes, in which case if there are no non-null elements to sum, the result is 0.
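For comparison, the documented min_count behavior on a nullable numeric dtype, where it is uncontroversial; a minimal sketch (not from the original thread):

```python
import pandas as pd

# With the default min_count=0, an all-NA numeric sum is 0; once min_count
# exceeds the number of valid values, the result is NA instead.
s = pd.Series([pd.NA, pd.NA], dtype="Float64")
print(s.sum())
print(s.sum(min_count=1))
```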
