BUG: groupby sum on string column with NA returns results with inconsistent types #53568
Comments
I think it is actually the other way around? For me, we shouldn't cast the 0 to a string, as the string "0" doesn't make much sense (if we want to have a string here, at least it should be an empty string instead of "0"?) The 0 result is mostly a side-effect of how sum is implemented for numeric data: 0 as the starting value, and then if the array is empty (after skipping all NAs), you return this starting value. Note that this is consistent with the non-grouped (numpy-based) sum:
(for the string extension dtype, we raise an error for "sum")
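To make the "starting value" explanation concrete, here is a minimal pure-Python sketch of a skip-NA sum. It is not pandas' actual implementation; `is_na`, `sum_skipna`, and the `empty_result` parameter are hypothetical names used only for illustration. It shows how an all-NA group ends up with the integer `0` while other groups return strings, and how an empty-string fallback would keep the types consistent.

```python
from functools import reduce
import operator


def is_na(value):
    """Treat None and NaN as missing (hypothetical helper, not a pandas API)."""
    return value is None or value != value  # NaN is the only value unequal to itself


def sum_skipna(values, empty_result=0):
    """Sum values after dropping NAs; fall back to empty_result if nothing is left."""
    remaining = [v for v in values if not is_na(v)]
    if not remaining:
        # Nothing left to add, so the starting value leaks out as the result.
        return empty_result
    return reduce(operator.add, remaining)


# Two groups of an object-like string column: one with data, one all-NA.
groups = {"x": ["a", "b"], "y": [None, None]}

# Default fallback of 0: the result mixes str and int.
mixed = {key: sum_skipna(vals) for key, vals in groups.items()}

# Assumed alternative: use "" as the fallback for string data.
consistent = {key: sum_skipna(vals, empty_result="") for key, vals in groups.items()}

print({key: type(val).__name__ for key, val in mixed.items()})
print({key: type(val).__name__ for key, val in consistent.items()})
```

With the default fallback the group results have different types, which matches the inconsistency described in this issue; passing `""` is one possible choice for string data, not a decision the maintainers have made here.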
Ah, I think my misunderstanding originates from assuming that pandas coerced the
So the types not matching isn't a bug after all, since the two dataframes aren't equivalent.
I agree that for strings an empty string probably makes more sense than
I think this is definitively a bug for string EAs; e.g.
The For object dtype, there doesn't seem to me to be a good value. If we agree
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Currently this prints:
Notably if you create the dataframe like this:
I get back 0 as a string when I do the groupby('A').sum(), as expected.
Expected Behavior
I would expect the test script to print:
Installed Versions
INSTALLED VERSIONS
commit : 965ceca
python : 3.10.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-42-generic
Version : #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.2
numpy : 1.25.0rc1+93.g95343a3e6
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.1.2
Cython : 0.29.35
pytest : 7.3.0
hypothesis : 6.68.2
sphinx : 6.1.3
blosc : 1.11.1
feather : None
xlsxwriter : 3.0.9
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : 1.3.7
brotli :
fastparquet : 2023.2.0
fsspec : 2023.4.0
gcsfs : 2023.4.0
matplotlib : 3.6.3
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : 1.2.1
pyxlsb : 1.0.10
s3fs : 2023.4.0
scipy : 1.10.1
snappy :
sqlalchemy : 2.0.7
tables : 3.8.0
tabulate : 0.9.0
xarray : 2023.4.2
xlrd : 2.0.1
zstandard : 0.21.0
tzdata : 2022.7
qtpy : None
pyqt5 : None