BUG: Casting to float32 + groupby leads to different results starting from v1.5.0 #53663

Closed
@MrYugs

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

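import pandas as pd

df = pd.read_csv("data.csv")
# mean_0: cast to float32 first, then group and aggregate
mean_0 = df["foo"].astype("float32").groupby(df["week"]).mean()
# mean_1: no cast, aggregate the column as read from the CSV
mean_1 = df["foo"].groupby(df["week"]).mean()
# mean_2: aggregate first, then cast the result to float32
mean_2 = df["foo"].groupby(df["week"]).mean().astype("float32")
# Print the mean of group "W10" with 13 decimal places
print("%.13f" % mean_0["W10"])
print("%.13f" % mean_1["W10"])
print("%.13f" % mean_2["W10"])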

Issue Description

Combining a cast to float32 with a groupby + mean aggregation gives different results depending on the pandas version. I first noticed the change with v2.0.1, then tracked it down to v1.5.0.

Only mean_0 from the example changes between v1.4.4 and v1.5.0:

Mean      pandas <= 1.4.4      pandas >= 1.5.0
mean_0    288.0459899902344    288.0460205078125
mean_1    288.0459992041385    288.0459992041385
mean_2    288.0459899902344    288.0459899902344

Results seem to differ depending on when the casting is done. I would like to understand why (maybe the loc/iloc change introduced in v1.5.0?).
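To illustrate why the timing of the cast can matter at all, here is a minimal sketch using only the standard library. It does not reproduce pandas internals and it does not use data.csv: the helper names (to_f32, mean_accumulate_f32, mean_accumulate_f64) and the synthetic input are assumptions made up for this example. It compares a running sum that is rounded to float32 after every addition with a running sum kept in float64 and rounded to float32 only at the end. Whether this is what changed between v1.4.4 and v1.5.0 is not established here.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mean_accumulate_f32(values):
    """Mean with a float32 running sum: round after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + v)
    return to_f32(total / len(values))


def mean_accumulate_f64(values):
    """Mean with a float64 running sum, rounded to float32 only at the end."""
    return to_f32(sum(values) / len(values))


# Synthetic input (hypothetical, not the data from this report):
# one million copies of 0.1, each already stored as a float32 value.
values = [to_f32(0.1)] * 1_000_000

early_rounding = mean_accumulate_f32(values)
late_rounding = mean_accumulate_f64(values)

print("%.13f" % early_rounding)
print("%.13f" % late_rounding)
```

The input is chosen to exaggerate the effect: the float32 accumulator drifts visibly, while the float64 accumulator recovers the float32 value of 0.1. In the table above the gap between the two mean_0 values is far smaller, about one float32 step near 288.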

Here is the input file to reproduce the behavior: data.csv

Expected Behavior

I expect the results to be the same, or at least to understand where the difference comes from.

Installed Versions

Pandas 1.4 environment:

INSTALLED VERSIONS
------------------
commit : ca60aab
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-42-generic
Version : #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.4
numpy : 1.22.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

Pandas 1.5 environment:

INSTALLED VERSIONS
------------------
commit : 87cfe4e
python : 3.9.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-42-generic
Version : #43~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 21 16:51:08 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.22.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.5.0
gcsfs : 2022.5.0
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.36
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2023.3
