Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
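import pandas as pd

df = pd.read_csv("data.csv")

# Cast to float32 first, then group and aggregate.
mean_0 = df["foo"].astype("float32").groupby(df["week"]).mean()
# Group and aggregate in the dtype inferred by read_csv.
mean_1 = df["foo"].groupby(df["week"]).mean()
# Group and aggregate first, then cast the result to float32.
mean_2 = df["foo"].groupby(df["week"]).mean().astype("float32")

print("%.13f" % mean_0["W10"])
print("%.13f" % mean_1["W10"])
print("%.13f" % mean_2["W10"])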
Issue Description
I get different results when combining a cast to float32 with a groupby + mean aggregation, depending on the pandas version. I first noticed the change with v2.0.1, then tracked it down to v1.5.0.
Of the three results in the example, mean_0 is the only one that changes between v1.4.4 and v1.5.0:
Mean | versions <= Pandas 1.4.4 | versions >= Pandas 1.5.0 |
---|---|---|
mean_0 | 288.0459899902344 | 288.0460205078125 |
mean_1 | 288.0459992041385 | 288.0459992041385 |
mean_2 | 288.0459899902344 | 288.0459899902344 |
Results seem to differ depending on when the casting is done. I would like to understand why (maybe the loc/iloc change introduced in v1.5.0?).
Here is the input file to reproduce the behavior: data.csv
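One possible origin, offered as an assumption rather than a diagnosis of pandas internals: the precision of the intermediate sum. In the table, the v1.4.4 value of mean_0 equals mean_2, which is what you would get if the sum were accumulated in float64 and rounded to float32 only at the end. The sketch below uses plain NumPy and made-up values (not data.csv) to show that the two accumulation strategies can give different float32 means for the same float32 inputs. The function names are hypothetical and are not pandas APIs.

```python
import numpy as np


def mean_float32_accumulator(values):
    """Mean where every partial sum is rounded to float32."""
    total = np.float32(0.0)
    for v in np.asarray(values, dtype=np.float32):
        total = np.float32(total + v)  # rounding happens at each step
    return np.float32(total / np.float32(len(values)))


def mean_float64_accumulator(values):
    """Mean accumulated in float64, rounded to float32 once at the end."""
    as_f32 = np.asarray(values, dtype=np.float32)  # same float32 inputs
    total = as_f32.astype(np.float64).sum()
    return np.float32(total / len(values))


# Illustrative values only: 2**24 is where float32 stops representing
# every integer, so adding 1.0 to it is lost in float32 arithmetic.
values = [16777216.0, 1.0, 1.0]

print(float(mean_float32_accumulator(values)))  # 5592405.5
print(float(mean_float64_accumulator(values)))  # 5592406.0
```

If this is what changed, the two versions would differ only in where the rounding to float32 happens, not in the input data.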
Expected Behavior
I expect the results to be the same, or at least to understand the origin of the difference.
Installed Versions
Pandas 1.4 environment:
pandas : 1.4.4
numpy : 1.22.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Pandas 1.5 environment:
pandas : 1.5.0
numpy : 1.22.4
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.7.2
pip : 23.1.2
Cython : 0.29.30
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.5.0
gcsfs : 2022.5.0
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.36
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2023.3