PERF: GroupBy.mean orders of magnitude slower for pyarrow dtypes. (2.0.3) #54208

Closed
@randolf-scholz

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

This does not seem to be reproducible for everyone, but it happened to me on two separate machines as well as in Google Colab.

If anyone can run the example and report whether it reproduces, that would be great.

import numpy as np
import pandas as pd

# N rows by M columns; values above tol are replaced with NaN
M, N = 10, 10_000
tol = 0.5

y = np.random.rand(N, M)
y[y > tol] = float("nan")

# Same data in two backends: pyarrow and numpy-nullable
df_arrow = pd.DataFrame(y, dtype="float32[pyarrow]")
df_arrow.index.name = "time"
df_numpy = df_arrow.convert_dtypes(dtype_backend="numpy_nullable")

# IPython timings of the same groupby on each backend
%timeit df_numpy.groupby("time").mean();  # 4.02 ms ± 897 µs
%timeit df_arrow.groupby("time").mean();  # 10.3 s ± 661 ms

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.10.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.109+
Version          : #1 SMP Fri Jun 9 10:57:30 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.22.4
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 0.29.36
pytest           : 7.2.2
hypothesis       : None
sphinx           : 3.5.4
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.3
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.6
jinja2           : 3.1.2
IPython          : 7.34.0
pandas_datareader: 0.10.0
bs4              : 4.11.2
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.6.0
gcsfs            : 2023.6.0
matplotlib       : 3.7.1
numba            : 0.56.4
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : 0.17.9
pyarrow          : 12.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : None
sqlalchemy       : 2.0.18
tables           : 3.8.0
tabulate         : 0.8.10
xarray           : 2022.12.0
xlrd             : 2.0.1
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

Prior Performance

No response

Labels

  • Needs Triage: Issue that has not been reviewed by a pandas team member

  • Performance: Memory or execution speed performance