PERF: Indexes with arrow arrays convert the arrays to numpy object arrays, and that's very slow #50121

Open
@ivirshup

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd, numpy as np, pyarrow as pa

N = 1_000_000

Construction is fine

%%time
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))
CPU times: user 17.9 ms, sys: 10.7 ms, total: 28.6 ms
Wall time: 26.3 ms

is_unique is very slow

%%time
index.is_unique
CPU times: user 2.72 s, sys: 38.2 ms, total: 2.76 s
Wall time: 2.76 s

True

Checking out why:

# Constructing again to reset cache
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))

%%prun -s cumtime
index.is_unique
         10000055 function calls (10000054 primitive calls) in 4.515 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.515    4.515 {built-in method builtins.exec}
        1    0.000    0.000    4.515    4.515 <string>:1(<module>)
        1    0.086    0.086    4.515    4.515 base.py:2382(is_unique)
        1    0.000    0.000    4.428    4.428 base.py:882(_engine)
        1    0.000    0.000    4.428    4.428 base.py:5159(_get_engine_target)
        1    0.000    0.000    4.428    4.428 base.py:579(astype)
        1    0.181    0.181    4.428    4.428 {built-in method numpy.array}
  1000001    0.507    0.000    4.247    0.000 base.py:415(__iter__)
  1000000    1.692    0.000    3.740    0.000 array.py:285(__getitem__)
  1000000    0.952    0.000    1.282    0.000 utils.py:430(check_array_indexer)

It's all in the construction of index._engine, whose values are:

index._engine.values
array([b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00',
       b'\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00',
       ..., b'z\x84\x1e\x00\x00\x00\x00\x00{\x84\x1e\x00\x00\x00\x00\x00',
       b'|\x84\x1e\x00\x00\x00\x00\x00}\x84\x1e\x00\x00\x00\x00\x00',
       b'~\x84\x1e\x00\x00\x00\x00\x00\x7f\x84\x1e\x00\x00\x00\x00\x00'],
      dtype=object)

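A NumPy array of Python objects. Per the profile above, it is built one element at a time through __iter__ and __getitem__, which is why its construction is very slow.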

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Jun 21 20:50:28 PDT 2022; root:xnu-7195.141.32~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2021.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : 0.29.28
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.4
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.11.0
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : 0.8.9
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : 2022.1

Prior Performance

Not exactly prior performance, but the equivalent check done directly in pyarrow is roughly 18x faster (151 ms vs. 2.76 s):

%time len(pa_bytes) == len(pa_bytes.unique())
CPU times: user 142 ms, sys: 10.2 ms, total: 152 ms
Wall time: 151 ms

True

Labels: Arrow (pyarrow functionality), Closing Candidate (may be closeable, needs more eyeballs), Performance (memory or execution speed performance)