
PERF: Indexes with arrow arrays convert the arrays to numpy object arrays, and that's very slow #50121

Open · ivirshup opened this issue Dec 7, 2022 · 5 comments
Labels: Arrow (pyarrow functionality), Closing Candidate (may be closeable, needs more eyeballs), Performance (memory or execution speed performance)

Comments


ivirshup commented Dec 7, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd, numpy as np, pyarrow as pa

N = 1_000_000

Construction is fine

%%time
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))
CPU times: user 17.9 ms, sys: 10.7 ms, total: 28.6 ms
Wall time: 26.3 ms

is_unique is very slow

%%time
index.is_unique
CPU times: user 2.72 s, sys: 38.2 ms, total: 2.76 s
Wall time: 2.76 s

True

Checking out why:

# Constructing again to reset cache
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))
%%prun -s cumtime
index.is_unique
         10000055 function calls (10000054 primitive calls) in 4.515 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.515    4.515 {built-in method builtins.exec}
        1    0.000    0.000    4.515    4.515 <string>:1(<module>)
        1    0.086    0.086    4.515    4.515 base.py:2382(is_unique)
        1    0.000    0.000    4.428    4.428 base.py:882(_engine)
        1    0.000    0.000    4.428    4.428 base.py:5159(_get_engine_target)
        1    0.000    0.000    4.428    4.428 base.py:579(astype)
        1    0.181    0.181    4.428    4.428 {built-in method numpy.array}
  1000001    0.507    0.000    4.247    0.000 base.py:415(__iter__)
  1000000    1.692    0.000    3.740    0.000 array.py:285(__getitem__)
  1000000    0.952    0.000    1.282    0.000 utils.py:430(check_array_indexer)

The time is all in the construction of index._engine, whose values are:

index._engine.values
array([b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00',
       b'\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00',
       b'\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00',
       ..., b'z\x84\x1e\x00\x00\x00\x00\x00{\x84\x1e\x00\x00\x00\x00\x00',
       b'|\x84\x1e\x00\x00\x00\x00\x00}\x84\x1e\x00\x00\x00\x00\x00',
       b'~\x84\x1e\x00\x00\x00\x00\x00\x7f\x84\x1e\x00\x00\x00\x00\x00'],
      dtype=object)

That is a numpy array of Python objects, and building it element by element is very slow.
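
As a rough illustration of where the time goes (a sketch against observed pandas 1.5.x behaviour, not the exact internal code path), the per-element Python conversion implied by the object-dtype engine target can be compared with a single bulk conversion done by pyarrow itself:

from timeit import timeit
import numpy as np, pandas as pd, pyarrow as pa

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))

# Per-element Python loop: each value is pulled out of the Arrow array one by
# one, roughly what building the object-dtype engine target costs in 1.5.x.
t_loop = timeit(lambda: np.array(list(index), dtype=object), number=1)

# Bulk conversion in pyarrow: one call, no Python-level iteration.
t_bulk = timeit(lambda: pa_bytes.to_numpy(zero_copy_only=False), number=1)

print(f"per-element loop: {t_loop:.2f}s, bulk to_numpy: {t_bulk:.2f}s")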

Installed Versions

INSTALLED VERSIONS

commit : 8dab54d
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Jun 21 20:50:28 PDT 2022; root:xnu-7195.141.32~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.2
numpy : 1.23.5
pytz : 2021.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : 0.29.28
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.4
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.11.0
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : 0.8.9
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : 2022.1

Prior Performance

Not exactly prior performance, but the same operation 20x faster:

%time len(pa_bytes) == len(pa_bytes.unique())
CPU times: user 142 ms, sys: 10.2 ms, total: 152 ms
Wall time: 151 ms

True
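
For completeness, pyarrow can answer the same uniqueness question without materializing the unique values (assuming pa.compute.count_distinct is available, as it is in the pyarrow 10.0.1 listed above):

import pyarrow.compute as pc

# Count distinct values directly; equality with the length means all unique.
pc.count_distinct(pa_bytes).as_py() == len(pa_bytes)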
phofl commented Dec 8, 2022

Hi, thanks for your report.

This is known; there is no Arrow-backed index engine yet.

ivirshup commented

I couldn't find an issue specifically for an arrow index engine. Is there one already open that can be tracked?

I did find:

But that's a much more general issue. I could see this as being downstream of an "ArrowIndexEngine" feature?

@phofl
Copy link
Member

phofl commented Dec 13, 2022

Yep we don’t have specific issues for that. What would be helpful: a master issue that lists all places where we have to do something to improve arrow support


ivirshup commented Jul 11, 2023

Update

It looks like performance has considerably improved for this case. Some details are below, but I'd be open to closing the issue. TLDR: it still converts to objects, but it is no longer very slow.

(1) I think it's a special case for the example provided.
(2) It could definitely be faster, but at least the per-element iteration over Python objects seems to be gone for this case.

import pandas as pd, numpy as np, pyarrow as pa

N = 1_000_000

np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))

%%time
np_index = pd.Index(np_bytes)
# CPU times: user 36 ms, sys: 13.8 ms, total: 49.8 ms
# Wall time: 49.8 ms

%%time
pa_index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))
# CPU times: user 173 µs, sys: 73 µs, total: 246 µs
# Wall time: 198 µs

%%time
np_index.is_unique
# CPU times: user 102 ms, sys: 7.17 ms, total: 109 ms
# Wall time: 110 ms
# Out[6]: True


%%time
pa_index.is_unique
# CPU times: user 172 ms, sys: 28.4 ms, total: 201 ms
# Wall time: 201 ms
# Out[7]: True

It looks to me like the arrow type is being converted to a numpy type, which is then computed on. It also looks like this happens for strings (not shown). So I believe there is still a conversion to object dtype, but at least the speed is comparable, since the array isn't being iterated over in Python.

In [10]: %%prun -s cumtime
    ...: pa_index.is_unique

         132 function calls in 0.198 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.198    0.198 {built-in method builtins.exec}
        1    0.000    0.000    0.198    0.198 <string>:1(<module>)
        1    0.123    0.123    0.198    0.198 base.py:2205(is_unique)
        1    0.000    0.000    0.076    0.076 base.py:820(_engine)
        1    0.000    0.000    0.076    0.076 base.py:4963(_get_engine_target)
        1    0.000    0.000    0.076    0.076 base.py:551(astype)
        1    0.013    0.013    0.076    0.076 {built-in method numpy.array}
        1    0.000    0.000    0.062    0.062 array.py:435(__array__)
        1    0.000    0.000    0.062    0.062 array.py:1044(to_numpy)
        1    0.062    0.062    0.062    0.062 {built-in method numpy.asarray}
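
Per the profile, the conversion happens inside Index._get_engine_target() via astype/to_numpy. A quick diagnostic sketch to confirm the engine target is object dtype, reusing the variables from the example above (_get_engine_target is a private method whose name is taken from the profile and may change between versions):

# _get_engine_target() is private (name from the profile above); it should
# hand the engine an object-dtype ndarray for this index.
target = pa_index._get_engine_target()
print(type(target), target.dtype)   # expect: <class 'numpy.ndarray'> object

# np.asarray goes through ArrowExtensionArray.__array__/to_numpy, the same
# bulk conversion the profile shows, and also lands on object dtype.
print(np.asarray(pa_index).dtype)   # object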

I can't think of a good reason why an object array should be faster than an arrow array, which makes me think there is still some performance being left on the table here. However, pyarrow itself doesn't do better here:

PS: I suspect there is a fair amount of thermal throttling happening during these measurements (summer + zoom call), so don't compare them directly to the numbers above, but the broad relationships reproduce.

pa_strings = pa.array([f"cell_{i}" for i in range(N)], type=pa.string())
np_strings = pa_strings.to_numpy(zero_copy_only=False)

%timeit pd.Index(pd.arrays.ArrowExtensionArray(pa_strings)).is_unique
# 312 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pa.compute.unique(pa_strings)
# 170 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.Index(np_strings).is_unique
# 156 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

ivirshup commented

Also interesting: .is_unique is faster for an index of Python bytes objects than for an index of fixed-length bytes (numpy S16):

import pandas as pd, numpy as np

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
np_objects = np_bytes.astype(object)

pd.Index(np_bytes).is_unique
pd.Index(np_objects).is_unique

%timeit pd.Index(np_bytes).is_unique
# 152 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.Index(np_objects).is_unique
# 89 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I believe this is because the S16 values are also being converted to Python bytes objects along the way, so the object index just skips that step.
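
As a rough check on that hypothesis (a sketch, not a measurement of the actual engine code path), the S16-to-object conversion can be timed on its own and compared with the gap between the two timings above:

from timeit import timeit
import numpy as np

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]

# Converting the fixed-width S16 values to Python bytes objects is the extra
# step the S16-backed index would pay (per the hypothesis above) before hashing.
print(timeit(lambda: np_bytes.astype(object), number=10) / 10)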
