PERF: Indexes with arrow arrays convert the arrays to numpy object arrays, and that's very slow #50121
Comments
Hi, thanks for your report. This is known, since there is no arrow engine yet.
I couldn't find an issue specifically for an arrow index engine; is there one already open that can be tracked? I did find a related one, but that's a much more general issue. I could see this as being downstream of an "ArrowIndexEngine" feature?
Yep, we don't have specific issues for that. What would be helpful: a master issue that lists all the places where we have to do something to improve arrow support.
Update

It looks like performance has considerably improved for this case. Some details below, but I'd be open to closing the issue. TLDR: still converts to objects, no longer very slow.

(1) I think it's a special case for the example provided:

```python
import pandas as pd, numpy as np, pyarrow as pa

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))
```

```python
%%time
np_index = pd.Index(np_bytes)
# CPU times: user 36 ms, sys: 13.8 ms, total: 49.8 ms
# Wall time: 49.8 ms
```

```python
%%time
pa_index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))
# CPU times: user 173 µs, sys: 73 µs, total: 246 µs
# Wall time: 198 µs
```

```python
%%time
np_index.is_unique
# CPU times: user 102 ms, sys: 7.17 ms, total: 109 ms
# Wall time: 110 ms
# Out[6]: True
```
```python
%%time
pa_index.is_unique
# CPU times: user 172 ms, sys: 28.4 ms, total: 201 ms
# Wall time: 201 ms
# Out[7]: True
```

It looks to me like the arrow type is being converted to a numpy type, which is then computed on. It also looks like this is happening for strings (not shown). So I believe there is a conversion to object type, but at least the speed is comparable, since the array isn't being iterated over in Python.

```
In [10]: %%prun -s cumtime
    ...: pa_index.is_unique

         132 function calls in 0.198 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.198    0.198 {built-in method builtins.exec}
        1    0.000    0.000    0.198    0.198 <string>:1(<module>)
        1    0.123    0.123    0.198    0.198 base.py:2205(is_unique)
        1    0.000    0.000    0.076    0.076 base.py:820(_engine)
        1    0.000    0.000    0.076    0.076 base.py:4963(_get_engine_target)
        1    0.000    0.000    0.076    0.076 base.py:551(astype)
        1    0.013    0.013    0.076    0.076 {built-in method numpy.array}
        1    0.000    0.000    0.062    0.062 array.py:435(__array__)
        1    0.000    0.000    0.062    0.062 array.py:1044(to_numpy)
        1    0.062    0.062    0.062    0.062 {built-in method numpy.asarray}
```

I can't think of a good reason why an object array should be faster than an arrow array, which makes me think there is still some performance being left on the table here. However, pyarrow itself doesn't do better here. (PS: I suspect there is a fair amount of thermal throttling happening during these measurements (summer + Zoom call), so don't compare them directly to the above, but the broad relationships are reproduced.)
```python
pa_strings = pa.array([f"cell_{i}" for i in range(N)], type=pa.string())
np_strings = pa_strings.to_numpy(zero_copy_only=False)

%timeit pd.Index(pd.arrays.ArrowExtensionArray(pa_strings)).is_unique
# 312 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pa.compute.unique(pa_strings)
# 170 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.Index(np_strings).is_unique
# 156 ms ± 870 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
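One way to confirm the conversion seen in the profile is to inspect what the index hands to its engine. A minimal sketch, assuming the data above (`_get_engine_target` is private pandas API, so this is illustrative only):

```python
import pandas as pd
import pyarrow as pa

N = 1_000_000
pa_strings = pa.array([f"cell_{i}" for i in range(N)], type=pa.string())
pa_index = pd.Index(pd.arrays.ArrowExtensionArray(pa_strings))

# What the cython engine actually operates on: for an arrow-backed index
# this materializes a numpy object array (the conversion in the profile).
target = pa_index._get_engine_target()
print(target.dtype)  # object

# An arrow-native uniqueness check that skips the conversion entirely:
print(len(pa.compute.unique(pa_strings)) == len(pa_strings))  # True
```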
Also interesting:

```python
import pandas as pd, numpy as np

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
np_objects = np_bytes.astype(object)

pd.Index(np_bytes).is_unique
pd.Index(np_objects).is_unique

%timeit pd.Index(np_bytes).is_unique
# 152 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.Index(np_objects).is_unique
# 89 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

I believe it's because the bytes are also being copied to objects.
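A small sketch to make that boxing visible; the object-dtype results in the comments are my expectation based on the timings above:

```python
import numpy as np
import pandas as pd

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
np_objects = np_bytes.astype(object)

# pandas has no fixed-width bytes dtype, so both indexes end up
# object-backed; starting from "S16" just adds a bytes -> object copy.
print(pd.Index(np_bytes).dtype)     # object
print(pd.Index(np_objects).dtype)   # object
print(type(pd.Index(np_bytes)[0]))  # <class 'bytes'>
```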
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Construction is fine.
`is_unique` is very slow.
Checking out why: it's all in the construction of `index._engine`, which is built from a numpy array of python objects, whose construction is very slow.
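A minimal sketch of the reproducer, following the example from the update comment above:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

N = 1_000_000
np_bytes = np.arange(N * 2).reshape((N, 2)).view("S16")[:, 0]
pa_bytes = pa.array(np_bytes, type=pa.binary(16))

# Construction is fine: the arrow array is wrapped without conversion.
pa_index = pd.Index(pd.arrays.ArrowExtensionArray(pa_bytes))

# is_unique is very slow: building index._engine first converts the
# arrow array to a numpy array of python objects.
pa_index.is_unique
```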
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d
python : 3.9.12.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Tue Jun 21 20:50:28 PDT 2022; root:xnu-7195.141.32~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2021.3
dateutil : 2.8.2
setuptools : 65.6.3
pip : 22.3.1
Cython : 0.29.28
pytest : 7.2.0
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : 8.7.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.4
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.11.0
scipy : 1.9.3
snappy : None
sqlalchemy : 1.4.35
tables : 3.7.0
tabulate : 0.8.9
xarray : 2022.12.0
xlrd : 1.2.0
xlwt : None
zstandard : None
tzdata : 2022.1
Prior Performance
Not exactly prior performance, but the same operation 20x faster:
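Presumably this refers to the numpy-backed path for the same data; a sketch, reusing `np_bytes` from the reproducer above:

```python
# numpy-backed construction of the same index: the same is_unique check
# ran roughly 20x faster in the original report.
np_index = pd.Index(np_bytes)
np_index.is_unique
```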