PERF: Regression with indexing with ExtensionEngine #45652

Closed
3 tasks done
dalejung opened this issue Jan 27, 2022 · 2 comments · Fixed by #46349 or #48472
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance

@dalejung
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from itertools import permutations, chain
import string

string_values = chain(
    string.ascii_uppercase,
    permutations(string.ascii_uppercase, 2),
    permutations(string.ascii_uppercase, 3),
)

string_index = pd.Index(map(''.join, string_values)).astype('string')

df = pd.DataFrame({'ints': range(len(string_index))}, index=string_index)

subset_index = string_index[string_index.str.startswith('A')]

# %time slow_result = df.loc[subset_index]
# CPU times: user 1.93 s, sys: 0 ns, total: 1.93 s
# Wall time: 1.93 s
slow_result = df.loc[subset_index]

# %time fast_result = df.loc[subset_index.values]
# CPU times: user 2.85 ms, sys: 0 ns, total: 2.85 ms
# Wall time: 2.82 ms
fast_result = df.loc[subset_index.values]

# results are the same.
pd.testing.assert_frame_equal(slow_result, fast_result)

# Old object indexes don't have this issue.
object_index_df = df.copy()
object_index_df.index = object_index_df.index.astype(object)

# %time obj_result = object_index_df.loc[subset_index]
# CPU times: user 945 µs, sys: 19 µs, total: 964 µs
# Wall time: 939 µs
obj_result = object_index_df.loc[subset_index]

This only happens when indexing with dtype='string' on both the index and the indexer. Note here that df.index and string_index are both dtype='string'. What is odd is that just accessing .values or converting to a list makes indexing fast again. Old object indexes don't have the same issue.
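To compare the three access paths outside IPython, here is a minimal sketch. It assumes a smaller index than the one in the report so that it finishes quickly, and `time_call` is a hypothetical helper, not part of pandas. The absolute timings depend on the pandas build, so none are shown.

```python
import string
import time
from itertools import chain, permutations

import pandas as pd


def time_call(fn):
    """Run fn once and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Smaller than the original report: one- and two-letter labels only.
labels = chain(string.ascii_uppercase, permutations(string.ascii_uppercase, 2))
string_index = pd.Index(["".join(label) for label in labels]).astype("string")
df = pd.DataFrame({"ints": range(len(string_index))}, index=string_index)
subset_index = string_index[string_index.str.startswith("A")]

# Same selection, three kinds of key: Index[string], its .values, a plain list.
by_index, t_index = time_call(lambda: df.loc[subset_index])
by_values, t_values = time_call(lambda: df.loc[subset_index.values])
by_list, t_list = time_call(lambda: df.loc[subset_index.tolist()])

# All three keys select the same rows.
assert by_index["ints"].tolist() == by_values["ints"].tolist()
assert by_index["ints"].tolist() == by_list["ints"].tolist()

print(f"Index key:   {t_index:.6f} s")
print(f".values key: {t_values:.6f} s")
print(f"list key:    {t_list:.6f} s")
```

A single run is noisy; for a real measurement, repeating each call (for example with `timeit`) would be more reliable.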

Installed Versions

INSTALLED VERSIONS

commit : c5ff649
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.16.0-arch1-1
Version : #1 SMP PREEMPT Mon, 10 Jan 2022 20:11:47 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+193.gc5ff649b11
numpy : 1.23.0.dev0+512.g6077afd65
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.26
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.55.0dev0+1077.g0994f97c3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0.dev587+g458271315
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.0b1
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

@dalejung dalejung added Needs Triage Issue that has not been reviewed by a pandas team member Performance Memory or execution speed performance labels Jan 27, 2022
@rhshadrach rhshadrach added Strings String extension data type and string data Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 27, 2022
@jorisvandenbossche
Member

Thanks for the report!

I think this is a regression on the dev branch only, caused by #45514 (I commented there with a somewhat similar example, but using Int64 index instead of string).

What is odd is that just accessing .values or converting to a list will make indexing fast again. Old object indexes don't have the same issue.

That actually seems to be because of a bug: in this case it is still using the object-dtype based Index/engine, and therefore doesn't show the performance regression.
Specifically here:

def _get_indexer_strict(self, key, axis_name: str_t) -> tuple[Index, np.ndarray]:
    """
    Analogue to get_indexer that raises if any elements are missing.
    """
    keyarr = key
    if not isinstance(keyarr, Index):
        keyarr = com.asarray_tuplesafe(keyarr)

    if self._index_as_unique:
        indexer = self.get_indexer_for(keyarr)
        keyarr = self.reindex(keyarr)[0]

If the key is not an Index object (which is the case for df.loc[subset_index.values], since subset_index.values gives a StringArray), asarray_tuplesafe converts it to an object-dtype array, which shouldn't happen. That array is then turned into an object-dtype index, and when the calling Index[string] is combined with the passed Index[object], both are cast to the common object dtype. As a result, the lookup goes through the object-dtype Index's implementation.
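The dtype change described here can be observed with public API only. This is a rough sketch, not the internal code path: `np.asarray` stands in for the internal `asarray_tuplesafe` helper as an assumption, on the basis that both hand back a plain NumPy array for a string extension array.

```python
import numpy as np
import pandas as pd

# A small string extension array, like what subset_index.values returns.
string_array = pd.array(["AB", "AC", "AD"], dtype="string")

# Assumption: np.asarray approximates what the internal helper does to a
# non-Index key. NumPy has no matching string extension dtype, so the
# elements end up in an object-dtype ndarray.
as_ndarray = np.asarray(string_array)

print(string_array.dtype)  # the extension dtype of the original key
print(as_ndarray.dtype)    # the dtype after conversion to a NumPy array
```

Once the key is object-dtype, the extension dtype information is gone, which is consistent with the lookup taking the object-dtype path described above.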

@jorisvandenbossche jorisvandenbossche added this to the 1.5 milestone Feb 2, 2022
@jorisvandenbossche jorisvandenbossche removed the Needs Triage Issue that has not been reviewed by a pandas team member label Mar 23, 2022
@jorisvandenbossche
Member

Reopening this, since #46349 only added a band-aid specifically for strings, and as I mentioned above, this performance regression also applies to other uses of ExtensionEngine (e.g. Int64).
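A comparable check for a nullable-integer index might look like the sketch below. It is not the example posted in #45514; the sizes and the column name `vals` are arbitrary assumptions, and whether the two paths differ in speed depends on the pandas build.

```python
import time

import pandas as pd

n = 10_000  # arbitrary size, chosen only to keep the sketch quick
int_index = pd.Index(pd.array(list(range(n)), dtype="Int64"))
df = pd.DataFrame({"vals": range(n)}, index=int_index)
subset = int_index[:100]

# Key is an Index with the nullable Int64 extension dtype.
start = time.perf_counter()
via_index = df.loc[subset]
t_index = time.perf_counter() - start

# Key is a plain Python list of the same labels.
start = time.perf_counter()
via_list = df.loc[subset.tolist()]
t_list = time.perf_counter() - start

# Both keys select the same rows.
assert via_index["vals"].tolist() == via_list["vals"].tolist()

print(f"Index[Int64] key: {t_index:.6f} s")
print(f"list key:         {t_list:.6f} s")
```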

@jorisvandenbossche jorisvandenbossche removed the Strings String extension data type and string data label Mar 23, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 15, 2022
@jorisvandenbossche jorisvandenbossche added this to the 1.5 milestone Sep 8, 2022